Hi, everybody. My name is David. I work at Meta with the BPF folks and on the scheduler, and I want to talk about multi-kfunc sets, which is a horrible name for what's essentially scoping specific kfuncs so that they can only be called from specific struct_ops callbacks. Here's the agenda; I'm going to try to speed-run this a little bit to get us back on track.

To give a little background, struct_ops is a way for BPF programs to implement callbacks that can be invoked from the kernel to implement certain functionality in the kernel. We have, for example, HID-BPF, which we've spoken about, and TCP congestion control. And for sched_ext, the BPF scheduler that we've all been talking about, the BPF program implements callbacks that are invoked from the core scheduler, for example when enqueuing a task, and so on.

Just like the kernel can call into struct_ops, struct_ops programs can call back into the main kernel using kfuncs. On the prior slide we had this example enqueue program, which is, again, the enqueue callback; well, that can call kfuncs. The example I gave here is taking an RCU read lock. Sorry if the text is kind of small, but the point is that you can call back into the main kernel using kfuncs from struct_ops callbacks.

So what's the feature? Basically, kfuncs aren't safe to call in every context. Right now, when you register kfuncs, you can specify which program types those kfuncs can be invoked from, so you can scope kfuncs to only be invoked from struct_ops programs. But you can't specify exactly which callbacks you care about. For example, in sched_ext we have this scx_bpf_dispatch() kfunc. Can you see this, or is it too small? It's good? Okay. I won't go into all the details about what it does now; we'll talk about that in the sched_ext presentation on Wednesday. But basically, it takes a task that was enqueued to the BPF scheduler and puts it onto a dispatch queue, or puts it right onto a CPU. Obviously, from certain callbacks that would make no sense. If a task is waking up, there's this ops.select_cpu callback in the struct_ops where you decide which run queue, which CPU, it should be migrated to before it's enqueued. And it can actually be unsafe to call certain kfuncs, because maybe the kernel is going to set some state, call the struct_ops callback, and expect that state to still be set when BPF calls back into it.

And then, lastly, it gets kind of complicated because kfuncs can nest. You can have a struct_ops callback get invoked by the kernel, and then you invoke a kfunc which itself calls a struct_ops callback. Or a timer interrupt goes off, you get another struct_ops callback called from the timer interrupt, and there you invoke another kfunc. So there's a delicate balance of when you can call kfuncs. You can deadlock yourself if you're not careful, you can have invalid memory accesses; all sorts of things can happen.

So that's the request: we need to figure out a way to restrict kfuncs to specific struct_ops callbacks, in specific contexts. And more generally, if it ends up making sense to do so, ideally we could also specify how kfuncs can be nested.
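To make the setup concrete, here's a minimal sketch of a struct_ops callback calling back into the kernel through a kfunc. It loosely follows the sched_ext examples of the time and assumes a sched_ext-patched kernel's vmlinux.h; the scx_bpf_dispatch() signature and the placeholder dispatch-queue arguments are approximations, not the exact in-tree API.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* kfunc exported by the kernel; signature approximated from the talk */
void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
		      u64 enq_flags) __ksym;

/* The enqueue callback: invoked from the core scheduler, and it calls
 * back into the main kernel via the kfunc above. */
SEC("struct_ops/enqueue")
void BPF_PROG(minimal_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* dsq_id and slice values are placeholders for this sketch */
	scx_bpf_dispatch(p, 0, 0, enq_flags);
}

SEC(".struct_ops")
struct sched_ext_ops minimal_ops = {
	.enqueue = (void *)minimal_enqueue,
	.name    = "minimal",
};

char _license[] SEC("license") = "GPL";
```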
And it would be nice to be able to do that statically, because it really should probably be statically defined: if you can have a circular dependency, then you could potentially deadlock no matter what. So, let me just see here.

Another nice feature, which isn't as pressing but would be nice, is if you could have kfuncs which, in different contexts in a BPF program, actually call out to different functions in the main kernel. I'll again talk about this in the sched_ext talk, but you can dispatch from different contexts: in one of those contexts you can send a task to a different CPU, and in another you can only directly dispatch it to your own CPU. So it might give us a chance to simplify the kernel if we could specify which implementation should do what.

And I think this is really important for struct_ops and kfuncs specifically, because the goal, the direction we're going in for at least this use case of BPF, is that we want to allow developers elsewhere in the kernel, who aren't core BPF developers, to integrate BPF into their subsystem and gracefully use BPF to implement different functionality, instead of having to take uAPI dependencies or whatnot. And right now, any kfunc they add can be called from any of your struct_ops callbacks, which is totally unsafe. So I think it's kind of an important feature.

Okay, so how do we do it? What ideas do I have? Is that visible? I'm going to assume it is. In sched_ext, we defined a set of flags which are statically specified, in a macro, at the point where a struct_ops callback is invoked. In that macro we call a function in sched_ext which checks whether you're actually allowed to be calling a given kfunc in that context. So the kernel says: I'm going to call this enqueue callback, and this is the mask of contexts that kfuncs are allowed to be invoked from; we record that mask from the flags. And then if you call another kfunc later on and you're doing an invalid nesting or something like that, we WARN so that we can figure it out. The implication here is that this is all static, and we should never hit that warning if we're doing things correctly. And this is where we check whether it's allowed or not: we have this mask here, which again corresponds to one of those flags. That's the basic idea.

This is just how we did it to unblock sched_ext from being dependent on this feature, because again, it would be unsafe to use otherwise. We can do other things when we generalize this to BPF more broadly. So yes: we use bits in a mask to track which kfuncs can be invoked, and then in the kfunc, just to be clear, you call this scx_kf_allowed() function and reject the call if you weren't allowed to be called from that context. So for scx_bpf_dispatch_nr_slots(), if you're calling it from the wrong context, you'll currently still be able to make the kfunc call from the BPF program, but we'll reject it there and then kick out the scheduler, because it was doing something it shouldn't be doing. Okay, and this slide... I'm trying to remember why I put this slide in. I don't even remember; I'll look at this later. I'm sorry.
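Here's a rough sketch of the mask scheme just described. It's modeled on the sched_ext patches, but treat the names (the SCX_KF_* flags, scx_ops_error(), the kf_mask field on the task) as approximations rather than the exact in-tree code.

```c
/* One bit per class of callback context. */
enum scx_kf_mask {
	SCX_KF_SELECT_CPU	= 1 << 0,
	SCX_KF_ENQUEUE		= 1 << 1,
	SCX_KF_DISPATCH		= 1 << 2,
};

/* The kernel wraps every struct_ops callback invocation in a macro
 * that records the statically-known context mask on the current task
 * for the duration of the call. */
#define SCX_CALL_OP(mask, op, args...) do {				\
		current->scx.kf_mask |= (mask);				\
		scx_ops.op(args);					\
		current->scx.kf_mask &= ~(mask);			\
	} while (0)

/* Each kfunc then checks that it's running in an allowed context; if
 * it isn't, we reject the call and eject the scheduler. */
static bool scx_kf_allowed(u32 allowed)
{
	if (unlikely(!(current->scx.kf_mask & allowed))) {
		scx_ops_error("kfunc called from an invalid context");
		return false;
	}
	return true;
}

/* e.g. a kfunc that's only meaningful while dispatching */
__bpf_kfunc u32 scx_bpf_dispatch_nr_slots(void)
{
	if (!scx_kf_allowed(SCX_KF_DISPATCH))
		return 0;
	return scx_dsp_max_batch;	/* illustrative return value */
}
```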
So that's the basic feature. Does anybody have any thoughts or questions? This one in the back.

There's a patch set on the list about bpf_sock_destroy. One of the requirements there is to limit a kfunc to a particular expected attach type, so one of the patches adds a filter callback that runs during verification. What that filter does is it gets a pointer to the BPF program, so you can filter not only on the expected attach type but potentially on more things. But what you're suggesting here is more like runtime, tracking it in the current task.

Yeah, tracking it at runtime, which is certainly not preferable, because this is all statically verifiable, so we should be able to do it at load time.

There's the nesting part, though. You need to check whether kfunc A has been called before you allow kfunc A to be called again, and that's something that would need to be checked at runtime, or maybe the verifier could check that too.

Yeah, that's true. I don't know how we would do it; I haven't really thought about how we would represent it such that we could verify nested contexts and whatnot in the verifier. For callbacks or kfuncs that you would never expect to be invoked in a sleepable context, potentially we could do some stuff like that. There are a few kfuncs that are only used in the init callback and the like, which is sleepable; for basically everything else you can expect nesting, because if a timer interrupt goes off and calls into the scheduler, there's a ton of kfuncs we could call from there. So that boundary is probably not too bad. But even within the sleepable boundary there are nesting rules, or rather, there are different contexts; I don't think you can nest there, but there are different contexts where certain kfuncs should be allowed and certain ones shouldn't. I'm not quite following the explanation you gave about specifying the filters, because I haven't seen the code, but I'll take a look at it and see if we could leverage it. It's a good suggestion. Okay. All right. Thanks, everybody. Sorry, quick question, is that one online? No, it's back there.

Maybe this is a "when all you have is a hammer, everything looks like a nail" thing, but since we've been talking about BTF tags a lot today, it just dawned on me that maybe BTF tags could be used for this somehow. In the struct_ops definition, you can tag arbitrary struct_ops callback pointers there, and also tag the kfuncs with something similar.

It's an interesting suggestion. I think you could use that to specify which specific callbacks a kfunc is allowed to be invoked from. But if we also wanted verification of proper nesting and whatnot, I think that would have to be done at runtime. For just statically defining which callbacks you can be invoked from, though, that's a possibility for sure. Honestly, I think we need something in the short term.
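Going back to the filter suggestion from that exchange, here's a hedged sketch of the verification-time filter idea, loosely modeled on the bpf_sock_destroy patch set. The exact signatures may differ from what eventually landed, and the prog_implements_enqueue() helper is a purely hypothetical placeholder for whatever check a subsystem would do.

```c
#include <linux/btf.h>
#include <linux/btf_ids.h>

BTF_SET8_START(scx_dispatch_kfunc_ids)
BTF_ID_FLAGS(func, scx_bpf_dispatch)
BTF_SET8_END(scx_dispatch_kfunc_ids)

/* Invoked by the verifier for each kfunc call site; the prog pointer
 * lets us reject calls based on properties of the calling program,
 * e.g. its expected attach type. Returning nonzero rejects the call. */
static int scx_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
{
	if (!btf_id_set8_contains(&scx_dispatch_kfunc_ids, kfunc_id))
		return 0;
	/* hypothetical helper: does this prog implement ops.enqueue? */
	return prog_implements_enqueue(prog) ? 0 : -EACCES;
}

static const struct btf_kfunc_id_set scx_kfunc_set = {
	.owner	= THIS_MODULE,
	.set	= &scx_dispatch_kfunc_ids,
	.filter	= scx_kfunc_filter,
};
```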
It's not critical, but it would put a lot less burden on people implementing kfuncs if we had something like this, so that you don't have to protect yourself to be correct. Okay, awesome.

So the next one is pretty harebrained, even more so than that one, but it's kind of interesting, and it goes a little bit towards usability, which we were alluding to earlier. The feature is called local storage user space mapping. And as you probably guessed, yeah, this will be very complex and it needs a lot more thought and detail. I wanted to talk about it in the room because I wanted to get your thoughts and feedback on it, but it needs a lot more really thinking through all the implications for this to work.

So we have local storage maps where, if you have a kernel object, and I may be missing one, but currently if you have a task_struct or you have a cgroup struct, you can define a map type that implements storage for each of those object types. For example, in sched_ext, if you have a task_struct that's passed to the enqueue callback, you could have a BPF_MAP_TYPE_TASK_STORAGE map where you store something for that task. You pass the task as a key, get back a pointer to some memory, and then read or write that memory.

This is what the map would look like: we have a boolean that will force the task to be local, meaning stay on the current CPU, and then we have a per-task CPU mask which specifies which CPUs we want the task to be able to run on, and we could get that by querying it. And this is an example of how you create the storage: you pass this flag to bpf_task_storage_get(), along with a pointer to the map, and it'll return a pointer to the storage if it's able to allocate it.

And here's an example of it actually being used. Okay, yeah, great. On the left side there, we check whether we can find an idle CPU for the task to run on, and we set force_local. Then later, on the right side, in the enqueue callback, if that flag is set, we just keep the task on the current CPU, and we unset the flag so we're ready for next time. So it's a very useful API; it's been extremely useful for sched_ext.

In terms of how it's implemented, I won't go into too much detail, because I don't fully understand it, but the gist is that in the task_struct, and likewise in the cgroup struct, we have this BPF local storage object. In that object we have a small cache where we cache local storage entries for that task, and otherwise we have a list where we store all of them if they're not in the cache. The idea is that if, in the extreme case, you have a thousand local storage entries, because you have all these different maps for the task, you have a cache so that you don't have to do an O(n) lookup every time. And that's how it works; we don't need to look at the allocator.

Okay, so the feature would be... and the last thing I should say, which I didn't put in the presentation, is that you can't access this from user space at all, at least as far as I understand. This is only for use in the kernel program directly. That being said, a lot of people need to access it from user space.
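Here's a minimal sketch of the pattern from those slides. The map type, flags, and bpf_task_storage_get() are the real BPF API; the struct fields and force_local logic just mirror the slide example, and the struct_ops section name follows the earlier sketch's assumptions.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct task_ctx {
	bool force_local;	/* keep the task on the current CPU */
	/* a per-task allowed-CPU mask could live here too */
};

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct task_ctx);
} task_ctx_map SEC(".maps");

SEC("struct_ops/enqueue")
void BPF_PROG(example_enqueue, struct task_struct *p, u64 enq_flags)
{
	struct task_ctx *tctx;

	/* The task itself is the key; create the entry if missing. */
	tctx = bpf_task_storage_get(&task_ctx_map, p, NULL,
				    BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (!tctx)
		return;

	if (tctx->force_local) {
		tctx->force_local = false;	/* unset for next time */
		/* ...dispatch to the current CPU here... */
	}
}

char _license[] SEC("license") = "GPL";
```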
So at Google, in ghOSt, they define, apparently, this giant array map, which contains enough entries for every task on the system, and if they want to set some per-task context from user space, they just index it by PID and set the context. We use something similar in sched_ext in our main scheduler, which we call Atropos, a load balancer written in Rust that runs in user space. We set state in what's essentially a hash map indexed by PID, and then in kernel space you do a lookup, and that informs whether you should load balance between different CPUs and so on.

So the idea is, there are ways around it. You can use array maps, you can use hash maps; there's a rough sketch of that workaround below. But what you really want is the map type that kernel space is using. So yes, you can use statically sized maps, but it's wasteful, and the point of having this cgroup storage and task storage is so that you don't have to do this kind of thing.

Yeah, so how would we do it, though? It's a very, very difficult problem, I think, because, as far as I know, we don't really have variable mappings in libbpf yet, and right now we're putting the storage directly in the task, so it would be an entirely different approach. The only way I can think of for now is to have something like a local storage allocator where, for instance... John, did you want to say something?

Sure. Do you mind if I interrupt your talk? No? Okay, cool. Just one comment about the size of the preallocated map. I was worried about this for a long time, but then I realized that, depending on how big your objects are, even if the objects you're sticking in there are a couple hundred bytes, a four-megabyte hash map can fit around 32,000 PIDs. So my point is, I was really worried about the size of that map for a long time, and it's actually not that big, right?

Well, okay, yes, it depends on context. In some contexts, on a server, four megs is probably not going to be a big deal. If you want BPF to also be able to run on, say, VR devices and things like that, it might be less applicable. I agree, though; it's not like you're taking up gigs of memory or anything like that.

Yeah, that was all; it was just a side comment.

Yeah, that's a good point, though. And for me, the main thing is that this is clearly something that would be useful to people, right? People are using it; we have per-task storage in the kernel and we have per-cgroup storage for a reason. So it feels like if we could connect those two and somehow use this from user space, we'd be addressing these use cases and kind of unifying the intention of those local storage map types. But maybe it's not worth it, because it's really complicated. I think you would have to have a local storage allocator where, when you create a local storage entry in a map, an allocator lazily allocates pages and basically jams the entries onto those pages, which are then mapped from user space.
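As promised above, here's a concrete sketch of the PID-keyed workaround. The map and field names are illustrative (struct task_ctx is the type from the earlier sketch); the map operations themselves are the real API. On the BPF side:

```c
struct task_ctx {
	bool force_local;
};

/* Per-task context in an ordinary hash map keyed by PID. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 32768);	/* sized for "every task" */
	__type(key, __s32);		/* PID */
	__type(value, struct task_ctx);
} task_ctx_by_pid SEC(".maps");

SEC("struct_ops/enqueue")
void BPF_PROG(wa_enqueue, struct task_struct *p, u64 enq_flags)
{
	__s32 pid = p->pid;
	struct task_ctx *tctx;

	tctx = bpf_map_lookup_elem(&task_ctx_by_pid, &pid);
	if (tctx && tctx->force_local) {
		/* ...keep the task on this CPU... */
	}
}
```

And user space (via libbpf; the skeleton name is an assumption) just updates entries by PID:

```c
struct task_ctx ctx = { .force_local = true };
__s32 pid = target_pid;

bpf_map_update_elem(bpf_map__fd(skel->maps.task_ctx_by_pid),
		    &pid, &ctx, BPF_ANY);
```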
I don't think we would want to do any kind of runtime remapping in libbpf, so I imagine we would have to pre-map some range of memory, leave most of the pages unused, and then, as more entries are added, change the page table entries and do a TLB flush in order for user space to use them; otherwise they fault, or we zero-map it, or something like that.

So, sorry, is the root problem that you need a storage space that grows with the problem? I guess I'm missing what the "local" part of that storage is. Is it just that you need a storage block that you can grow as time goes on, or am I missing something?

So the local part is that it's per-task or per-cgroup. In this example, let's say we built this harebrained allocator I'm talking about. If you had 20 task local storage entries, that would be 20 entries across 20 different tasks, and you would jam those onto the minimum number of pages that you need, and that extent would be what gets mapped from user space. And in order to do that, you would need some kind of IDR layer for user space, because now you're not just indexing by PID, and you're not just doing a lookup on the task. So you would need some integration with libbpf where you could map a pidfd, or something like it, to the offset into this mapping that corresponds to the local storage entry for that task.

I mean, is the problem that you can't read the task local storage from user space, then? Is that what you're trying to solve?

You can't read it, yeah. It's embedded inside the task, so you can't.

Actually, one comment. I'm not a task guy, but you need what they call a pidfd. It's a regular map lookup, right? Task local storage is a map, so you say bpf_map_lookup_elem(), and the key will be the pidfd. And you do that for every single task.

Yeah, so that's the thing. This is also being done on, like, every scheduling operation; for ghOSt, they're actually running the scheduler in user space for every task. But I guess you're right, it is possible to read it from user space; I wasn't aware that you could do that. Although we'd probably also want writable entries from user space; for the ghOSt and sched_ext use cases, that would be applicable.

It is writable also.

It is? Yeah?

And you can use a BPF iterator to go over all this local storage; we already have that for socket local storage, so you can get all the up-to-date information. Otherwise, with tasks coming and going and sockets coming and going, it would be really complicated to manage keeping this in sync with the kernel.

How would you do a writable mapping if you have, like, a kptr or something in there?

No, you don't do the writable mapping; you can just use the existing interface and do an update with an FD, for example the task FD, the pidfd, and you should be able to write the information.

Do you have to make a system call to do the write?

Yeah.

Okay, so yeah, it's not quite the same thing, right?
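To spell out the access path raised in that exchange: task local storage behaves like a regular map from user space, keyed by a pidfd. Here's a hedged sketch; pidfd_open and the libbpf calls are real APIs, while struct task_ctx is the illustrative type from earlier.

```c
#include <sys/syscall.h>
#include <unistd.h>
#include <bpf/bpf.h>

struct task_ctx {
	bool force_local;
};

static int touch_task_ctx(int map_fd, pid_t pid)
{
	struct task_ctx ctx;
	int pidfd, err;

	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;

	/* One syscall per read... */
	err = bpf_map_lookup_elem(map_fd, &pidfd, &ctx);
	if (!err) {
		ctx.force_local = true;
		/* ...and one per write: hence the overhead concern. */
		err = bpf_map_update_elem(map_fd, &pidfd, &ctx, BPF_EXIST);
	}

	close(pidfd);
	return err;
}
```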
It's not the same thing; it's a little less performant, because a memory map is much faster.

What I see is just that the memory map approach seems more complex.

Yeah, I know; see the second slide. It's definitely complex, and it might not be worth doing. The truth is... there's a comment back there? I'll say something while you get set up. Oh, you're good. The truth is, it would never be okay for ghOSt or for Atropos to be using something like an iterator where you're making this call repeatedly, or doing something like that. It would just never be usable in production, I think, with that kind of overhead.

He said that with iterators you can specify a parameter with an FD and just poke into that specific one.

Okay, well, I'll let you talk on the next slide.

Does what you're suggesting imply that for every release of a task_struct, for example, you would have to do a TLB shootdown? Because if you need to unmap the page, whenever a task is released you would have to unmap a full page, so you would have to deal with...

Well, it wouldn't be one task per page; it would be multiple task local storage entries per page.

So you would need to do copy-on-write as well?

You would probably need to do a TLB shootdown if you were, say, defragmenting: if all of the entries in a page are gone, we're going to free that page, and you would have to do a shootdown so that the page wasn't mapped anymore.

But if the task_struct is freed, you cannot keep it mapped, right?

The task_struct itself is not mapped; these entries are separate storage altogether. In the task_struct, that's where we have the list head, but the actual storage is just allocated with the BPF memory allocator.

Then when a task dies, do you create holes in your page?

Yeah, we'd have to figure that out. There are a few issues I also didn't talk about, fragmentation again being one. We would eventually probably want to get rid of all the holes in a page and aggregate the entries, and I don't know how you would synchronize with user space to do that. Andrii?

How frequently do you plan to access this task local storage?

Very, very frequently. For ghOSt, they're doing it literally for every enqueue operation, for multiple different scheduling operations. For sched_ext, we do load balancing fairly infrequently, so it wouldn't be quite as important; you can configure it, but it's accessed every couple of seconds, so it's not as bad. I mean, again, it's doable with an array map type; it's fine. It's just that this is another example of multiple people hacking around the same missing thing, if you want to call it a hack. Martin?

So if I understand correctly, the use case is that once in a while, you want to update the task local storage for every task in this task local storage map, so you do have to iterate?

Yeah, you have to iterate frequently, especially if you're doing load balancing. Or you just want to do a quick lookup: if you're just doing enqueue for a specific task, you would just want to do a read and then potentially a write, like which CPU it should be dispatched to or something like that. So it's sort of both use cases.
So if you update often and it's more like a batch operation, maybe one comment: this local storage socket map has an iterator already. It iterates and runs a BPF program over all the sockets in the map. So maybe something similar could be done for the task local storage.

Yeah, maybe; it's hard to say. I mean, again, this is the scheduling path, right? So you're trying to get the scheduling decision done within, like, tens of microseconds at most. So having round trips between user space and kernel space could be problematic. But maybe not; it's something worth exploring for sure.

So if I understand correctly, the problem is that the interface to read and write is a syscall, and that's too slow, because you want to do a lot of this from user space. So why isn't more of it in BPF, I guess? What's preventing moving the parts that need a lot of iterating or updating into BPF, where the overhead is a lot lower?

That's a great question. The reason it doesn't live in BPF for ghOSt is that ghOSt is specifically a user space scheduling framework. In general, for sched_ext, for Atropos, the one we were talking about, we only do load balancing in user space. You could do it in BPF, but we're trying to find a layering where we push the really complex stuff out of BPF and into user space when it's not run quite as frequently, and then in BPF we do the hot paths like enqueue and stuff like that. So ghOSt really does need the extremely fast lookups from user space. But even then, the correct abstraction for sched_ext, where this isn't being accessed very frequently, would in my opinion be the local storage map type, because that's what the map is sort of designed for. So for that case it's not really about performance so much. All right. Okay, last comment.

With the BPF iterator, you can do a lot of aggregation in your BPF program and summarize the crucial results in a map, for example, or in a per-CPU map, or a global variable, a global array. Then user space can use that global array, with your partial analysis results, to make decisions. So you have two layers: you have a global array, and the BPF program iterates over all the tasks and puts partial information into that global array. It's memory-mapped, and user space can use that memory-mapped global array to make a quick decision.

Is that map writable as well?

I think so.

Readable and writable, right? It's a global variable, right?

Yeah. I mean, again, like statically sized arrays, yeah.

It should be readable and writable, and the BPF program has access to that global array as well.

Yeah, it's worth exploring. Okay. Thanks, everybody.

Thank you.
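Here's a hedged sketch of that two-layer pattern: a task iterator aggregates per-task data into a plain global array, which libbpf memory-maps into user space through the skeleton, so user space can read it without syscalls. The field choice and modulo indexing are illustrative only.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_SLOTS 4096

/* Plain globals land in a mmapable array map (.bss), so user space
 * can read them directly through the skeleton (skel->bss->stats). */
__u64 stats[MAX_SLOTS];

SEC("iter/task")
int aggregate_tasks(struct bpf_iter__task *ctx)
{
	struct task_struct *task = ctx->task;

	if (!task)
		return 0;
	/* illustrative: bucket per-task runtime by PID */
	stats[task->pid % MAX_SLOTS] = task->se.sum_exec_runtime;
	return 0;
}

char _license[] SEC("license") = "GPL";
```

User space would attach this with bpf_program__attach_iter(), trigger a pass by read()ing the iterator FD, and then consult stats[] in the memory-mapped region with no further syscalls.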