Yeah. Hey everybody. As a reminder, I'm David. I work at Meta. Today we're going to be talking about a project we've been working on for about a year and a half at Meta called sched_ext. I'll go over first what it is, what the APIs look like, and how you use it. Then we'll also talk about why we need it and what kind of benefits we've gotten from it at Meta. The meat of the discussion is probably going to be the middle section, common objections. Most people that really dislike it, or that it makes very nervous, usually say one of two or three things, and as a community I think there are hopefully some things we can align on as far as how to address those objections. Then we'll talk about some interesting changes to BPF that we had to make, if we have time. That's not strictly necessary for the discussion, but there's a lot of good material there too.

What is sched_ext? It's a new scheduling class in the kernel, a new scheduling policy in the kernel, that lets you implement scheduling policies in BPF. The base abstraction for a scheduling policy directly in the kernel is a sched class. There's a whole bunch of callbacks you have to implement, and they're extremely complicated; you have to know the entire internal implementation of the core scheduler to implement that struct, and the abstractions are very leaky. We implemented a new sched class that uses a struct_ops BPF program so you can implement the policy in BPF, with a much more straightforward, self-contained interface, we hope. Let's take a look.

Let's first start with why we're doing this. There are a few reasons. Hardware is getting pretty interesting, as we all know. AMD, as an example, released a chip a few months ago that has two CCXs, two L3 caches, on a single socket. One of the two CCXs has 3D V-Cache sitting on top of it, which means one of the two has good cache locality but thermal throttling problems, because they can't dissipate the heat out of the 3D V-Cache. The other one has worse cache locality, but you can power the cores more aggressively so they can run at a higher frequency. This is an extremely difficult scheduling problem to get right. It regresses lots of games on Steam by 50% on Linux if they're highly parallel and CPU bound; on Windows it performs better. I don't know why, but maybe you can tell us.

We want to be able to experiment quickly. It's very difficult to experiment in the core scheduler itself. Like I said, you have to understand the entire implementation, tens of thousands of lines of extremely complex logic, and you can crash the host. There are all the normal drawbacks that come with running core kernel code instead of BPF, so we want to be able to iterate quickly. We also want to be able to implement bespoke schedulers, schedulers that are targeted at specific applications, at least at first, and then maybe generalize them through experimentation. At Meta, we were able to build a scheduler that gives us about a 1.5% to 3% increase in throughput for our main web workload, and that figure includes a custom patch that we have in CFS as well, but with just sched_ext we got that 1.5% to 3%, and about a 3% to 6% improvement in P99 latency as well. Obviously at the scale of a large company that's quite a lot of capacity savings, so it's worked out quite well for us. And then you can do other things. You can roll out new policies.
So if we ever have another variant of L1TF, instead of waiting a year for the mitigation to be designed, implemented, and rolled out to a new kernel, which you then have to roll out to your fleet, you can implement something to mitigate it in sched_ext and BPF, roll it out, and you don't have to disrupt your whole fleet to deal with it. And then of course, moving some policy decisions and complexity into user space. One of the schedulers we wrote has a load balancer written in Rust in user space, with all the hot paths living in the kernel. So yeah, we get a lot of flexibility with sched_ext. But as we'll see, that last line makes a lot of people really nervous, so we'll have to figure out what to do about that.

Okay, so this is the crux of what it is. You implement a struct_ops struct. The callbacks are relatively self-explanatory, or at least we tried to make them that way. For example, select_cpu is called when a task first wakes up, and whatever CPU you return is the CPU the task is migrated to at wakeup time. It doesn't have to be the CPU it ultimately runs on, but it's an optimization to try to decide where it should go. For example, in select_cpu we will oftentimes just put the task into a global FIFO queue that we pull from when a core is about to go idle, and that lets us increase CPU utilization and get some good perf out of the scheduler. There's enqueue, of course, dequeue, and every state change: when a task becomes runnable, when it's running, when it's stopping, et cetera. You don't have to implement all of them; the only thing you have to implement is the name of the scheduler. We have default behavior for all the other ones. But that's how you implement a BPF scheduler.

So here's an example of a really simple one. This flag here, switch_partial, you can set when you open the program, before you load it. That lets you tell sched_ext whether you want to switch only some tasks to sched_ext, or whether you want to switch all of them. We recommend switching all of them, because sched_ext technically runs at a lower priority than CFS. If you have co-located CFS and sched_ext tasks on a CPU, it doesn't really work out well, because CFS will starve sched_ext on that CPU. So for a default scheduler you usually want one or the other, but you could partition the host or something like that, so it's an option for you. And if you want to switch them all, there's a kfunc you can call which does it for you in the main kernel. More interestingly, on the enqueue path, we can check some enqueue flags to see if we should locally schedule the task, that is, keep it on the current CPU. There's another kfunc here called scx_bpf_dispatch(), which I'll go into more detail on in a later slide. You pass it the task, you pass it what we call a dispatch queue ID, and you pass some enqueue flags as well; I'll go into more detail about that. And then there's this exit handler where you can print whatever you want: why did you exit, print some stats. Yeah, it's BPF.

Okay, so what are dispatch queues? They're the main abstraction for bringing tasks out of the BPF scheduler and into the main kernel. There's an impedance mismatch, obviously, between what you want to do in BPF and what the actual system can do. Every CPU has what we call a local dispatch queue, which is a FIFO queue (you can also make it a weighted vtime queue if you want, but FIFO is a good mental model).
And so that's the queue you put a task in if you want it to be scheduled on that CPU; it's the FIFO that the core scheduler pulls from in pick_next_task, the callback in the sched class. But you can create any number of dispatch queues that aren't the local ones. For example, you could have a FIFO per CCX or per L3 cache, you could have one globally if it's a single-socket, single-CCX machine, or one per cgroup, so you could do FIFO scheduling within a cgroup. At various stages of the task scheduling lifecycle you can put the task on a dispatch queue and later consume it to actually pull it onto the local CPU, or you can do what's called a direct dispatch if you want to put it directly on a CPU's local queue. Now, it gets a little bit confusing because we have other data structures that you can use in BPF directly. For example, with the rbtrees that Dave Marchevsky implemented for us, you can implement your own weighted vtime tree directly in the BPF scheduler. That's not the same thing as a dispatch queue. The core kernel actually understands what dispatch queues are; we maintain them for you, we synchronize access to them for you, and so on. But if you have tasks queued in BPF, like in an rbtree, that's only visible to BPF, and eventually you have to put them onto a dispatch queue to be able to run them on a CPU.

So there are two main operations for interacting with a dispatch queue: the dispatch operation, scx_bpf_dispatch(), and the consume operation, scx_bpf_consume(). Hopefully I'm not being too repetitive here, but the idea is that when you want to put a task onto a FIFO that the core kernel understands, you dispatch it. You can also do a direct dispatch, where you put a task directly on the local dispatch queue for a CPU. Or, if you don't have any tasks on the local dispatch queue, you can consume a task from a dispatch queue, which pulls it off of that dispatch queue and puts it onto your local dispatch queue for the CPU. So you can do dispatching, which is essentially enqueueing on the FIFO, or consuming, which is pulling a task off and actually running it on your CPU.

So here's a flowchart example. We don't have to spend 20 minutes understanding all of it, but we can go through it quickly. If a task is waking up, we call the select_cpu callback that I mentioned, which will migrate it to the CPU you return. The task then becomes runnable, so we call the enqueue callback. In enqueue, you can dispatch the task, or you can queue it directly in the BPF scheduler. If you did a direct dispatch, you said, I want to run this on my CPU right now, maybe because in select_cpu you detected that the CPU you migrated it to is going to go idle. Then you don't need to go through the whole enqueue path; you just schedule it directly and let it run. Or, if you're over-committed and you actually have to share the CPU, you can queue it in an rbtree, or put it on a global dispatch queue that isn't where it's going to run immediately; you can queue it in BPF. And then on the other side, which is a representation of the dispatch path, or the balance path as we call it in the scheduler, which essentially means the CPU is going to go idle if it doesn't find another task, this is what we do in sched_ext: we check to see if you have any tasks to run, and if we do, great, we return. Otherwise we call the dispatch callback, where you can either dispatch tasks to a dispatch queue or consume them.
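To make that flow a bit more concrete, here's a minimal sketch of what a scheduler along these lines might look like in BPF. It is not one of the upstream example schedulers; the BPF_STRUCT_OPS macro and the "scx_common.bpf.h" header are assumed from the example schedulers in the patch set, and the kfunc names and constants (scx_bpf_dispatch(), scx_bpf_consume(), scx_bpf_create_dsq(), scx_bpf_switch_all(), SCX_DSQ_LOCAL, SCX_ENQ_LOCAL, SCX_SLICE_DFL) follow the talk's description, so treat the details as approximate rather than authoritative.

```c
/* Rough sketch of a shared-FIFO sched_ext scheduler, following the flow
 * described above. Names and signatures are approximate. */
#include "scx_common.bpf.h"	/* assumed helper header from the example schedulers */

#define SHARED_DSQ_ID	0	/* arbitrary ID for our single shared dispatch queue */

s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
{
	/* Keep it trivial: leave the task where it last ran. A real scheduler
	 * would look for an idle CPU here. */
	return prev_cpu;
}

void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
	if (enq_flags & SCX_ENQ_LOCAL)
		/* Direct dispatch: put the task straight on the current CPU's
		 * local dispatch queue with the default slice. */
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
	else
		/* Otherwise queue it on the shared FIFO that every CPU pulls from. */
		scx_bpf_dispatch(p, SHARED_DSQ_ID, SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Balance path: the CPU has nothing on its local queue, so try to pull
	 * a task off the shared FIFO onto the local dispatch queue. */
	scx_bpf_consume(SHARED_DSQ_ID);
}

s32 BPF_STRUCT_OPS(simple_init)
{
	/* Switch every task to sched_ext rather than only opted-in ones. */
	scx_bpf_switch_all();
	/* Create the shared dispatch queue on any NUMA node. */
	return scx_bpf_create_dsq(SHARED_DSQ_ID, -1);
}

void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
{
	/* Report that the scheduler was unloaded (error, kicked out, etc.). */
	bpf_printk("simple scheduler exited");
}

SEC(".struct_ops")
struct sched_ext_ops simple_ops = {
	.select_cpu	= (void *)simple_select_cpu,
	.enqueue	= (void *)simple_enqueue,
	.dispatch	= (void *)simple_dispatch,
	.init		= (void *)simple_init,
	.exit		= (void *)simple_exit,
	.name		= "simple",
};

char _license[] SEC("license") = "GPL";	/* non-GPL schedulers are rejected */
```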
So yeah, it takes some getting used to to understand the terminology, but we tried to make it self-contained in terms of the lifecycle of these tasks and how to run them. Okay, so before we go on, I'm sure there are a million details that I either papered over or didn't even talk about. Any questions from anybody?

Just a small question. Can you write a scheduler that will bypass the affinities set by the user?

You could... actually, no. We check the cpus_ptr mask in the task when you try to schedule it on a CPU, and the core kernel, ext.c, not sched/core.c, will check that you're not doing something you're not actually allowed to do, and if you do something wrong, we'll kick the scheduler out and go back to CFS. For example, if you returned a CPU from select_cpu that was, say, negative 20, we'll see that it's bogus and just kick the scheduler out. So there's some sanity checking in the core scheduler. But what we did for the scheduler I alluded to for Meta's web workload is soft affinities. You have a cgroup, and we give some cores to that cgroup, let's say 16 cores in the system; we give the cgroup priority to run on that subset of the system, but if it doesn't need all of them, we can pull in tasks from other cgroups and try to keep CPU utilization high. That's the problem with hard affinities, right? You're going to underutilize the host; that's the downside. Something like sched_ext gives you a little more flexibility to blur the lines between a hard affinity and just having the whole host be open.

So to make sure I understand: you will fail on a bogus CPU, but you will not check that it respects whatever the user has set with sched_setaffinity or something?

Yeah, so if you make a mistake in the scheduler, we will check that in the core kernel and make sure you can't do anything that would corrupt the host, crash the host, or perform an incorrect operation. That includes, for example, a task that became runnable and then just sat there for 30 seconds doing nothing; that's indicative of a hung task, which shouldn't happen in a correct system, so we'll kick the scheduler out and go back to CFS. So the affinities you can't violate, but again, you could create new abstractions for providing resources in a looser manner. Yeah.

When you transition from one scheduler implementation to another, you may have some tasks in your internal queues. Is there any way to hand over the tasks in your queues to the new scheduler?

So are you asking, if you've already locally dispatched a task, can you move it to another dispatch queue? Or?

No, no, I mean: you have one scheduler implementation, and you transition from it to another one, but you still have some tasks in your local queues that haven't been dispatched to the CPU queues yet.

So are you asking essentially how we prevent races where you have a task that was enqueued, but now it's in CFS, and you don't want to incorrectly dispatch it if it's no longer running in sched_ext?
Or are you asking, if you have a task that's queued on CPU A, can you balance it and dispatch it to a remote CPU?

No, I mean, for example, you've decided to keep the task locally and it hasn't been dispatched to the CPU queue yet. At some point you will dispatch that task to the CPU, right? But at that point you've transitioned to the other, new scheduler.

So you realize after you've dispatched it to the local CPU that you should have dispatched it to a different one, to increase utilization? Is that what you're saying?

No, I mean the transition from one scheduler to the other scheduler.

Right, okay, so you're worried about running a task that should have been in CFS because you changed sched classes, and how we prevent those races. Okay, so the answer is, we'll check whether it's still in sched_ext by the time we dispatch it, but when you change sched classes, you have to do a dequeue to deactivate the task, and then you re-enqueue it in the new scheduler. It's a pretty slow, well-synchronized operation in the core scheduler. The scheduler gets the dequeue callback when the task is deactivated, and there's another callback I didn't show where a task can be disabled in the scheduler as well, so you get a callback when it's leaving. And then it would be gone, and we would just verify that you didn't try to dispatch a bogus task on the ext.c side. Another implication here is that you can actually have spurious dispatches: you could try to dispatch a task to multiple CPUs, and we'll just detect that in ext.c; that's one of the things we were able to do to try to simplify things for the schedulers. So the answer is, yeah, it's fine; we'll detect it, and it's synchronized with that deactivate/activate workflow. Yeah. Cool.

All right, so we can move on to the juicy stuff. There's a common set of objections that I hear from people, like I said. Most people that I've spoken to, almost everybody except maybe literally two to four people, think that it's useful. They see the utility of having a pluggable scheduler system, a system you can easily experiment with, so that you can then implement features in CFS and upstream them to the main, general scheduler. So the objections typically aren't about design or functionality, though if you have thoughts, obviously we'd love to hear them and we'll incorporate them. They're more about the soft implications of something like sched_ext. I don't agree with many of them, but they're rooted in a larger question about BPF and the ecosystem, so I think they're important to talk about now. And I think they're especially relevant now that we're moving to the newer model of BPF, where we're not using UAPI, things are really defined in the core kernel, and we're implementing core kernel functionality like sched_ext. So the first one is that it's going to kill CFS contributions, and more generally that it's going to kill any upstream contributions to the scheduler, or to whatever subsystem we have a struct_ops implementation in.
So yeah, not to misquote anybody, but folks are saying that schedulers are going to stay out of tree: as soon as you have the ability to load a BPF scheduler that doesn't taint the kernel, vendors are going to ship their own schedulers, people aren't going to contribute anymore, and you're going to have non-GPL schedulers. That last part is just wrong, because we actually check that in the verifier, but these are the kinds of things that people say. I'm sure there's some truth to what they're saying, but I think we as a community need to come up with a general philosophy towards these questions. For example, and I'll talk about this on another slide too, for modules there's a very clear incentive to upstream them, right? If you break something in an upstream module, that's a bug. If you break something in an out-of-tree module, nobody cares, nobody's going to give you support, anything like that. With BPF it's sort of a middle ground: you can load a BPF program that's not in tree, and the only time we would fix something is if you crashed the kernel or the verifier was wrong, but that's the point, we would fix it. It doesn't taint the kernel. So for sched_ext, we can tell people things like: okay, if you upstream a scheduler and the struct_ops interface changes, we won't let your scheduler break. We'll change your scheduler to use the new callbacks, and we'll prevent build breakages and even performance regressions. But that hasn't really been formalized in the community yet; I wouldn't say that we have a very clear philosophy or policy for upstreamed BPF programs. I know that we have upstreamed some, but do folks have any thoughts about what it really means to upstream BPF, how it will be consumed, and whether we can come up with a policy like the one for modules, which is crystal clear?

I'm probably commenting on something else, but just a question that popped into my head. This looks kind of all or nothing, right? Either you're using this sched_ext class or you're using CFS. I guess there might be some middle ground where, for a task, you say, okay, I'm overriding some of the CFS policies, but the rest, for example in your AMD case, is: I know I need to prefer cores one, two, three, four, and for the rest, let CFS play with it.

You can't do that, because CFS is a weighted fair vtime scheduler, right? So if you implement certain callbacks in sched_ext, like deciding where to migrate a task, it doesn't really fit in with the math of the scheduler at a higher level. The complexity is so insane in the core scheduler that things happening in a callback that is supposedly for migration are actually core math for the whole algorithm. So it's an interesting idea, but I think we have to think about partitioning the system for schedulers per core, not per task.

Because, I know, for example, Android, right? They are notorious for vendors doing custom CFS performance value-add hacks.

Well, that's exactly right. When I was in Italy at OSPM a few weeks ago, there were a few people to whom I basically said, okay, what do you think is going to happen, who would stop contributing to CFS if we did this?
And the people I spoke to couldn't think of anybody, because it's a pretty small community. But more importantly, the situation couldn't really get any worse, is how I see it. You have Android, which has lots of vendor hooks, because they had to for performance reasons. You have the Steam Deck and Valve; the Linux gaming community has so many out-of-tree patches. And that's fine, we can tell people that. But I think, in order to answer this question confidently, we have to come up with a policy that formalizes incentivizing people to upstream BPF programs. For XDP programs and things like that, sometimes you want to upstream or open source them, and obviously there's Cilium and lots of things like that. But for upstreaming into the kernel tree, and treating these programs more like modules than like standalone BPF programs that implement one specific thing, a trigger or something like that, we have to figure out what it means to have a BPF module. Is it different than XDP? Is it the same as a module, or is it midway between?

I guess from my point of view, for whatever BPF programs we'd run at Google, there's really nothing to upstream; it's custom business logic no one cares about. I would assume the scheduler you have is similar, right? If it's my magic process, treat it really carefully, put it here and there, and the rest is whatever.

So there are schedulers you can upstream. Yeah, I probably should take that line off the slide, because that one is literally bespoke. But for example, one of the example schedulers we implemented is a tickless scheduler. If you have a bunch of VMs, a bunch of vCPUs running on a host, you could have a scheduler where all decisions are made from a single CPU, and you send a resched IPI when you want to trigger a reschedule on a core. That avoids VM exits due to timer interrupts; you can do things that in CFS don't make any sense at all. In other words, you could imagine a heavily optimized VM scheduler that you would upstream, and that really is generalizable. Or even the bespoke scheduler would probably be appropriately generalized as a scheduler that maps CPUs loosely to cgroups, right? In the one that I wrote, I'm literally checking the task name and doing something if it's an HHVM thread, but that's because we're experimenting; eventually we wouldn't do that.

To play devil's advocate, I guess, for the scheduler folks: if you're upstreaming a BPF scheduler, why don't you just write it in C and call it another scheduler?

Well, that you can't do, yes. That's another story. When you add a feature to the scheduler, you can't just add another scheduler alongside it; Arm's work runs in CFS, right? There isn't a separate power-aware scheduler and a separate server scheduler. So our argument is, okay, fine, that's great, but you need somewhere to experiment with it.

Look, you're right. I mean, that could happen, but there's probably no answer; I'm just being hypothetical. Yeah, yeah.
And when you're done with the experiments, what do you do? About where to put the scheduler, have you looked at BPF preload?

Yeah, yeah.

So do you think that's a good candidate, or not, for some reason?

Potentially. Yeah, something where you have a skeleton, essentially, that you check in. But it's a little different, right? Because this is about upstreamability; this is about the guarantees that you get without any UAPI or anything like that. So in terms of where to put it, yeah, maybe that's a good idea; we can put it there. But we need to figure that part out, the mechanical part, and I think we need to figure out the policy, just like for kfuncs we decided there's no strict ABI stability. That's the part, at least in my opinion: if you're a scheduler person, you don't care whether we put the scheduler in kernel/sched, in preload, or in BPF; I think they're not as concerned about that part. I think it's more that they don't believe anybody will ever put anything anywhere, you know?

For the logistical part, I think once it's in tree, there's a way to test whether it builds, right? You change your kfunc, and there's a way to know whether this scheduler is broken or not.

Right, okay. So you think that we should basically gate it on the build? It sounds like you're arguing for the guarantee being similar to a module, where you cannot regress the scheduler if it's upstream.

You cannot regress the build. And if you put something there, of course, it has to be something meaningful, not some random shit you just put there to make sure no one removes your kfunc. Maybe that's a way to make sure no one removes your kfunc; I don't know whether we want that, but that's probably one way to do it.

Well, there's going to be a bar for upstreaming your schedulers too; I consider it similar to a file system. So yeah, there might be somebody who says, I really don't want this kfunc to go away, but if they have a scheduler that needs the kfunc, that's actually fine: they upstreamed it, there's a legitimate use case.

And I want to add: maybe whoever is sharing their scheduler doesn't see the value, but I would say for many people reading the patch, there's definitely value. For example, I have very little background in the scheduler. I want to look at, hey, why did people write this scheduler? Why are they making this decision? Why does Google use this and Meta use something different? It's just a good learning experience for everyone. So from the community point of view, there's definitely value in sharing what people are using.

Yeah, I mean, that's open source. So I think, Daniel, did you want to say something? Okay. So that's one argument, the build breakage, and that's what I've been telling people, so I'm glad you agree. But there are people that say, fine, but it doesn't matter, because you're going to have... well, let's move on to the next one. Okay, let's talk about this one first. So people are like, fine, we can upstream some schedulers.
Let's say the CFS thing won't be a problem, but now distros are going to have to support vendor schedulers; everybody who writes an out-of-tree BPF scheduler, now we have to support that, and it's going to be a nightmare. Like, what if this vendor's performance regresses, or whatever? Juri brought this point up, Juri Lelli, who works at Red Hat and is a scheduler maintainer. Now, obviously this isn't the first time a distro has had to worry about BPF; this isn't a new situation for distros. RHEL markets itself as an excellent distro for XDP, so there are actually benefits to BPF for distros. But given that this is a very core piece of the kernel, this is the scheduler, do we need to come up with our own philosophy around the responsibility of the larger ecosystem, or our expectations for what the kernel community would have to support? Like, if you load a BPF scheduler that doesn't schedule some I/O threads and you get an I/O timeout, is that not... basically, people are saying that we need to taint the kernel for out-of-tree schedulers, to protect distros and to protect developers. Daniel and I spoke about this earlier and you made a very valid point, which is that it's probably going to discourage a lot of people from using the scheduler.

Yeah, I mean, I think customers and others, if this would taint the kernel, then they are basically... back when I worked at Red Hat, it's the first thing we would look at, and then you say, well, I don't support you because your kernel is tainted.

Yeah.

I think maybe the way the question could be asked from the distro support side is: do you run into this problem with the upstream CFS scheduler as well, or is this really specific to the scheduler program that you run? But I mean, I would be against tainting, because that would...

Yeah, I am too, I am too, but it's hard to... a lot of this is just me finding myself in difficult conversations and trying to brainstorm with everybody. It's hard to argue with somebody who says something like, oh, this is extremely core relative to something like XDP, and for something this core it really deserves a taint, and if you need your scheduler to not taint the kernel, you would upstream it. But I do agree with you, Daniel. The whole point of this is that you can use BPF, so it would suck if we had to taint it.

So to add a little bit to it, we can actually distinguish different taints. One thing is, I think what Daniel is worried about is tainting everything, right? And taken to the extreme that doesn't make sense; it doesn't make sense to taint for classic BPF. Somebody runs tcpdump and there's a taint, right? BPF is loaded, but does that make sense? Not really. But at the other extreme, for example, think about out-of-tree kernel modules that expose kfuncs, and a kfunc potentially doing some garbage: BPF calls it, that module does something, and that could cause the BPF program to misbehave. That's also potentially possible, so we should somehow flag the situation.
So it's not scheduler-related specifically, but since kfuncs can be in modules, I could see some sort of flag, not necessarily a taint flag, I would call it a trait flag or something, so that people have a way to debug that kernel module. Like the example of the folks who were loading special kernel modules for Kubernetes with WebAssembly running in them and calling BPF from them. What that WebAssembly was doing, no idea, right? But then imagine a syzbot report saying, yeah, BPF is crashing because of that.

That makes sense, but in the example you gave, if you loaded an out-of-tree module it doesn't matter whether there were kfuncs or not, right? That would actually taint the whole kernel. When I think of just core sched_ext, you could mess up the system, but eventually we'll kick you out, we'll recover, we should go back to CFS, and it's a bug if we don't recover correctly. But I don't know; I agree with you, and I think it would be a lot easier to get it upstreamed if we had some kind of middle ground between a hard taint and an FYI that there's a new scheduler that isn't upstream, even a temporary state or something like that. But yeah, that's the other consideration.

And maybe the other thing that would be useful in this context: when you somehow crash, or when you get a splat, you print that a BPF scheduler was loaded, so the users get that information instead of a taint. Maybe that's a good idea to do regardless.

One of the first things we do when we get bugs now is just run bpftool and say, give us a list of all the programs, give us a list of all the maps, so that we know what's attached and running. So I'm not sure why the distros can't just adopt the same stance. If you're going to file a bug, they have their bug tooling that runs on crash, they have their crash dump thing; can't they just collect all the bpftool program info, which will very explicitly say a scheduler thing is attached, here's the program type, here's the ID, here's the cookie, or however you want to identify the different ones? They'll know, right? It shouldn't be difficult to know that this stuff is attached. And from what I've noticed, the crash tooling already collects it; you have a fingerprint of all the BPF programs and maps.

The problem is that knowing there was a scheduler isn't their worry, right? It's that they would have to debug a system that crashed with a scheduler that was some out-of-tree thing they didn't know about, or even a system that shouldn't have crashed. People are worried that they will have to support it; I think people are not convinced that we can do a very core scheduler safely in BPF, even though they're not right and we can.

I guess my point is, they'll get the crash dump, it'll tell them there was a scheduler attached, they'll look at the crash dump, and then they'll need to make a decision whether they say, hey, we're not going to support you because you ran this, or they will, right? I'm not sure they need anything more than what they already have.

Yeah, I mean, I agree. I agree with you, I agree with Daniel, I agree with Alexei. I think maybe in the short term we just are very, very explicit about what
scheduler was loaded, we tell you exactly when it was loaded, whatever, and if it becomes a problem we can tell distros: look, it should be very clear that if you have a task that was never run and a sched_ext scheduler was loaded, maybe there's an issue, maybe not. Our policy could be: we'll give you all the information about it, but we officially don't think that this is a kernel bug. If we did crash the kernel, it's a BPF bug and we'll fix it, but otherwise you should be able to load a BPF scheduler and it doesn't taint anything, even if it's out of tree.

I mean, I'm not on the distro side, never have been, but that would be my take. You're going to get a crash dump. Either it's going to be somewhere completely outside of scheduling that has nothing to do with scheduling, and you'll say, I don't really care whether there was a thing attached or not, or you're going to look at it and it'll say, hey look, there's a backtrace through the scheduler, or something didn't get scheduled: run it again without your scheduler, or take a hike, we're not going to fix your scheduler for you.

That's true. The only thing is you could cause a crash somewhere else even if it wasn't in the scheduler code, right? If you deadlock or something like that. But yeah, I agree.

And also, I think we've heard this argument before in the case of networking with XDP, right? And it just hasn't been a problem in practice that it's a support nightmare and so on. I would love to hear now, after so many years, what the take is; it seems to be okay, I don't know.

Yeah, and distros advertise their ecosystem for it; now it's a feature for them, people pay money for it. So I think the takeaway is we just stand behind it and we stand our ground on the policy. Cool.

And the other thing, maybe related: once you upgrade your distro, maybe you run into some performance regressions, but given that you have loaded your own custom BPF scheduler, it's on you to figure out what's going on in that case, right?

Right, and that goes back to the first bullet point. So I think that if you're an out-of-tree scheduler, you get no guarantees other than that you shouldn't crash the kernel; you're just a BPF program. Build breakages are your fault, performance regressions are on you. And if it's upstream, you get both performance and build protection.

Yeah, and in the case of networking, the same stands. Yeah.

Okay, awesome. So this is the last one, which is always a really fun discussion. I have had an extremely difficult time discussing this with people, and here's why. People's rationale is that they're worried BPF is going to force UAPI into the scheduler. Now, the first thing I say to people when they say this is, don't worry: I know that a lot of BPF is in UAPI headers, and you could have UAPI issues if you use the wrong type of BPF program, but struct_ops is not UAPI at all. You can change struct_ops; it's only exposed in internal kernel headers, and we don't have to worry about it at all. And Linus had talked about this a little bit at the kernel maintainers summit; Steven yesterday pointed out that it was actually directed at him. But the general
perspective of folks, if I had to paraphrase and represent what they say, is that they're worried that today BPF is not a UAPI issue because Linus hasn't seen a problem, but eventually it could be, and they're basically not willing to accept anything else. So I don't really know: is there anything we can do as a community to take a strong outward stance that could potentially put their minds at ease? Could we say that we'll never violate UAPI? I mean, we can't tell Linus to do anything, but if we had this as an official policy, do folks think that would help? This is really directed more at people that have more upstream experience, and maybe more of a relationship with Linus, or anyone for whom this has ever been an issue when trying to add a BPF feature. I think you know what I'm getting at: how do we really drive home and be clear about the expectations around this very open-ended, hand-wavy question?

And to give another example: I explained this to somebody and they came back and said, well, BPF is a kernel-to-kernel program; the kernel is calling out to a kernel program, they're in CPL 0 on x86, pretty crystal clear. But then on the kernel side, in BPF, you have a map shared with user space. You could imagine a protocol where the BPF program publishes messages to user space, which is essentially a protocol to inform a user space scheduler where everything lives in user space, which is like ghOSt's way of implementing a scheduler. The kernel part, yeah, there's no UAPI there. But in practical terms, if you ever did regress something and some big user space scheduler, out of tree or even upstream, I don't know, but this big user space scheduler regresses, well, now it's a problem; don't you have to revert? Have we ever really thought through the implications of how to prevent that sort of scenario? Hopefully that is somewhat clear as far as an ask. The answer might be that we just stand by what we think is going to be the case: we say, look, there are no UAPI guarantees at all, it's unstable, we document it just like we did for kfuncs, and I guess we can't force people to believe something, so maybe the move for us is: it's documented, with good intentions, and we hope that's how it turns out.

I think that's probably the minimum we should do, yeah.

So if there's anything above that, that's what I would love to hear about, because it feels like a very difficult thing to approach logically. Something to think about; we can take it offline.

Okay, so I'll be done by five. I just want to go through a few other slides, some changes that we had to add to BPF to support all this stuff. One of them is a new kptr called bpf_cpumask. It's a wrapper around the internal kernel cpumask_t. You can create them, treat them like normal kptrs, store them in maps, and interact with them like normal cpumasks. One of the nice features we added is that you can actually compare a bpf_cpumask directly to a cpumask_t: we figured out that they're type-identical according to the C standard, so you can compare, say, a read-only internal kernel cpumask with a BPF one, which is kind of nice.
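Here's a hedged sketch of how that bpf_cpumask kptr API might be used, written from the description above; the kfunc names and signatures (bpf_cpumask_create(), bpf_cpumask_set_cpu(), bpf_cpumask_intersects(), bpf_cpumask_release()) and the __kptr map-stashing pattern are my reconstruction and may not match a given kernel version exactly, and the "preferred cores" idea is just a hypothetical illustration.

```c
/* Hedged sketch of the bpf_cpumask kptr API; details are approximate. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* kfunc declarations would normally come from a shared header. */
struct bpf_cpumask *bpf_cpumask_create(void) __ksym;
void bpf_cpumask_release(struct bpf_cpumask *cpumask) __ksym;
void bpf_cpumask_set_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym;
bool bpf_cpumask_intersects(const struct cpumask *src1, const struct cpumask *src2) __ksym;

struct mask_value {
	struct bpf_cpumask __kptr *mask;	/* cpumask stored as a kptr in a map */
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, struct mask_value);
} preferred SEC(".maps");

/* Build a hypothetical "preferred cores" mask, check whether the task is even
 * allowed to run there by comparing directly against the kernel's read-only
 * p->cpus_ptr, and stash the mask in the map for later use. */
static int setup_preferred_mask(struct task_struct *p)
{
	struct bpf_cpumask *mask, *old;
	struct mask_value *v;
	u32 key = 0;

	mask = bpf_cpumask_create();
	if (!mask)
		return -1;

	bpf_cpumask_set_cpu(0, mask);
	bpf_cpumask_set_cpu(1, mask);

	/* bpf_cpumask is type-compatible with cpumask_t, so it can be compared
	 * directly with the task's affinity mask. */
	if (!bpf_cpumask_intersects((const struct cpumask *)mask, p->cpus_ptr)) {
		bpf_cpumask_release(mask);
		return -1;
	}

	v = bpf_map_lookup_elem(&preferred, &key);
	if (!v) {
		bpf_cpumask_release(mask);
		return -1;
	}

	/* Stash the cpumask in the map; release whatever was there before. */
	old = bpf_kptr_xchg(&v->mask, mask);
	if (old)
		bpf_cpumask_release(old);
	return 0;
}
```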
We use that in some of the schedulers we added in the patches we sent upstream.

This one is pretty interesting; we actually talked about it during Alexei's presentation as one of the big changes that we made. As a refresher for people that aren't aware, kptrs are a type of object in BPF programs where you can safely store an internal kernel object in a BPF map and know that it'll get reaped when the map goes away; you can essentially store internal kernel objects safely. Originally, to get a new refcount on an object that was in a map, you had to use the kptr_get API, which verified that you were passing a map value, and you had to have internal synchronization where you would go under an RCU lock and do all those kinds of things. We realized that what almost every object in the kernel actually cares about is whether it's RCU safe or not, so instead we updated the verifier to leverage RCU safety to know whether it can actually trust a kptr, so that you can pass it to kfuncs and so on safely. So on the left here, you use kptr_get, which does an atomic acquire when successful, and then you can call the bpf_cgroup_ancestor() kfunc. On the right side, you enter an RCU read region, you read the map value, and if it's present you can use it like any other trusted pointer, because you know it'll be valid until the read region ends. Pretty nifty. One implication is that with kptr_get, you could synchronize however you wanted inside the kfunc: if you had a spinlock, whatever, it was just an implementation detail of the kfunc. Now that the BPF program itself contains the synchronization, if we ever wanted to add a new method for synchronizing kptrs, we would have to implement it in BPF. Practically speaking, I don't think that's very likely, and at least it hasn't been a problem yet, but it's something to keep in mind for the future.

Another one: Tejun wrote a scheduler that recursively walks cgroups and flattens the hierarchy, so that you don't have to do a recursive walk to implement the CPU controller, for deciding which cgroup you should pull a task from and how long it should run. We don't use the CPU controller at Meta because it's too slow for us, but using this approach, which is part of the upstream patches, and using Dave's new map types, we were able to implement a version of CPU control that's slightly less precise than recursively walking every time you do load balancing, but is good enough for us and gives pretty good performance. So that's another nifty thing, local kptr stashing; I'm just reading the slide here. And then Dave had also talked about the bpf_obj_new() APIs, so you can store those in maps too; it's a very flexible framework now for writing code that really looks like kernel code. And this is another one using the rbtree APIs. And that's it; we did a lot of other stuff, but that's probably enough for now.
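Going back to the kptr change described above, here's a rough sketch of the newer, RCU-based access pattern (the "right side" of that slide). The map layout, the stashed cgroup kptr, and the kfunc declarations (bpf_rcu_read_lock()/bpf_rcu_read_unlock(), bpf_cgroup_ancestor(), bpf_cgroup_release()) are my reconstruction from the description; exact verifier rules and kfunc signatures may differ by kernel version, so treat this as illustrative only.

```c
/* Approximate sketch of RCU-protected kptr access, as described above. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;
struct cgroup *bpf_cgroup_ancestor(struct cgroup *cgrp, int level) __ksym;
void bpf_cgroup_release(struct cgroup *cgrp) __ksym;

struct cgrp_value {
	struct cgroup __kptr *cgrp;	/* stashed by some other program path */
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, struct cgrp_value);
} cgrp_map SEC(".maps");

static int use_stashed_cgroup(void)
{
	struct cgrp_value *v;
	struct cgroup *cgrp, *ancestor;
	u32 key = 0;

	v = bpf_map_lookup_elem(&cgrp_map, &key);
	if (!v)
		return -1;

	bpf_rcu_read_lock();
	cgrp = v->cgrp;
	if (!cgrp) {
		bpf_rcu_read_unlock();
		return -1;
	}

	/* The kptr is trusted for the duration of the RCU read region, so it
	 * can be passed to kfuncs without a kptr_get()-style acquire first. */
	ancestor = bpf_cgroup_ancestor(cgrp, 1);
	if (ancestor)
		bpf_cgroup_release(ancestor);

	bpf_rcu_read_unlock();
	return 0;
}
```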
I had a question about the earlier point you made about upstreaming schedulers. That would be for generic ones that apply in many cases, that are useful to many people, right?

Yeah, I don't think it should be a dumping ground; I think it should be held to a pretty high bar. There are a couple of directions we could go, and no matter what, it should be high quality. We could either have these schedulers be general for a large class of workloads, like this is a server scheduler and this is, I don't know, a gaming scheduler, where frame rates are what's optimized for, or even more generally a soft real-time scheduler might be more appropriate for games. We could go in that direction, where we make the Linux scheduler ecosystem more like file systems, where you choose which one you want and there's no expectation that we're always going to converge on one file system. Or we could do the second thing, which is we add these upstream schedulers, people use them, see if they like them, see if they work, and eventually, if they're proven to really work well and we get the APIs and the feature right, we upstream that into CFS and then remove the scheduler, and you can just use CFS. I actually think, personally, that hardware is getting so complex, and the needs for performance are so aggressive, we're trying to eke every single instruction we can out of the CPU, that the former of the two is more likely to be a clean ecosystem that's extensible and really kind of self-describing. But the reality of the scheduler community, and I understand very much where they're coming from, is that I think they would be more open to the idea if it's supposed to be a funnel into CFS.

How complete is this overall architecture, I guess? Can I implement CFS on it, for example?

You can, yes. Okay, so CFS is enormous, obviously; there are a million heuristics. You can implement a weighted vtime fair scheduler pretty easily, actually, and you can add a lot of heuristics and such as well. It's stable enough that we're rolling it out to prod at Meta, and we think it's feature complete for data centers. But we thought that when we went to LPC last year, and then Josh Don and other people from Google pointed out a lot of huge feature gaps, which we then implemented and which were super useful, like the tickless scheduling as an example. So it's definitely featureful and stable enough for you to implement a normal scheduler, like a vtime scheduler, like EEVDF, the new scheduler Peter proposed recently. But if there's anything we haven't added that would be useful, we could do that too.

I was sort of wondering about your example earlier, where you have a scheduler where part of it is being done in user space, and you've got the shape of the map, or whatever the coordination mechanisms are.

Just to be clear, sorry, do you mean the one where we did the load balancing in user space, or the one where the whole thing is in user space and you're publishing messages?

I'm thinking about this from the UAPI concern kind of angle.

Yeah, sure.

Would it make sense to say the line is: either you're upstreaming the entire user space program plus the program in the kernel, or none of it goes into the kernel? Because as soon as you split off the user space side, it kind of says, well, the distribution is separate, and now you kind
of have some sort of a contract that says at least this version of the BPF program requires this sort of interaction with user space.

Yeah, that's a really interesting point. I mean, everything else aside, UAPI aside, it would be a good idea to upstream the user space part of that too, right? My worry is that this is still kernel space and user space, so somebody could implement a different user space scheduling framework on top of the kernel BPF program, because the kernel layer is basically a messaging layer at that point: it's propagating synchronous scheduling events into an asynchronous ring buffer or something like that, and then there's a messaging layer where you map that to callbacks in user space, and you could do that in multiple ways. I think the real crux of the issue is: is there a line in the sand for BPF programs where, practically speaking, the program is basically just a little conduit between the kernel and user space? It's not a syscall, and if you look at it purely, I don't really see any difference from a module that has some interaction with user space as well. But to your point, if we upstreamed that user space program, absolutely. You could easily see somebody else coming along out of tree and saying, well, look, this is a small thing, and I have this big scheduling ecosystem that people use. That's kind of the worry.

I guess one of my thoughts is, the clearer the description of what it is, the better. If the rule is either don't upstream something with user space interaction, or it must be, and will only work with, the one that is upstream, then those are at least simpler rules to reason about and say this is what you can expect, rather than, oh well, it's here, it's code, do whatever you want.

Yeah, that's a really good point. My two cents would be that the more conservative we can be in terms of what we consider valid, so you have to upstream the user space portion, you have to upstream the BPF, and there's a coupling there, they're entirely coupled, I imagine that would be in our best interest, and it's going to give us the fewest surprises. Again, this is why this is a departure from the traditional approach to BPF: these are essentially supposed to be replacements for modules. So I think we have to do whatever we can to minimize the risk of UAPI becoming an issue, and I personally would say we all like to upstream, so for the people in this room it probably doesn't really matter anyway. We should be aggressive about saying anything we can to minimize the risk of UAPI and minimize the risk of people accusing us of putting them at risk. So, does anybody horribly disagree with requiring a user space program used with a BPF scheduler to be upstream in order to be considered protected?

I just have a comment back on the question about CFS. I think CFS actually has a very rich set of features, and each of them is not trivial to implement. I don't know how complete the schedulers implemented at Meta are, but I can think of many features in CFS that are really, really non-trivial to implement.

Can you think of
an example? Sorry, go ahead.

Speaking of load balancing: periodic load balancing, idle balancing, newidle balancing, all of this.

We do have all of those implemented. Yeah, sorry, I didn't mean to interrupt you, continue.

So if you have all of this implemented, the next question is, do you plan to provide this as building blocks for others to build schedulers?

That's a really good question. I think yes, we absolutely should. There's a larger question to answer first in BPF, which is how you implement BPF libraries, essentially. Right now you can implement some stuff in a header, and maybe that's what we do for the short term. But sure, I imagine there's going to be a class of schedulers that want to do this. Shared wakequeue is the feature I was alluding to, the feature we have internally that we're going to upstream soon, where you have a global FIFO that a task goes on when it's waking up, and then when a CPU goes idle, instead of newidle balance it pulls a task off of that queue. Maybe that's something we could give to schedulers, and they could build their own machinery around it, as an example. The goal is to be able to do anything CFS does in BPF, and right now I don't feel comfortable saying we can do anything, because we're still scaling this up and there's probably stuff that's not server related that we can't do as well. And for a lot of the stuff we can do, it's pretty ugly. For the load balancing, Tejun wrote the first version of it with nested BPF calls, and oh man, the heroics we had to go through to keep the verifier from yelling at us about some random thing; it was really, really ugly. So there's an argument to be made in favor of CFS from that perspective as well. But I think it's a good goal: if we're trying to go towards usability and we're also trying to go towards feature richness, we have the capacity to provide an ecosystem for BPF schedulers that should make it much easier to implement not only a simple policy but a complex policy as well. Or, if you forget to drop a refcount on one of 30 branches, some error path somewhere, BPF will catch that for you, that kind of thing. So, to end the rambling: we can do a lot. Everything you mentioned we can do, and have done, in the example schedulers we sent upstream, and if there's anything we missed, any features you would need or that you think we're unlikely to be able to do, because I'm sure there are some, let's take a look at them, put them on the roadmap, and enable them.

Then what's the next step? Do you plan to replicate the features in CFS, or actually focus on implementing new features that CFS can't do right now?

Okay, so this is our plan for now, on our side at least. We're rolling this out internally, so we're going to be using it in prod, and because of the savings we're getting, that's a bit of a priority for us at the moment. In parallel to that, I'm adding the shared wakequeue patches that I keep alluding to;
those were prototyped entirely in sched_ext first; we experimented with a lot of different approaches to load balancing between shared wakequeues and so on, and I'm trying to upstream that right now. Then, more broadly, it's me and Tejun working on this, with help from folks like Dave and Alexei and the BPF community, but we're trying to get other people in the industry to test and experiment with it. We're getting a lot of pushback from people who are afraid of it, or who maybe aren't the biggest fans of BPF, and so the way we think we can get this in, if people do think it's useful, is to rally the kernel community around it. So again, Valve is interested, Arm, AMD, people have said they're going to start experimenting, and building interest like that is sort of our main focus at the moment.

The reason I ask is because I think it's important; it determines the positioning of the scheduler. Do you think of it as maybe a takeover of CFS, or just a complement, something CFS cannot do? I kind of prefer the latter, because there are many useful features that didn't get upstreamed or merged into CFS but are really useful, such as the soft affinity you mentioned. I know there's a patch that was proposed upstream several years ago but eventually didn't get in, and it's super useful for this AMD CCX architecture. So I think if I were you, driving this effort in sched_ext, I would develop features that CFS cannot do right now, because that's something useful, and it helps people by complementing the CFS side.

Absolutely, I completely agree with you, and we're doing that. Shared wakequeue is our first upstream attempt at that, and, I mean, look, it's extremely difficult to upstream things into CFS; it's very complex and the bar for getting things in is quite high. So what we want to do is provide an environment where you can try things out, maybe even merge something and see how it works, see how industry uses it, and change it or throw it away if it doesn't work that well. But the intention is to go places that CFS could go, but where it's just really hard. So that's shared wakequeue. Soft affinity is one that we're going to roll out, yeah. CPU control, absolutely; it's funny, somebody on our team tried to upstream that into CFS for 18 months, and somebody came in and nacked it because there's a corner case where, if you have a whole bunch of low-priority cgroups waking up at the same time, you have a thundering herd problem and they get too much CPU. Okay, so it can't go into CFS, but it was 18 months of effort, right? And Tejun wrote the flattened cgroup scheduler, which is useful for us. I mean, it's Tejun, he's a very, very good engineer, but it took him about a week; that's a delta of, I don't know, 80 weeks, something like that. So yeah, we're trying, and it's just a question of scaling at this point. So there's soft affinity, which we're working on, and then something that came up a lot at OSPM is soft real time, where you have tasks with low latency requirements and the user experience is much better if they can run quickly: VR, mobile, rendering threads,
whatever. But if they don't run, it's fine; nobody's going to die, the plane's not going to crash. For requirements that are hard real time, you can't use sched_ext, you can't use CFS, you have to use deadline or RT or something like that, but we want to be able to implement soft real time as well, which I think is very, very pervasive; Chrome could definitely use it. There was a proposal at OSPM, and I wish Joel were here, he has way more context than me so I don't want to speak for him, but there was a proposal to potentially make Chrome use SCHED_RT, which would mean it has to be root, which obviously is not quite tenable for upstreaming either. So yeah, there are a lot of things that CFS can't do yet, and we believe in upstreaming; we think everybody should benefit, and we want to upstream things to CFS, and for us this feels like the easiest way to do it.

All right, I think we're closing out here. Thank you very much, really exciting work.

Thank you. Yep.