All right. Hi everybody. My name is Joel Fernandes and today I'm going to talk about fixing real-time throttling in the Linux kernel. This is a problem we've had for a long time, and I've seen it so many times that I got sick of it this year and decided: this is what I'm going to do this year, because I don't want to deal with it next year or the year after, hopefully. So this is about our use case at Google, why this matters, and why fixing it is important. Real-time scheduling is what I'm going to talk about. All the information presented here is publicly available and nothing is confidential: Chrome OS and the Linux kernel are both open source projects, so all the code for both is upstream and available.

A little bit about me first. I work on Chrome OS currently; previously I worked on the Android project. I also work on core Linux kernel subsystems like RCU and scheduling, and I'm interested in timers, IRQs, locking — anything in the core of the kernel. That's my focus area. I've been at Google for about seven years now, and more about me is on my website, so feel free to check it out if you want to know more.

All right, we'll start with a little background on the Linux scheduling classes just to set up the context. We have the Completely Fair Scheduler, which implements the SCHED_OTHER scheduling class. It treats all processes more or less equally and allocates CPU time based on the process weight. From the man pages, SCHED_OTHER is the standard Linux time-sharing scheduler, intended for all threads that do not require the special real-time mechanisms. So it tries to treat SCHED_OTHER threads fairly, but it's also a little unfair in the sense that you can change the weights of the different threads and change the amount of CPU time, the time slice, that they're given. You also have SCHED_BATCH, which we won't cover today; it's similar to SCHED_OTHER, with some slight differences that don't matter for this presentation.

Then in the real-time world we have SCHED_RR and SCHED_FIFO. Processes scheduled under one of the real-time policies, SCHED_FIFO or SCHED_RR, have a sched_priority value that goes from 1 to 99, also known as the static priority. The important thing is that the task with the highest static priority will preempt lower-priority ones for as long as needed. The difference between SCHED_RR and SCHED_FIFO is that within a given static priority, SCHED_RR round-robins between the threads, whereas SCHED_FIFO is first in, first out: even if there are other threads at the same priority level, it won't let those run.

And then we have SCHED_DEADLINE, which is higher priority than real-time. Its tasks don't have priorities; instead they have timing properties assigned to them — runtime, deadline, period. The policy is implemented using the Earliest Deadline First (EDF) algorithm in conjunction with CBS, the Constant Bandwidth Server algorithm. The idea is that you want to run the task with the earliest deadline first, but you also want to guarantee each task a certain amount of runtime within a period — you're guaranteeing those properties are satisfied, and that's done by the Constant Bandwidth Server algorithm.
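To make the SCHED_DEADLINE part concrete, here's a minimal, self-contained sketch of how a task asks the kernel for a CBS reservation via sched_setattr(). The 3 ms runtime per 10 ms period is an arbitrary illustration value, and the struct is declared locally because glibc traditionally has no wrapper for this syscall:

```c
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* Declared locally; field order matches the sched_setattr(2) man page. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;      /* used by SCHED_OTHER/BATCH */
	uint32_t sched_priority;  /* used by SCHED_FIFO/RR */
	uint64_t sched_runtime;   /* ns, used by SCHED_DEADLINE */
	uint64_t sched_deadline;  /* ns */
	uint64_t sched_period;    /* ns */
};

int main(void)
{
	struct sched_attr attr = {
		.size         = sizeof(attr),
		.sched_policy = SCHED_DEADLINE,
		/* Ask CBS for 3 ms of runtime every 10 ms period. */
		.sched_runtime  =  3 * 1000 * 1000,
		.sched_deadline = 10 * 1000 * 1000,
		.sched_period   = 10 * 1000 * 1000,
	};

	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr"); /* needs privileges; admission control may refuse */
		return 1;
	}
	printf("now running under SCHED_DEADLINE\n");
	/* ... do periodic work; the kernel enforces the reservation ... */
	return 0;
}
```

If the EDF+CBS admission test fails because the CPU's bandwidth is already fully reserved, the syscall fails with EBUSY.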
So we won't go into too much detail on those, but I wanted to mention them. In summary, it looks like this: you have CFS at the bottom, with the CFS tasks. Then you have the RT scheduling class on top of that — when anything in RT is running, CFS will not get to run at all, period. And on top of that you have deadline, with the same concept: when deadline is running, nothing in RT or CFS will ever get to run.

So that's a little bit of background. Now let's jump into our world, the Chrome OS operating system. Chrome OS is largely based on Chrome — everything is a Chrome process, including the UI and all the UI elements, the menus and so forth. It's a very process-heavy, multi-process architecture. The same applies to how the Chrome browser works: you have these different processes. The browser process is responsible for the window, the omnibox, the menus, that kind of stuff. Then you have the individual tabs showing the rendered HTML content — each of those tabs is a separate process. That's the design, because the process isolation between tabs means they can't interfere with each other: when one crashes, the others keep running. And then you have a third process, called the viz process or the GPU process, which the renderer processes and the browser process send frames to; it puts them all together and shows them on the screen. So what the GPU process — the viz process I just showed you — does is accept compositor frames from the renderer processes and the browser process and perform display compositing, which is essentially combining frames from multiple sources, aggregating them, and showing them on the screen.

Okay, so the main point here is that there's a lot going on. There are a lot of processes, and the scheduler has a very important role in making sure the system performs properly.

To show you what happens on an input event, let's go through the thread flow for a mouse-down event. First you have the hard IRQ and the IRQ thread that deal with the input device; they interact with the evdev thread in the browser process. This happens through the input event subsystem in the Linux kernel — evdev is a framework for input events. That wakes up the main thread in the browser process, saying: hey, there's an input event to process. That thread does IPC to the renderer process, because this whole thing is a click inside the renderer — the browser knows this input event is directed at this renderer. The IO thread is what handles that IPC; it passes the event to the compositor thread, and that passes it to the main thread. The reason this is two steps is that some input events can be handled directly by the compositor, like scrolling or zooming: those don't require the renderer main thread to do anything, because the layer is already rasterized and the compositor thread can directly move it for scrolling and so forth. That saves the main thread work, so the main thread can do other things.
But in this case it decided it had to wake up the main thread, so it did that; then there's a response back to the compositor thread after the main thread handled the event, and that goes back to the browser process. So you can see there are so many threads, so many wakeups, so many hops just to handle a single input event. In this flow I haven't shown the GPU process, because no GPU process was involved when I traced this — nothing was drawn on the screen; I was just clicking inside an empty window and tracing everything. But if something were redrawn on the screen, you'd have the GPU process too, so even more threads. Essentially, when you click and there's a redraw, if there's a scheduling delay in any of these wakeups — if any of these threads doesn't get a chance to run — that shows up as input event delay. You're only as strong as your weakest link, so all of this has to work correctly.

When I traced this down, I looked at the priorities Chrome was setting these threads to, and the IRQ thread was set to RT priority 49 — yes, go ahead — [audience asks about the tooling and the priority value] — yeah, I was using ftrace, and 49 is the kernel's view of the priority; from user space that thread was set to real-time priority 50. And all the other threads were CFS. Chrome knows these threads are high priority, so they were already set to nice -8, which is a very heavyweight CFS weight, and that does help: the threads get more CPU time and so forth. However, this is sort of broken: if the IRQ thread is RT and the rest of the threads are CFS, that doesn't make sense — what was the point of the IRQ thread being RT when the others were lower than RT? And on top of that, I found the main thread was set to nice 0. Again, as I was saying, you're only as strong as your weakest link. So that's a problem.

Can we do better? Can we just set everything to RT? We tried that. We have this 49-participant Google Meet test, where I fire up Google Meet with fake participants, and we ran it on a low-end Chromebook with everything set to RT. What was nice -8 became RT priority 8 — we could just flip the sign in the user space code; this was just for testing. So we set all the important threads in that pipeline I showed you to RT, and in this test we measured the mouse latency: it went down by 32% just by doing this. So clearly there's a benefit.

The problem with using RT for everything is this guy, the main thread. The main thread can run JavaScript — it's doing the main job of parsing the HTML, building data structures, all that kind of stuff. So it can really be busy.
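To make the experiment concrete, here's a hedged sketch of the two configurations being compared — this is not Chrome's actual code, just the equivalent syscalls a thread could make on itself:

```c
#include <sched.h>
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	/* Configuration 1 -- what Chrome was already doing for the
	 * important threads: a heavyweight CFS nice value. */
	if (setpriority(PRIO_PROCESS, 0, -8))
		perror("setpriority");          /* needs CAP_SYS_NICE */

	/* Configuration 2 -- what the experiment flipped them to:
	 * nice -8 became SCHED_FIFO priority 8 (a sign flip, purely
	 * for testing). */
	struct sched_param sp = { .sched_priority = 8 };
	if (sched_setscheduler(0, SCHED_FIFO, &sp))
		perror("sched_setscheduler");   /* needs CAP_SYS_NICE */

	return 0;
}
```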
So basically, as you can see, the main thread is right in the middle of the input pipeline — it's really important. If it gets busy, we can drop input events. And it can run JavaScript, so it can be CPU-bound. The problem is that if the main thread runs for too long as RT, we have a protection mechanism in the kernel called RT throttling that can kick in and shut that thread down. So: can we make RT more friendly for CPU-bound processes?

Let me talk about RT throttling first. The problem is that RT tasks can starve CFS tasks — as I was showing you, we have this hierarchy: CFS, RT, deadline. And several of those CFS tasks can be important threads, like kernel threads; there are many CFS threads on the system that we can't afford to starve, or the whole system will hang. RT throttling is the protection mechanism: by default it lets RT tasks consume 95% of the CPU, and that last 5% is kept around so CFS can still run. Graphically, you have real-time running 95% of the time and a little bit left over for CFS.

So what's the problem with this mechanism? If we reduce the RT runtime — say we make that 95% something like 70% — then the CPU is idle while RT tasks wait; we're essentially throwing away CPU cycles. And if we increase the runtime too much, we're starving CFS tasks that may need to run. So this mechanism doesn't really work very well, and there are many more issues with it that we won't go over in this talk, but you're welcome to reach out later — I gave a previous talk, at an OSPM conference, where I went into the problems in great detail, so you're welcome to review those slides. This mechanism is horrible and really broken.

The result I've seen over the years is that this is typically how a performance engineer ends up working through these problems: first they encounter a performance problem, then they start setting things to RT, and everything works great — the product ships and everyone goes home. Then there's a bug report due to throttling in the field. I've seen this over and over and over. Then people revert everything back to CFS, and repeat. That's what made me sick and tired of it, and that's why I started looking at it this year to do something about it.

Yeah. So, the old RT throttling design. The way RT throttling works is you have an RT run queue that's part of the general run queue, and you can have a hierarchy of RT run queues, because with cgroups you can have run queues within run queues. As soon as the scheduler detects that RT has been running for too long, it takes that RT run queue off the hierarchy. That happens in a piece of code in rt.c, where it calls sched_rt_rq_dequeue(). So then in the pick loop in rt.c, where it tries to pick a runnable RT task, it cannot pick them.
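Those two numbers — 95% of runtime per period — live in a pair of sysctl knobs. Here's a small, runnable sketch that just reads them back and computes the RT share; the paths are the standard procfs ones:

```c
#include <stdio.h>

/* Read one integer sysctl knob; returns -2 on failure. */
static long read_knob(const char *path)
{
	FILE *f = fopen(path, "r");
	long v;

	if (!f)
		return -2;
	if (fscanf(f, "%ld", &v) != 1)
		v = -2;
	fclose(f);
	return v;
}

int main(void)
{
	/* Defaults: RT may run 950000 us of every 1000000 us (95%);
	 * the remaining 5% is reserved so CFS can still make progress.
	 * A runtime of -1 disables throttling entirely. */
	long runtime = read_knob("/proc/sys/kernel/sched_rt_runtime_us");
	long period  = read_knob("/proc/sys/kernel/sched_rt_period_us");

	printf("sched_rt_runtime_us = %ld\n", runtime);
	printf("sched_rt_period_us  = %ld\n", period);
	if (runtime >= 0 && period > 0)
		printf("RT share: %.1f%%\n", 100.0 * runtime / period);
	else if (runtime == -1)
		printf("RT throttling is disabled\n");
	return 0;
}
```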
We cannot pick them — that's how throttling works. We're trying to pick, but the run queue is gone, because throttling took it out as I showed you on the previous slide. For the throttling duration, there's a timer that is set up; when the timer goes off it says, okay, we've punished RT enough, now let's put it back into the run queue. Until then, the pick loop in rt.c cannot pick, because there's nothing to pick.

So my first attempt: I want to run RT tasks when the CPU is idle. The problem is that even when the CPU is idle, because the RT run queue was removed from the hierarchy, there's nothing to pick, so the CPU is left idle. That's the first thing I wanted to fix — on Chromebooks we have fewer CPUs than, say, servers, so it's ridiculous to leave a CPU idle. My idea was: keep the tasks on the run queue, and prefer not to run them if possible, but keep the option to run them. And I still wanted to maintain the fact that throttling has happened — all the accounting for throttling, the rt_time and the timer I mentioned, is left intact. The modification was: when the runtime is exceeded, we do not remove the RT run queue from the hierarchy. And correspondingly, when the timer goes off, we don't have to enqueue it back either, because we never took it off in the first place. Then we modified the pick loop to pick throttled tasks if the CPU would otherwise be idle.

The next thing we wanted was to go beyond idle: even when CFS is running, we can time-share between RT and CFS. That's not ideal, but it would make RT run like CFS once throttling kicked in — we're essentially demoting RT to CFS if it was abusing the CPU. We got that working as well.

I'm just mentioning the simple stuff — there were several corner cases; we broke stuff all over the place and fixed it. One interesting one: because when the timer goes off we don't put anything back on the run queue — there's no enqueue — we had no opportunity to enter the scheduling loop, since the enqueue was what used to trigger that. So when we unthrottle something, we had to manually trigger a reschedule, or it didn't work. We also ended up simplifying code: because on throttling we don't take anything off the queue, we don't have to worry about cases like "when an enqueue happens, was the task throttled or not?" — in fact we shouldn't worry, otherwise the whole mechanism falls apart — so we ended up deleting code, and those cases went away. Many other cases we won't cover here.

Unfortunately, this could not be upstreamed. We went to Italy and talked about it, and it's a long story. So let me go over why not. The upstream community does not want RT throttling fixes — that's what I took away. They want us to instead handle starvation of CFS tasks by boosting them to SCHED_DEADLINE. The good news is that it might just work, with lower complexity. And once we get that working, we can delete RT throttling completely.
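To illustrate the behavior change, here's a toy model — made-up names, not the actual rt.c code — of what the pick decision looks like before and after the first modification (the later step went further and time-shared with CFS):

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the behavior change -- not kernel code. */
struct toy_rt_rq {
	bool throttled;   /* runtime exceeded; timer not yet fired */
	int  nr_running;  /* queued RT tasks */
};

/* Old behavior: a throttled rt_rq was dequeued from the hierarchy,
 * so the pick loop found nothing, even on an otherwise idle CPU. */
static bool old_pick(struct toy_rt_rq *rt, bool cfs_has_work)
{
	(void)cfs_has_work;
	return !rt->throttled && rt->nr_running > 0;
}

/* Modified behavior: throttled tasks stay queued, and we still pick
 * them when the CPU would otherwise go idle. */
static bool new_pick(struct toy_rt_rq *rt, bool cfs_has_work)
{
	if (rt->nr_running == 0)
		return false;
	if (!rt->throttled)
		return true;
	return !cfs_has_work;  /* run throttled RT instead of idling */
}

int main(void)
{
	struct toy_rt_rq rt = { .throttled = true, .nr_running = 1 };

	printf("idle CPU, throttled RT:     old=%d new=%d\n",
	       old_pick(&rt, false), new_pick(&rt, false));
	printf("CFS runnable, throttled RT: old=%d new=%d\n",
	       old_pick(&rt, true), new_pick(&rt, true));
	return 0;
}
```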
And people have already written patches over the years, but it needs more work, and we're looking at it. How much time do we have? We have a lot of time — so now I'll go through that mechanism and how it can address RT throttling. These patches were developed by Peter Zijlstra and improved by Juri Lelli and other members of the scheduler community. Just to recap the hierarchy of scheduling: SCHED_OTHER, then SCHED_RR/SCHED_FIFO, then SCHED_DEADLINE.

The main idea of these patches is to guarantee a deadline reservation for the CFS tasks. Basically, we'd like to guarantee that CFS gets a runtime R every period P — in other words, it's guaranteed a certain percentage of the CPU. So even if RT takes up a lot of bandwidth, we're not starving the CFS tasks. It consists of a fake SCHED_DEADLINE task, added to struct rq, called the fair server: essentially it's a deadline server — not a real deadline task, just a container for CFS, and that container is deadline class. This fake deadline task is given one clock tick every 20 clock ticks, so about 5%. That should really be configurable, so we should change it to be tunable.

I'll go over some of the concepts in these patches. One thing is that to make this work, the patches have to modify the CFS wakeup path: the first CFS task that wakes up has to start the deadline server — essentially it starts this fake deadline task. [Audience: won't this introduce delays?] It's only done for the first task that wakes up, and in the enqueue path we already hold the run queue lock, so in my opinion it shouldn't matter. But yeah, we have to test it; we're still working on that.

[Audience: if you have, say, five CFS tasks and the deadline server kicks in, how many do you run? And while you're scheduling those CFS tasks, if an RT task is ready to run, when does that become the priority again?] That's a good question. [Audience: there's a related one — with the old throttling, CFS's share doesn't kick in until RT hits its max, so it's 5% at the end of the period. With a deadline server, isn't it going to be 5% at the beginning?] That is exactly a valid problem, and it's addressed as well — the sixth patch modifies that. The thing with deadline is that it's all about guaranteeing bandwidth, so we don't actually need to run the server until we really need to. That's handled — but you're absolutely right, because otherwise RT could get interrupted quite a lot. I believe I cover that later in the slides.

Okay. So the first CFS task that wakes up starts the deadline server, the fake deadline task. And in the sleep path we have to stop the deadline server at some point; we do that when the last CFS task stops being runnable. We also need some extra functions so that deadline can ask CFS: hey, do you have anything to run? And another so that deadline can ask CFS: I'm running now, can you pick something for me?
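Here's a toy model of that lifecycle — hypothetical names, not the actual fair-server patch code — showing the start on first wakeup, the stop on last sleep, and the pick delegation:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the fair/deadline server idea -- hypothetical names. */
struct toy_rq {
	int  cfs_nr_running;
	bool dl_server_active;  /* the fake SCHED_DEADLINE entity */
};

/* First CFS wakeup arms the server: CFS as a whole now holds a CBS
 * reservation (e.g. 1 tick in 20, about 5%). */
static void enqueue_cfs(struct toy_rq *rq)
{
	if (rq->cfs_nr_running++ == 0) {
		rq->dl_server_active = true;
		printf("dl server started\n");
	}
}

/* Last CFS task going to sleep disarms it -- nothing left to protect. */
static void dequeue_cfs(struct toy_rq *rq)
{
	if (--rq->cfs_nr_running == 0) {
		rq->dl_server_active = false;
		printf("dl server stopped\n");
	}
}

/* In the deadline pick path: if the picked entity is the fake server,
 * delegate to CFS to choose the actual task to run on its budget. */
static const char *pick_task(struct toy_rq *rq, bool picked_is_server)
{
	if (picked_is_server && rq->dl_server_active)
		return "a CFS task, picked on the server's budget";
	return "a real SCHED_DEADLINE task";
}

int main(void)
{
	struct toy_rq rq = { 0 };

	enqueue_cfs(&rq);
	printf("pick -> %s\n", pick_task(&rq, true));
	dequeue_cfs(&rq);
	return 0;
}
```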
So there are these functions that have to be provided to bridge deadline and CFS. And this is the deadline pick loop: when deadline is picking tasks, it checks whether the deadline entity it's picking is the fake deadline task. If it is, it calls that CFS function I just showed you to do the pick. And then, back to the earlier question: Juri made further changes to not start the DL server until later, because all we have to do is guarantee that CFS gets to run — we don't have to run it immediately. The code for that looks something like this: he starts a thing called the watchdog, and that delays the starting of the server. It was exactly for that problem.
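The transcript only gestures at that code, but a toy model of the delayed-start idea — again with hypothetical names, not the actual patch — might look like this. The insight is that a CBS reservation only has to be met within the period, so the server need not run the moment CFS wakes up:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the delayed start -- hypothetical names. */
struct toy_server {
	bool watchdog_armed;  /* cheap timer armed at first CFS wakeup */
	bool server_started;  /* the actual fake SCHED_DEADLINE entity */
};

static void cfs_woke_up(struct toy_server *s)
{
	/* Don't start the server yet; just arm the watchdog. */
	s->watchdog_armed = true;
}

static void watchdog_fired(struct toy_server *s, bool cfs_ran_meanwhile)
{
	s->watchdog_armed = false;
	/* If CFS got CPU time on its own (RT wasn't hogging the CPU),
	 * the server was never needed. Otherwise start it now, so the
	 * reservation is still met within the period -- and RT wasn't
	 * interrupted at the start of every period. */
	if (!cfs_ran_meanwhile)
		s->server_started = true;
}

int main(void)
{
	struct toy_server s = { 0 };

	cfs_woke_up(&s);
	watchdog_fired(&s, false);
	printf("server started: %d\n", s.server_started);
	return 0;
}
```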
[Audience: so how does this start? If RT is not running, we want CFS to run without this involved — or does the server run whenever CFS is running, regardless of whether anything in RT is running?] Right — if RT is not running, it just falls back to normal CFS; this only matters if RT tasks are queued on the same CPU. You raise a good point, though: it wouldn't make sense otherwise, because if RT is not running, you don't want this overhead. So the condition is presumably at enqueue: CFS gets enqueued while RT is running, so it can't run, and that's when we say, it can't run because RT is running, let's start the deadline server for it. I think it has to be that way. [Audience: walk me through it — the watchdog timer starts, and there's no deadline server until the watchdog goes off, right? But this shouldn't even execute if there's no RT running.] Well, you don't know whether RT will suddenly start running. [Audience: then if RT starts running, that's the moment it could trigger the server for CFS.] Yeah — I think this came up in one of the discussions as well: can we avoid even starting the watchdog until we really need it? That's a good point, because you might have a situation where no RT runs at all for a long time, so why are we doing this at all? And just to clarify, I started looking at these new patches a week ago, so I'm still catching up as well — but I'm happy to discuss.

[Audience: one question I have — once the fake DL task gets on the queue, does it ever get taken off? Does it stay there forever?] So, I think I discovered that it's here: it's taken off when the last CFS task goes to sleep, because we don't need the DL server anymore. [Audience: so if you have four CFS tasks being served, until the last one goes to sleep you keep running them?] Yeah — nr_running will be four, so the server won't stop; it goes from four to three, three to two, and so on, and finally when it hits zero we stop it. [Audience: as soon as a task finishes its time slice, does the fake task drop off the list?] There are two things there. The time slice is a separate concept — time is shared while all of them are runnable — and then there's runnable versus sleeping. And maybe I should clear up one confusing term: "runnable" just means that something wants to run; it doesn't mean it's actually running. We have several terms for running — there's actually executing on the CPU, and there's "I just want to run" — and this is the latter. So stopping the server means everyone else has said: I'm done running, or I've exited, or I'm going to sleep. And I think there are other paths, like migration: if something runnable is moved to another CPU, this triggers as well, so in that case the server also stops. And from my experience working on scheduler features, I'm 100% sure this mechanism is going to have problems somewhere that we don't know about yet — even after we're 100% sure there are no problems.

All right. So that's actually all I had — that last code path is the one where, instead of starting the server immediately, we delay it until we really need it. So stay tuned: we'll improve these patches and report back later. Hopefully next year we're talking about a different issue — or the same issue, but solved. We'll see how that goes. At least the scheduler community is on board with this way of solving it, which is why I'm optimistic we can get it done quickly — we have really good people on board working on this with us. So, thanks — and we can open it up for any other questions as well.

[Audience: the workload you showed, with the multiple threads — that's a combined workload, and one thing that's poor planning on the workload side is not looking for dependencies. Yes, you do need to solve the problem of RT throttling and DL. But the other problem I see in this picture is the workload analysis: if you look at the workload and the flow and see who is impeding whom, you would have upped the priority of the main thread, wouldn't you? You would have recognized it's in the critical path.] Yeah, so the thing is, we could set the main thread priority to what the others were, but that won't really solve the whole problem — that's one part of it. [Audience: but I'm saying that's also a problem when you look at the workload.] [Second speaker:] So that's sort of out of scope of what Joel's talked about right now.
But one other little thing that Joel and I have actually talked about before is having some way of doing wake-up priority inheritance. The idea, for call chains like this, is that we could keep the main thread at a low priority until the compositor thread, at a higher priority, says "wake up" — and then that wakeup boosts it. We still have to figure out a mechanism for doing that, but that's how you'd handle a call chain like this without the main thread always running at high priority when it's not needed. It's on-demand priority boosting, basically. But that's going to be another talk. Yeah. Anyone else? Cool. All right. Thank you.