Today I'm going to give an overview of scheduling in the FreeBSD kernel. We're going to look at the scheduler in the FreeBSD kernel and dive down into how it works, in particular some of the more recent changes that have been made to work with the ever larger number of cores on the systems we're running on. Scheduling is how we decide when and where and for how long to run all the threads that are in the system. There are threads in the operating system itself, and there are threads in the applications that are running; a particular process may have a single thread or a whole set of threads running within it. So when it comes to scheduling, we are looking at all of the threads I've just described. These threads get divided into five different classes. We have the ithread class, which is everything running in the bottom half of the kernel. For the most part, those are all the interrupts that are coming in, all the asynchronous activities: disk interrupts, network interrupts, timer interrupts, anything that is delivering interrupts and requires a thread to run in order to process them. We have the kern threads, which are all of the threads running in the top half of the kernel. The top half of the kernel is all the synchronous things the kernel is doing; for the most part it is driven by system calls coming in from the applications. So you do a read or a write or an open or a close, that system call comes into the kernel, and while it's running in the top half of the kernel it will get one of these kern-level priorities. Those tend to be higher than the priorities of processes running in user space, because we want to get them through the kernel and back out of the kernel so they're not holding locks that can potentially compete with other threads trying to get access to the same data structures. Between these two we have the real-time priorities, which are for user processes that are real-time processes; on the next slide we'll dig a little deeper into how those priorities get set. User processes, for the most part, run in the timeshare range, and this is the set of priorities we mostly focus on when we're talking about schedulers, because it's the set of priorities that the kernel is devising for each of the processes running in the system. And then finally, at the very lowest set of priorities, we have what are called the idle priorities. These are for very much background tasks: something like a screensaver or some other activity that should not run if there's anything else that wants to go on within the system. The numeric priorities run from 0 to 255, and higher values of priority imply lower levels of service; that is to say, 0 is the highest priority and 255 is the lowest. The ithread and kern classes are managed by the kernel. The real-time priorities and the idle priorities are managed by user processes: the user processes just set them, and the kernel works with whatever they're set to. And then finally we have the timeshare class, where management of the priorities is shared between the kernel and user processes. User processes can influence these; for example, you can raise or lower your nice value, which will bias them up or down.
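To make the class layout concrete, here is a minimal sketch in C. The numeric boundaries follow the ranges documented for FreeBSD (the exact values are not given in this talk and have shifted somewhat between releases), and the function name is mine, not the kernel's:

```c
#include <stdio.h>

/*
 * Classify a numeric priority into the five classes described above
 * (0 is best, 255 is worst).  The boundaries below follow the ranges
 * documented for FreeBSD; they are illustrative and have shifted
 * somewhat between releases.
 */
static const char *
priority_class(int pri)
{
	if (pri <= 47)
		return ("ithread");	/* bottom half: interrupt threads */
	if (pri <= 79)
		return ("kern");	/* top half: system calls in progress */
	if (pri <= 119)
		return ("real-time");	/* set by the process, kernel obeys */
	if (pri <= 223)
		return ("timeshare");	/* ordinary user threads */
	return ("idle");		/* background-only work */
}

int
main(void)
{
	int samples[] = { 20, 60, 100, 150, 240 };

	for (int i = 0; i < 5; i++)
		printf("%3d -> %s\n", samples[i], priority_class(samples[i]));
	return (0);
}
```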
The kernel, though, is actually selecting what that base-level priority ought to be. Let's look at each of these different types of priorities in turn. For the kernel priorities, the kernel is pretty much in charge, though the system administrator can fiddle with some of them if they want to. Typically, among the interrupt threads, the highest priority is given to things that need the lowest latency and that also have high frequency. So things like network packets tend to get a high priority, whereas things like terminals, if you have a serial line coming in, tend to get a low priority. In the old days, though, we used to use serial lines to actually do networking, with a thing called Serial Line IP, and in that case characters were coming in much faster than a user would typically type them, so it was important that the line be artificially given a higher priority so that you wouldn't lose characters. But for the most part we don't make any changes to the kernel priorities. That really gives us three areas we can work with. The first of these is real time. Here the processes are setting the specific priority and the kernel doesn't second-guess them: it doesn't raise or lower or otherwise change them, it just runs with what is set. The effect of this is that if you set a real-time priority and that process then goes into an infinite loop and never gives up the CPU, the system will effectively look like it's locked up, because the only things the kernel will be able to run are the interrupt threads. None of the top half of the kernel will ever run, and none of the user level will ever run, so if you have a shell and you're trying to fix something, you're out of luck. You have to be very careful when using real-time priorities that you don't let yourself get into an infinite loop, and that you do give up the processor periodically so other things can happen. Listed next are the schedulers that are available. The primary scheduler we use today is the interactive scheduler, the so-called ULE, and we still have the traditional scheduler, the so-called 4BSD scheduler. The 4BSD scheduler was actually written in 1978 by Bill Joy and myself as a stop-gap measure until we had time to do something right, and it lasted well into the 2000s. ULE first came in with FreeBSD version 5. It was originally supposed to become the default scheduler, but due to various problems it had, it wasn't until FreeBSD 8 that it could be made the default scheduler for the system. 4BSD is still there. It's a very simple scheduler; it doesn't have a lot of the abilities that ULE has, as you'll see. But it's still useful for certain embedded applications, where you just have a small processor and a small number of processes running, and you don't need all the bells and whistles of ULE, just something simple and fast. So it's still there, but I'm not going to talk much about it. In this lecture I'm going to focus on ULE, because it is the default scheduler and it's the one that has the capabilities we need on modern multi-core systems. ULE, as we will see, deals with things like processor affinity.
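Before going further into ULE: since real-time priorities are set entirely by the process, as described above, here is a short userland sketch using FreeBSD's rtprio(2). The priority value 10 is an arbitrary choice within the real-time range, and the periodic sched_yield() is there precisely because of the infinite-loop hazard just mentioned:

```c
#include <sys/types.h>
#include <sys/rtprio.h>

#include <err.h>
#include <sched.h>

int
main(void)
{
	struct rtprio rtp;

	rtp.type = RTP_PRIO_REALTIME;	/* real-time class */
	rtp.prio = 10;			/* arbitrary value in the rt range */

	/* pid 0 means the calling process; usually needs root privilege. */
	if (rtprio(RTP_SET, 0, &rtp) == -1)
		err(1, "rtprio");

	for (;;) {
		/* ... a bounded chunk of time-critical work ... */

		/*
		 * Yield periodically: a real-time thread that never
		 * gives up the CPU can make the whole system appear
		 * hung, as described above.
		 */
		sched_yield();
	}
}
```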
ULE also actively calculates an interactivity score so that it can figure out which processes are batch, the long-running computing processes, and which ones users are interacting with, so that it can give better priority, and thus better response time, to those interactive processes. Finally, the idle scheduler: much like the real-time scheduler, the priorities are simply set by the system administrator. The kernel doesn't mess around with them; it simply follows them. But unlike real time, if one of these goes into an infinite loop, it's not a problem, because anything else in the system that wants to run will preempt it. So the idle priorities really are not going to get you into any trouble. Now, before we start diving down into the details, I want to talk about how scheduling is really divided into two different parts: what I call the low-level scheduler and the high-level scheduler. The low-level scheduler is the scheduling that runs when we do a context switch. On a modern processor that's busy, context switching is something that occurs tens of thousands or even hundreds of thousands of times a second, so we have a very short time to figure out what we're going to run next. If the current process blocks and says, I don't need to run now, do something else, we do not want to spend a lot of time figuring out what the next process ought to be. So the low-level scheduler, as we'll see, is just a set of priority queues, and it just finds the highest-priority thing to run and runs it. As we'll see, that's a very simple process: find a queue that's not empty, pick the first thing off the list, run it. The high-level scheduler is the part that figures out where those different things should be, what their priorities ought to be. Of course, in the case of real time, those decisions are being made out in user application land: the user application figures out what the priorities of each of its threads ought to be, presumably reevaluates that periodically, and can change those priorities. For ULE or 4BSD, these decisions are made inside the kernel itself, and as we will see, this is done much less frequently. 4BSD, for example, once a second would run through all the processes in the system and recalculate what their priorities ought to be based on what they'd been doing recently. ULE goes a step beyond that: ULE never wants to have to look at every process to make decisions about priorities. So, as you'll see, it tracks within each process the information it needs to decide whether its priority should be raised or lowered, and whether it should be considered interactive or batch, and thus figures out that priority in a way that doesn't require looking at everything all at once. And again, for the idle scheduler, much like the real-time scheduler, the high-level decisions are made in the user-level application, which can periodically decide whether it wants to change the priorities it is setting. The kernel only deals with it at the low level, that is, where the processes or threads are in the queues and where they ought to be run. This gets us to the low-level part of the scheduling. The 4BSD scheduler just has a single global set of run queues organized from highest to lowest priority.
It was designed in a day when we didn't have multiprocessors, so just having a single queue for all of the processes made a certain amount of sense; even when you have a small amount of multiprocessing, you don't have too much contention for the lock you need in order to go look at that queue. As you'll see with ULE, that's no longer true. So 4BSD just had the one set of processes, and it would scan the list and run the highest priority. The ULE scheduler actually uses three sets of run queues, and there's a set of run queues for each CPU. Each core in the system has its own set of run queues, so when a processor stops running one thing and wants to select another, it doesn't look across all the queues in the system; it just looks at its own. So in the case of ULE, each CPU has three queues from which to work. First, we have the real-time queue, which has on it all of the kernel threads, both top and bottom half, and all of the real-time processes, along with those timeshare threads that are classified as interactive. This is organized as a set of priorities from 0 to 171, and this queue is the first one the scheduler will look at, as we'll see later: as long as there's anything in this queue, that's what it will pick to run. When this one is empty, it moves on to what's called the batch queue, which has all of the timeshare threads that are classified as batch. That one is actually organized as a thing called a calendar queue, and I'll have a slide a little later that explains how calendar queues work. And then finally there's an idle queue, which holds the threads in the idle set of priorities. So at the low level, we check the first queue; if there's nothing there, we check the second; and if there's nothing there either, we go to the third. And as I said, unlike the old scheduler, where we had a single global queue that we had to lock in order to manipulate, here we just have the queues for each processor, so a processor doesn't have to lock its queues, because it's the only CPU that's ever going to be looking at them. The one exception is that we may need to move threads from one CPU to another. This is called load balancing, and we'll talk about how load balancing works in a later slide. Here are our priority-based queues. This priority-based queue is the same as the old 4BSD one, and the same as the first and third of the ULE ones. The priority queues are just organized as an array of queue heads, linked lists. What we do is start at the lowest-numbered one and scan through until we find one that has something on it. So initially we find that priority 95 here has an emacs on it. We take that emacs off the list and run it, and if it goes to sleep, it's just gone until such time as it wakes up. If it uses up its time slice, it comes back and gets put at the end of the queue for priority 95, where in this case it would be the only thing there, so it would essentially keep running until it went to sleep. There would then be nothing until we get down to priority 120, where we have rogue and vi. Rogue runs, and if it uses up its time slice, it goes to the end of that queue. So we would run vi and rogue and vi and rogue until such time as both of them were sleeping. And then we'd get further down and start running xv and Firefox and so on and so forth.
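Here is a minimal sketch of the per-CPU queue structure just described, with illustrative names; the real definitions live in sched_ule.c and look different, and a real queue chains multiple threads per index rather than the single pointer used here:

```c
#include <stddef.h>

#define NPRIQ	256	/* one head per priority level; illustrative */

struct thread;		/* left opaque in this sketch */

/* A priority run queue: an array of list heads indexed by priority. */
struct runq {
	struct thread	*rq_heads[NPRIQ];
};

/* Hypothetical per-CPU scheduling state: because every core has its
 * own instance, it normally needs no lock to consult its own queues. */
struct cpu_queues {
	struct runq	realtime;	/* kernel, real-time, interactive timeshare */
	struct runq	batch;		/* batch timeshare; really a calendar queue */
	struct runq	idle;		/* idle-class threads */
};

/* Scan from the best (lowest-numbered) priority for a runnable thread. */
static struct thread *
runq_first(struct runq *rq)
{
	for (int i = 0; i < NPRIQ; i++)
		if (rq->rq_heads[i] != NULL)
			return (rq->rq_heads[i]);
	return (NULL);
}

/* Selection order described above: real-time, then batch, then idle. */
struct thread *
choose_next(struct cpu_queues *cq)
{
	struct thread *td;

	if ((td = runq_first(&cq->realtime)) != NULL)
		return (td);
	if ((td = runq_first(&cq->batch)) != NULL)
		return (td);
	return (runq_first(&cq->idle));
}
```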
Returning to the example: if all of those threads were then sleeping, so there was nothing on this queue, we would move on to the batch queue, in the case of ULE. Now, we do not want to have to scan through 10 or 20 or 30 or 40 queues to find which one has something on it. If you think about cache lines, that's a whole lot of cache lines we would have to touch in the search, since each one of these queue heads is at least eight bytes. So to speed up finding a non-empty queue, we actually have a bit array, called the status array. A bit is set if there's something in the corresponding queue, and zero if there is nothing in it. Here we can just pick up this single word or two of bits, scan through to find the first bit that's set, and that gives us the index of a queue that has something on it. So we can pick that up, go directly to the queue we need, pull the first thing off it, and run it. For ULE, then, when we get down to the batch level, we switch from the purely priority-based queues to something called a calendar queue. The idea of a calendar queue is to be fairer than we are with the priority queues. With the priority queues, a high-priority thing can essentially starve anything at a lower priority from ever getting to run. Now, if we've done our work right, we aren't going to have anything on those higher-priority queues that runs for very long: if it's interactive, it's supposed to do a little bit and then go to sleep. If it starts misbehaving, in the sense of using a lot of cycles, then what we do is reclassify it as batch, and it ends up down here with the other batch processes. If it then goes back to behaving itself, perhaps it gets to move back to the interactive queue. All right, so the idea of the calendar queue is that we want to be fair and let everything run at least a bit. Things that have higher priority should get to run more than things at a lower priority, but we don't ever want to get to the point where something running continuously at a higher priority blocks all the lower things from making progress. A calendar queue works as a circular queue. The size of the queue is NQ, the number of entries, and we have a pointer, runq, which points to the current entry on which we are operating. Once we get to a given entry, we run everything that's on its list. In this case there's only one thing, the C compiler. It runs, and it might go to sleep, but it will probably just run until we decide it's not supposed to run anymore, that is, until it has used up its time slice. At that point we put it back onto the calendar queue, and where we put it back is a function of its priority. We have a second index, insq, which is the point where we're inserting, and it keeps moving along as well, in a way I'll describe in a moment. When a thread has used up its time slice, we insert it at the location of insq plus an offset based on its priority. The base priority here is 171, or whatever the bottom priority is: if the thread is running at the maximum batch priority, its priority minus that base would be 0, and it would simply go back in right at insq. But let's suppose its priority is 10 above that highest batch priority.
Then what would happen is that it would go back in at insq plus 10, mod the size of the queue of course, so it's going to come much further down the list. That means a bunch of other things get to run before it gets to run again, then it gets put pretty far along the list again, and a bunch of other things get to run, and so on. So things that have a higher batch priority get to run in essentially every slot as it comes along, and things with lower priority only get to run every third time or every eighth time or every twelfth time. But everything is going to get some opportunity: at least once each trip around the queue, it's going to get to run. Now, in some cases we get to an entry that has multiple things on its list. So one of them runs and gets stuck back in somewhere ahead, then the next runs and gets stuck in somewhere ahead, as described above. One might be at a higher priority, so it goes back in just one or two slots ahead of where it came off, whereas troff might be at a lower priority, so it gets stuck in way up ahead somewhere. As for insq, we increment it every 10 milliseconds, so it's just plodding along on its own. If we get stuck with a whole bunch of things to do, insq may actually streak out ahead of runq, and the net effect of that is to push the lower-priority things even further into the future. Generally speaking, though, when runq has done everything in its current entry, and that entry becomes empty, it increments up to the next entry. And after incrementing runq, if insq is equal to runq, then we also increment insq, so insq is always at least one ahead of runq. It may be more, but in most cases it just walks along one slot ahead of runq. If the system gets particularly busy, it may get further ahead, and the idea, again, is to push things further out into the future. Summarizing what I said for ULE about the run priority: if that first set of priority queues, the one holding the real-time threads, has any threads in it, we select the first thread in the highest-priority non-empty queue, as we saw a couple of slides ago. If it is completely empty, then we run the batch-queue threads from the calendar queue, starting with the first thread at the current entry. The time slice we use there is typically on the order of five or ten milliseconds, but if there are a lot of entries in the queue, we divide that slice by the number of entries. We never let it get below two or three milliseconds, though, because otherwise we would just end up with too much churn. And then finally, if there's nothing in the calendar queue, then if there's anything in the idle queue, we run that. And just for accounting purposes, we always have an idle thread, which sits at the absolute lowest priority. It's just there so that we don't have to have a special case in the code for what happens if there's nothing in any of these queues. You may notice when you do a ps that each CPU has an idle thread that has accumulated some huge amount of time; that's just soaking up any time during which there was nothing for that particular core to be doing.
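Pulling the low-level mechanics together, here is a sketch of the status-bit scan from a couple of paragraphs back, the calendar-queue bookkeeping just described, and the slice sizing. All names, the 64-slot calendar size, and the exact slice constants are illustrative, not the kernel's own (the talk itself quotes both two-to-three and five milliseconds as the floor):

```c
#include <strings.h>	/* ffs() */

#define NPRIQ		256		/* priority levels, one bit per queue */
#define NWORDS		(NPRIQ / 32)
#define NCALQ		64		/* calendar slots; illustrative size */

/*
 * Find the highest-priority (lowest-numbered) non-empty queue by
 * scanning the status bit array instead of touching all 256 heads.
 * Returns -1 if every queue is empty.
 */
int
runq_findbit(const unsigned int status[NWORDS])
{
	for (int w = 0; w < NWORDS; w++)
		if (status[w] != 0)
			return (w * 32 + ffs((int)status[w]) - 1);
	return (-1);
}

/* Calendar-queue bookkeeping: runq is the slot being drained; insq is
 * where a thread at the best batch priority would be reinserted. */
struct calq {
	int	runq;
	int	insq;
};

/* A thread that used its slice goes back in at insq plus its priority
 * offset (0 for the best batch priority), modulo the queue size, so
 * lower-priority threads wait more trips around but never starve. */
int
calq_reinsert_slot(const struct calq *cq, int prio_offset)
{
	return ((cq->insq + prio_offset) % NCALQ);
}

/* Every 10 ms, insq plods forward on its own; on a busy system it can
 * streak out ahead of runq, pushing low-priority work further out. */
void
calq_tick(struct calq *cq)
{
	cq->insq = (cq->insq + 1) % NCALQ;
}

/* When the current slot empties, runq advances; insq stays ahead. */
void
calq_advance(struct calq *cq)
{
	cq->runq = (cq->runq + 1) % NCALQ;
	if (cq->insq == cq->runq)
		cq->insq = (cq->insq + 1) % NCALQ;
}

/* Slice sizing with the round numbers from the talk: a 10 ms default,
 * shrunk so everything in a queue runs within about 50 ms, with a
 * floor so we don't churn through context switches. */
int
slice_ms(int nthreads)
{
	int slice = 10;

	if (nthreads > 5)
		slice = 50 / nthreads;
	if (slice < 5)
		slice = 5;
	return (slice);
}
```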
So now let's drill down and look at the driver of this low-level scheduling. How does it actually work? Well, first of all, how do we choose to do a context switch? For the priority-based queues, we run all the threads in the top queue. We give each of them 10 milliseconds, but we want to make sure that every thread gets to run within 50 milliseconds, so a queue that's got 10 threads will use a 5-millisecond time slice. We do have a lower limit on that: if the slice would work out to less than 5 milliseconds, we just give it 5 milliseconds, so things can spread out a little. Generally speaking, though, we don't hit that limit. The idea is that we don't want to churn too quickly, because there's a cost to doing the context switch, reloading caches, and so on. For the calendar-based queues, we run all the threads in the current slot until it's empty, and we give every one of them a time slice of the same duration as above: the same business of 10 milliseconds, but 50 milliseconds maximum for any given queue, et cetera. And when a thread other than the currently running thread attains a better priority, we switch immediately. So if, for example, we're running something from the calendar queue, but an interrupt comes in on the top-priority queue, we will immediately switch over to that interrupt thread. Actually, it's not quite immediate, as we'll see on the next slide; but as soon as it's practical, we switch over and run it. Scheduling a context switch: what actually causes these tens or hundreds of thousands of context switches per second to happen? The most common case is voluntarily going to sleep. Some process does a system call that can't proceed. For example, you read from the keyboard and the user hasn't typed anything, so you go to sleep until the next keystroke gets typed. Or you're waiting for a packet to come in from the network, so you go to sleep until the next packet arrives. Or you have issued a disk I/O and you're waiting for the data to arrive from the disk. This is the most common case, and it's called a voluntary context switch: you are saying, I don't need the CPU anymore, run something else. Another fairly common case is when a higher-priority thread becomes runnable, most commonly from interrupt level: a network packet arrives, or a timer goes off, or a disk I/O completes. That causes the currently running lower-priority process to stop and the higher-priority process to get to run. And then we have once per time slice: the 10-millisecond timer I talked about goes off and says, OK, you've used up your time slice, and now it's time for something else to get to run. Now, I said that when a higher-priority process comes in, we want to switch to it. Often, on a multi-core system, there will be an idle CPU somewhere, so the thread can just be handed to the idle CPU and run there, and it doesn't need to interrupt the particular process on the CPU that happened to notice the interrupt, because interrupts are generally targeted at a particular CPU or a particular set of CPUs. But suppose we do want to stop running the lower-priority thread on this CPU so the interrupt can be handled; for example, sometimes we pin interrupts and say, these interrupts have to run on this CPU, and so if there's a lower-priority process there, we want to clear it out of the way so the interrupt can run. We don't necessarily want to stop that other thread exactly where it is, though. It might be in the middle of a critical section, and it certainly could be holding some longer-term locks.
And we don't want to put it to sleep while it's holding those locks, because that's just going to cause excess resource contention. So instead, what we do is set a flag in the thread's flags saying, I need to be rescheduled. That request then gets processed later: for an interrupt thread, when the interrupt returns, or otherwise at the end of the current system call. So as we're getting ready to return from a system call, back out of the kernel to the user process, we notice that this reschedule flag has been set. At that point we know that any locks that were held have been released, at least any short-term locks, and so it is now safe to context switch off to the higher-priority process. Now, if the rescheduling is involuntary, that is, we're not switching because the previous process went to sleep, but because a higher-priority process wants to run, then the process we're switching away from still wants to run, so we need to put it back onto the run queue. Because remember, what happens is that we take it off the run queue and run it; if it goes to sleep, we're done, and we just go find the next thing to run. But if it wants to continue running, because it used up its time slice or is being preempted, then we need to put it back onto the appropriate run queue so that at some point in the future it can continue running. How do we actually context switch? What we're doing, under kernel control, is switching from one thread to another, and there are essentially three steps. We save the context of the current thread: the registers, address space, and so on; what you need to save is defined by the hardware. We then choose the next thread we want to run. And having chosen it, we load its context: whatever registers and address space it had need to be reloaded. Remember I said that this low-level context switch code runs a lot: typically anywhere from one to as much as five percent of all the cycles on the machine are spent doing this, so you really, really, really want it to be fast. Of the three parts, the context load and save are defined by the hardware, and there's nothing you can do to make them any faster. So the only place you have any room to maneuver is picking the next thread, and that's why we want the low-level scheduler to be so simple: find the bit for the queue you want, pull the first thing off that queue, load it. Typically in FreeBSD this can be done in about 20 assembly-language instructions. That is why we split the kernel's scheduling into high-level and low-level schedulers: we make the low-level scheduler very, very quick, and that allows us to take much more elaborate steps in the high-level part of the scheduler, because it runs much less frequently. There it's OK for us to spend some time really sitting down and thinking about what we want to do.
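A minimal sketch of the deferred-reschedule flag described above, with hypothetical names (in the real kernel this is a flag on the thread structure, checked on the way back out of the kernel):

```c
#include <stdbool.h>

/* A sketch of the thread state involved; names are hypothetical. */
struct kthread {
	bool	need_resched;	/* a better-priority thread is waiting */
};

/* Called, for example, from an interrupt handler that has just woken
 * a higher-priority thread: don't switch yet, the current thread may
 * hold short-term locks.  Just leave a note. */
void
request_resched(struct kthread *curtd)
{
	curtd->need_resched = true;
}

/* Called on the way back out of the kernel, after a system call or
 * interrupt, once any short-term locks have been released. */
void
maybe_resched(struct kthread *curtd)
{
	if (curtd->need_resched) {
		curtd->need_resched = false;
		/*
		 * Put the current thread back on its run queue (it
		 * still wants to run), then context switch to the
		 * higher-priority thread.
		 */
	}
}
```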
The next step I want to look at, then, is ULE: what is it trying to do? Here we're talking about the high-level part of the ULE scheduler. Our goals are, first, to identify and give low latency to interactive threads, while allowing for brief bursts of activity. To make this work, we have to differentiate the time a thread spends waiting for the CPU from the time it spends waiting for user input. If the system is really busy, you may not be running simply because you haven't been able to be scheduled yet, and we don't want to give you sleep credit for that time, because you're waiting along with everything else that's trying to run. What you get credit for is the time during which you actually do not want to run: you've gone to sleep and you haven't yet taken some action that says, now I want to run again. Another thing we would like is, in general, to have a given thread keep running on the same CPU over and over again, rather than have it bouncing around from one to another. This is called processor affinity. The reason is that you have state on a processor: the processor has memory caches, VM caches, and other things. To the extent that we can run you fairly soon on the same CPU, a lot of that state in the L1 and L2 caches will still be there, whereas if we move you to a different CPU, you're going to have to pull all of that back into the L1 and L2 caches of the CPU we've moved you to. So generally speaking, we want to try to keep you on the same CPU, and the fact that each CPU has its own set of scheduling queues naturally gives us that, because the only way you're going to move from one to another is if, at a higher level, we decide that your CPU is just too busy and move you from its queue to some other CPU's. When we do decide to move you, we need to think about where to move you to, and this is where NUMA, non-uniform memory access, comes in: different CPUs have different characteristics relative to the one you're on. As I've said, the same CPU is the ideal case. But if there are multiple cores on the same chip, they often share at least some of their caches, so moving you from one CPU on the chip to another CPU on the same chip is not as bad as, for example, moving you to another chip on the same board. If it's a dual-socket board and we move you from a CPU in one socket to the other socket, there's no shared caching between the CPUs themselves; there's possibly some benefit from the memory plane, that is, the memory on that particular board, as opposed to having to go to the memory on a different board. But it's still further away than something within the same chip. And then, of course, we may have multiple boards in the chassis, and we may have multiple chassis, so again, staying on a CPU in our chassis is likely going to be better than going somewhere else. The other goal of ULE is not to have any operation that starts with "for every CPU" or "for every thread". We never have to look at every thread in order to make decisions. We do occasionally look at all or most of the CPUs, but that's done on the order of once a second, and we amortize that cost over the fact that it doesn't happen very often. So let's start with how we differentiate interactive versus batch. We've got a bunch of scheduling variables. There's the nice variable, which is the one the user gets to set, in a range of minus 20 to plus 20; the default is zero. Going positive with nice means, give me lower priority, and going negative says, give me higher priority. Typically a user is only allowed to move toward higher, positive values of nice, and only the administrator is allowed to go into the negative range.
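As a concrete illustration of nice, here is how a process might lower its own standing using setpriority(2); the value 10 is an arbitrary positive (lower-priority) choice:

```c
#include <sys/time.h>
#include <sys/resource.h>

#include <err.h>
#include <stdio.h>

int
main(void)
{
	/* id 0 with PRIO_PROCESS means the calling process. */
	if (setpriority(PRIO_PROCESS, 0, 10) == -1)
		err(1, "setpriority");
	printf("now running with nice 10\n");
	/* ... CPU-bound work here now competes less aggressively ... */
	return (0);
}
```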
The next two scheduling variables, the runtime and the sleep ticks, track recent CPU utilization and recent voluntary sleep time. As you're running, a statistical clock goes off, and each time a tick comes in, whatever thread is currently running on that CPU gets charged with a runtime tick. Similarly, when you're not running, you are accumulating sleep time; we don't do that with a clock, we just mark the time when you go to sleep and note the time when you wake up, and that difference is what gets added to your sleep ticks. We then have a priority, which is the current priority you're running at, and a user priority, which is the priority when you're running at user level. Normally these two are the same value, but when you enter the top half of the kernel to do a system call, your priority may be temporarily boosted, because we want to urge you to get through the system call and back out into user space, so that you contend less for kernel resources. The way we keep these values recent is by decaying them: we decay the runtime and the sleep ticks whenever their sum exceeds five seconds. When the two of them add up to five seconds, we shave 20 percent off each value, so their sum comes back down to four seconds. We then recompute the thread's priority from these values whenever it accumulates a tick or is awakened. And our decision on whether a thread is batch or interactive is simply whether the sleep ticks exceed the runtime: if you're sleeping more than you're running, you're considered interactive, and if you're running more than you're sleeping, you're considered batch. All right, so how do we decide which CPU to run on? Remember, a thread comes off the run queue when it goes to sleep, so when it wakes up, we need to decide where it's going to go. We follow pretty much the hierarchy I've already described. First of all, threads can have a hard affinity: either the application or the kernel can say, this thread must run on this CPU, or, through the user interface, on one of this set of CPUs. So you might say it can run on any of the CPUs, but only these four that are on this particular chip are permissible. If a thread has an affinity to a single CPU, it's easy: we just put it on that CPU's queue and we're done. Interrupt threads being scheduled by the hardware interrupt handlers are scheduled on their current CPU, that is, the one that handled the interrupt in the first place, provided their priority is high enough to be able to run there immediately. If a thread can't run immediately on that CPU, because something at a higher priority is already running, then we look first at the last CPU on which the thread ran, and then we walk down the hierarchy: first the CPU it last ran on, otherwise one on the same chip, otherwise one on the same board, otherwise one in the same chassis, et cetera. In the worst case, we search the entire system for the least-loaded CPU that is running a lower-priority thread.
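Before moving on, here is a sketch of the interactivity bookkeeping just described, using the round numbers from the talk rather than the actual sched_ule.c arithmetic; the names are illustrative, and ticks are assumed to be milliseconds:

```c
#include <stdbool.h>

#define DECAY_THRESHOLD	5000	/* five seconds of combined history */

struct sched_hist {
	int	runtime;	/* recent ticks spent running */
	int	slptime;	/* recent ticks voluntarily asleep */
};

/* Whenever the sum passes five seconds, shave 20% off both counters,
 * bringing the sum back down to about four seconds of history. */
void
sched_decay(struct sched_hist *sh)
{
	if (sh->runtime + sh->slptime > DECAY_THRESHOLD) {
		sh->runtime -= sh->runtime / 5;
		sh->slptime -= sh->slptime / 5;
	}
}

/* Sleeping more than running means interactive; otherwise batch. */
bool
sched_interactive(const struct sched_hist *sh)
{
	return (sh->slptime > sh->runtime);
}
```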
If that search ends up offering a better CPU choice than the last CPU on which the thread ran, we will switch it. And the longer the thread has been asleep, the more willing we are to switch it, because the longer you've been sleeping, the less of your state remains in the cache: if other things have run, they've loaded the cache up with their data. If you ran on a CPU within the last millisecond, you probably still have stuff in its caches. After a number of milliseconds, five or ten, the chance that there's much useful left in the cache is pretty low, so in moving you to some other CPU we're not throwing away much state, because there probably isn't much state left on the CPU you ran on anyway. The last thing is that we have to periodically rebalance threads between the CPUs. When a CPU idles, that is, when it's got nothing to do, it looks around at the other CPUs to see if there's one from which it can steal work. This can be a bad thing to do: you see a thread on some other CPU and you grab it, but if that was the only thing that CPU had to do, all you've done is pull it over, it has now lost its caching, and that other CPU now has to go find something to do itself. So generally speaking, we won't ever steal work from a CPU that has only one thing to do; we just leave it there. But if some other CPU has more than that, we will potentially take work from it. Conversely, if a job gets added to a CPU that has an excessive load, that CPU will look for other CPUs to which it can push work. For example, I might have a thread with an affinity that requires it to run on one particular CPU, and that CPU is already loaded up with a whole lot of stuff. The CPU will say, look, I realize I have to run this thread, but let me shed this other one off my run queue and send it somewhere else, so that I won't be so busy and can better serve the thread that has to run here. And then, approximately once per second, the ULE load balancer runs. It looks across all the CPUs and finds the busiest one, and assuming there's more than one thing on that CPU, it takes one of those threads, finds the least busy CPU, and puts it there. So over time, if things get out of balance, this acts as a sort of backup that slowly migrates things around. Now, why wait one second? Why not make it half a second, or two seconds? The answer is that if you get too frenetic about moving things around, you're really just stirring the pot to no great effect. On the other hand, if you wait too long, you can get a particular CPU that is very busy, and it just takes a long time for its work to get offloaded to other CPUs. The upshot is that, empirically, what we've found is that doing this about once a second gives you the right balance between moving things around too aggressively and still being able to get things moved when it makes sense to do so.
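A self-contained sketch of that once-per-second balancer; the load array stands in for the real per-CPU queue depths, and moving a thread is reduced to an integer transfer:

```c
#include <stdio.h>

#define NCPU	8	/* illustrative */

/* Runnable-thread counts per CPU; stand-ins for real queue depths. */
static int cpu_load[NCPU] = { 5, 1, 0, 2, 7, 1, 3, 0 };

/* Find the busiest CPU, and if it has more than one runnable thread,
 * move one thread to the least busy CPU. */
static void
balance(void)
{
	int busiest = 0, idlest = 0;

	for (int cpu = 1; cpu < NCPU; cpu++) {
		if (cpu_load[cpu] > cpu_load[busiest])
			busiest = cpu;
		if (cpu_load[cpu] < cpu_load[idlest])
			idlest = cpu;
	}
	/* Never steal a CPU's only thread; that just discards its cache
	 * state and leaves the CPU looking for work. */
	if (busiest != idlest && cpu_load[busiest] > 1) {
		cpu_load[busiest]--;
		cpu_load[idlest]++;
		printf("moved one thread from CPU %d to CPU %d\n",
		    busiest, idlest);
	}
}

int
main(void)
{
	balance();	/* in the kernel this runs about once a second */
	return (0);
}
```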
So thank you very much, and I look forward to taking the questions that hopefully will come streaming in. Thanks, Kirk, for that talk. I think I'll unmute you, Kirk, so you can answer questions. Can you? Great, I am here, ready to answer questions. Actually, there's been a big stream of things in the chat box here about the difference between the 4BSD scheduler and the ULE scheduler, so let me just give an overview. What's been in there has been correct; Jan and Jessica have been doing a good job of explaining it all. First of all, there was the question of whether you can switch back and forth between schedulers. As was correctly noted in the chat, it is necessary to recompile your kernel; you can't just dynamically switch between them. The reason is, first of all, that there would be overhead if you had to decide, at every scheduling decision, which scheduler am I using? As I've already said, this is happening tens or hundreds of thousands of times a second, and you definitely do not want to add that extra overhead. The other thing is that you can't have some processes running on one scheduler and others running on a different scheduler; you really just have to say the system as a whole is doing one or the other. So we deliberately made it a compile-time decision, and you have to decide up front what you're doing. The overhead of allowing dynamic switching is just too excessive to put in. It's not that it couldn't be done; it just is not a reasonable thing to do. And then the other question that's been coming through is, what exactly is the 4BSD scheduler? Again, it's correctly described in the chat. There is one queue that is global: a single queue with 256 priorities in it, 0 to 255. And when the running thread goes to sleep, we just start at the top of that queue, scan down, find the first thing, and that's what we run. There's no notion of real time per se, except that you can artificially raise priorities on things. But as a practical matter, there is this one global queue, and a global lock that controls access to it. So, as was correctly pointed out, that works OK up to about four, maybe six or eight, processors; beyond that, the contention for the lock protecting the queue becomes a serious bottleneck. The benefit of the 4BSD scheduler is that it's drop-dead simple. It's like 100 lines of code, and 40 of those are comments. It doesn't have corner cases. It's just straight down the road: find the thing with the highest priority, run it. The other issue with it is that once a second, we have to look at all of the runnable processes and recompute their priorities. So if you start ending up with hundreds of runnable threads, that starts to become a serious overhead. As was pointed out, with ULE we never have to look at everything, other than perhaps when we're thinking of moving things around, and even there we don't have to look at everything within any fixed amount of time; we just ploddingly work our way through it as time permits, unlike the 4BSD scheduler. So the 4BSD scheduler works very well for small embedded systems. Up to about four CPUs, you just don't need the complexity that comes with ULE; but as soon as you get to anything over about four CPUs, ULE is just the winning strategy. Yes, Jan points out that 4BSD migrates threads like crazy between cores, because it has no affinity: it's just first come, first served. So if affinity is going to be an issue for you, then 4BSD is not your friend. OK, other questions. I don't think people can unmute themselves, so I think you have to put them into the chat. OK, so Jan writes: how hard is it to run heavy computation in just a few real-time threads? Well, that's largely a function of your application. If you can get it down to a few real-time threads, then obviously you're going to make the scheduling easier. And note that you have the ability to pin things to particular groups of CPUs, or even to a particular CPU.
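That pinning is done with cpuset_setaffinity(2); here is a short userland sketch that pins the calling process to CPUs 0 and 1, where the CPU numbers are arbitrary example values:

```c
#include <sys/param.h>
#include <sys/cpuset.h>

#include <err.h>
#include <stdio.h>

int
main(void)
{
	cpuset_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);	/* allow CPU 0 */
	CPU_SET(1, &set);	/* allow CPU 1 */

	/* -1 as the id means the current process. */
	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
	    sizeof(set), &set) == -1)
		err(1, "cpuset_setaffinity");

	printf("pinned to CPUs 0 and 1\n");
	return (0);
}
```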
So if you've got four CPUs and you've got four things that need to run, you can just pin them to those four CPUs, and then essentially you're doing your own scheduling at that point. So spaVU, I don't know how that gets pronounced, asks: for a system like a Raspberry Pi with four cores, might 4BSD be good? It's certainly worth trying, and it's pretty evident in practice whether it works well for you or not. For many, many years, people doing embedded systems wanted a small footprint in the kernel, and 4BSD is certainly a much smaller-footprint scheduler than ULE. When I gave the version of this talk at BSDCan, I had an hour, so I also discussed a paper that appeared at a USENIX conference comparing the schedulers of Linux and FreeBSD. One of the data points there was: the 4BSD scheduler is 100 lines of code, the ULE scheduler is 2,000 lines of code, and the Linux scheduler is 20,000 lines of code. So right there you can see the scaling of complexity. It was an interesting paper, and I do recommend it; or you can go back and look at that BSDCan talk, where I discuss it for the better part of 10 or 15 minutes. The way they did the comparison is that they went into Linux, removed the Linux scheduler, and put ULE in its place, so they factored out Linux versus FreeBSD in every other respect: they were running Linux either way, and the only change was which scheduler was in it. And as they pointed out, it was easier to put ULE into Linux than to put the Linux scheduler into FreeBSD, because ULE is so much smaller and more compact. Anyway, the bottom line is that for the most part the two schedulers do about the same. There are a few cases where Linux does dramatically better, and a few more cases where FreeBSD does dramatically better. The real issue is that the Linux scheduler has gazillions of special cases in it to deal with certain workloads, and every now and then you get a workload where the Linux scheduler misgauges whether it's one of those special cases. It flips into using a special case that is a distinctly bad choice, and so they end up with some workloads where they're just really, really bad. Whereas ULE does have workloads where it doesn't do as well as Linux, mostly because Linux has special-case code for that workload, but it doesn't have any where it's just horrible. OK, so let's see. Yes, we're also coming up on, I think it's at 1725, when the room will restart in preparation for the next talk, so it's probably better to move on to Spatial.Chat for questions. OK. Well, I will move over there; if people have questions, they should show up over there. Right, OK. Well, thank you very much for your talk, Kirk. It was really interesting. I'm hoping we can do something like this in person sometime next year. Right. So thank you very much, everyone. Right, bye now.