Today I'm going to talk a little bit about the scheduler and load balancing in RHEL 8, and a little bit upstream as well. First, a quick rundown: a short biography about me, a little introduction to the scheduler and load balancing in general, then I want to focus on a case study of a specific load-balancing issue that has been addressed in recent RHEL 8 minor releases, and finally I'll spend a little time looking at a couple of future features in the upstream scheduler that we're hoping to have in RHEL at some point. Whether that's RHEL 8 or RHEL 9 depends on when they land and how invasive they turn out to be. I have some more info and links at the end.

A little bit about me to get started: I am the maintainer and product owner for the scheduler component of the RHEL kernel, and I've been doing that for a couple of years now, about as long as I've been at Red Hat. There's still lots for me to learn; it's an area I haven't spent a lot of time with before, and there are a lot of intricate bits to it, but it's quite interesting and there's a lot going on. I've been working with the Linux kernel since way back in the day. I've done some network driver work, some Xen and I/O virtualization work, I wrote a multipath driver at one point before the MD multipath came about, and I've worked on some memory management pieces and various other bits. As I said, I'm somewhat new to the scheduler, but it's a lot of fun and an interesting area, so let's get right into it.
So the scheduler is the component of the kernel that determines when tasks run, where they run, and for how long. It establishes the mechanism we use to share the processing resources of the system. One of its main components is the fair scheduling class, and most of what we're going to talk about today deals with it. It gets its name from the Completely Fair Scheduler, also known as CFS, which is the implementation of normal scheduling: it's the behavior you get when you spawn an ordinary task, and most of the tasks on your system run in SCHED_NORMAL, or SCHED_OTHER as it's also called.

In addition to determining when tasks run and for how long, and swapping them out when it's time to switch to a new task, where tasks run is the province of the scheduler as well, and that happens in code called the load balancer. Back when we had a single processor, the "where" didn't matter so much, but on modern multiprocessor systems it makes a much bigger difference: you have to spread the load across all the processors as much as possible. That's the goal of the scheduler; it's what's called a work-conserving scheduler. The idea is that if there are tasks to run, we should find a CPU to run them on and move them there, rather than leaving CPUs idle while there is work that could be done. But there are constraints that affect how this load balancing can work. One is CPU affinity set by user space: tasks may be configured to run on only a certain subset of CPUs. There are also
kernel threads that are pinned to specific CPUs; obviously they can only run on the CPU they're pinned to and can't be moved. There can also be power and frequency constraints: if some processors are in a very deep sleep state and you have a short task to run, you may want to run it on a CPU that isn't as deeply asleep, so you get better latency and don't have to raise a CPU's frequency or bring a CPU out of sleep. There can be asymmetric multiprocessing systems, like big.LITTLE, where some processors are more powerful than others, and you may want to run small tasks on the small CPUs and larger tasks on the larger CPUs. Another constraint that can affect load balancing is non-uniform memory access (NUMA): on a system with multiple NUMA nodes, you may want to move a task to the node where its memory is rather than moving the memory to the task. All of these things play into how and where we can move tasks, but in general the idea is to spread tasks out as much as possible.

The scheduler code tends to be performance critical in various ways, not just in the load balancing; it's a very hot path in the kernel. There are often trade-offs involved, where you can make things better for one workload but worse for another. We try to keep a balance where we do generally the best we can across the board without overly penalizing or favoring one workload versus another; there are definitely trade-offs involved.
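As an aside, the user-space affinity constraint mentioned above is easy to poke at. This is a minimal, Linux-only sketch using Python's standard-library wrappers for sched_getaffinity(2)/sched_setaffinity(2); the choice of CPU is arbitrary and just for illustration:

```python
import os

# Query the set of CPUs the current process (pid 0 = ourselves) may run on.
allowed = os.sched_getaffinity(0)
print("allowed CPUs:", sorted(allowed))

# Restrict ourselves to a single CPU; the load balancer must now leave
# this task on that CPU no matter how busy the CPU gets.
one_cpu = {min(allowed)}
os.sched_setaffinity(0, one_cpu)
assert os.sched_getaffinity(0) == one_cpu

# Restore the original mask so the scheduler may spread us out again.
os.sched_setaffinity(0, allowed)
```

This is the same constraint that taskset(1) or a cpuset cgroup imposes from the outside: the balancer simply never considers CPUs outside the mask for that task.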
All right, let's go on to the next thing. The issue I want to look at comes from a paper called "The Decade of Wasted Cores," which came out about four or five years ago now, from several researchers in France and Canada; I have a link to the paper later in the deck so you can get the actual details. The authors studied the Linux scheduler and found four issues in the then-current implementation. Three of them were fairly straightforward, easy fixes and went into the upstream kernel around the time the paper came out, but there was another issue that ended up being called the cgroup imbalance. What this issue means is that if you have multiple cgroups on the system, you can cause the load balancer to behave badly when there are many busy tasks in one cgroup and few busy tasks in the other cgroups.

The case we're looking at is a setup specifically designed to trigger this. It uses three cgroups: we run a multi-threaded benchmark in one of them, and a CPU hog in each of the other two. So we have a number of threads running a CPU-bound workload with timing measurements in one cgroup, and single processes in the other two cgroups, and we're going to see that the load balancer does not do what we'd really like it to do. We'll do that by looking at some nice pictures.

This is a heat map generated by tools our perf QE team came up with, based on the work in the paper I mentioned above. It's a 48-CPU system: on the left, the y-axis is the CPU number, going from 0 to 47 across two NUMA nodes, with CPUs 0 to 23 on node zero and the rest up on node one, for 48 CPUs total. What we're tracking is the number of running processes on the
run queue of each of these CPUs. In Linux, the number of processes in the run queue includes the one currently running on the CPU, so a value of one, the blue color, means there is one process running on that CPU. Black is idle, so the CPUs you can see toward the middle of the picture on node zero are idle, and the hotter colors above blue, which you can see on node one corresponding to the idle sections below, are where multiple processes are queued on one of these CPUs, which means processes are waiting. What really should happen is that those waiting processes should be filling in the black space, and everything in this picture should be basically blue.

Our benchmark process is the LU program from the NAS Parallel Benchmarks suite, running 44 threads, so there are 44 tasks running in that one cgroup, plus the two CPU burners, for 46 tasks on our 48 CPUs. The long stretches of idle that run across the picture are sort of expected, because two CPUs should be basically idle on this system. Across the bottom, of course, is time; the time axis isn't all that interesting in this specific case, it's more that the heat map gives you a good visualization of what's happening. This is what the test looked like from the time of the paper up through what we shipped in RHEL 8.2, and in the upstream kernel prior to v5.5. It leads to a fairly significant performance degradation in the results of the actual benchmark itself; this isn't an especially bad run, but the benchmark can run as little as one tenth as fast as it does in the normal case, because it's a CPU-bound workload and it's not able to run.
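To make the picture concrete, here is a toy text renderer of the same idea; this is not the perf QE tooling (which is linked later), and the CPU count, sample data, and character ramp are all made up for illustration:

```python
# Toy heat map: rows are CPUs, columns are time samples, and each cell
# shows how many tasks are queued. As in the kernel's rq->nr_running,
# the currently running task counts as 1.
RAMP = ".1234567"  # '.' = idle, '1' = one runnable task (blue), etc.

def heatmap(samples):
    """samples[t][cpu] -> nr_running at time t; returns one text row per CPU."""
    ncpus = len(samples[0])
    rows = []
    for cpu in range(ncpus):
        cells = [RAMP[min(s[cpu], len(RAMP) - 1)] for s in samples]
        rows.append(f"cpu{cpu:2d} " + "".join(cells))
    return rows

# Four CPUs, five samples: cpu0 has two queued tasks while cpu3 sits
# idle -- exactly the kind of imbalance the load balancer should fix.
demo = [[2, 1, 1, 0]] * 5
for row in heatmap(demo):
    print(row)
```

In the real traces, a row that stays at "2" while another row stays at "." is a waiting task and a wasted core, which is what the top half of the RHEL 8.2 picture shows.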
The processes in the top half of the picture are waiting, and work is not being done. If we go to RHEL 8.3, this is the exact same test on the 8.3 kernel. What's different is that in v5.5, Vincent Guittot did a rework of the load balancer, and one of the things that rework does is solve this problem: instead of looking strictly at the load of the system, it also takes the number of running processes into account when making its load-balancing decisions. What was happening in the previous picture has to do with the way cgroup scheduling works and the way it calculates load. To keep everything fair, the load a process contributes to the system is divided by the number of processes in the cgroup it's in. So in our cgroup with the 44 threads, each of those CPU-bound processes claimed to have 1/44th the load of the CPU hogs that were running as singletons in their own cgroups. When the balancer did its load calculations, it used that number, which is skewed by the group scheduling policy. The reworked load balancer looks at the number of running processes first, and only uses the load calculation for fairness when the system is really overloaded, which is not the case in this setup: as I said, we have 46 tasks on 48 CPUs, so it shouldn't be overloaded and should strictly look at the number of running tasks per CPU. And you can see the picture is basically all blue; the idle moves around a little bit, but we have more or less two full threads of idle and 46 CPUs in use. That looks a lot better; the performance is back to what it should be, and you now get the same performance with the cgroup setup as without. So we might think we're all done and everything's good now.
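The 1/44th skew is easy to see with some back-of-the-envelope arithmetic. This is a deliberately simplified model, not the kernel's actual load-tracking code: assume every cgroup has the same weight (1024, the default CFS share) and that the weight is split evenly among the group's runnable tasks:

```python
# Simplified model of cgroup load accounting (NOT the real kernel
# computation, just the intuition behind the cgroup-imbalance bug):
# a group's weight is divided among its runnable tasks.
CGROUP_WEIGHT = 1024  # default cfs cpu.shares value

def per_task_load(tasks_in_group):
    return CGROUP_WEIGHT / tasks_in_group

lu_thread = per_task_load(44)  # one of 44 benchmark threads in one cgroup
cpu_hog = per_task_load(1)     # singleton hog in its own cgroup

print(f"LU thread load: {lu_thread:.2f}")
print(f"CPU hog load:   {cpu_hog:.2f}")
print(f"ratio: {cpu_hog / lu_thread:.0f}x")
```

A purely load-based balancer therefore sees one hog as "equal" to dozens of benchmark threads and happily leaves those threads stacked on a few CPUs; balancing on the number of running tasks first, as the v5.5 rework does, sidesteps the skew entirely when the system isn't overloaded.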
But we're not quite there yet, because if we look at this next graph, still on RHEL 8.3 but now with 24 threads in the benchmark cgroup instead of 44 (still two CPU burners in their own cgroups), there's a lot more idle, as you can see, but there's also this interesting saw-blade behavior, which you can see pretty clearly in the visualization. What it means is that we've got tasks moving back and forth from node to node: there's a steady number of tasks in the system, and they're just getting migrated back and forth between the two NUMA nodes. The end result is only a couple of percentage points of performance difference, but there is a cost; there will be some cache damage from moving the processes around. It wasn't quite as obvious without the visualization; you might not even have noticed, because the difference wasn't as great, but this is still not really what we want.

What's happening here is that there are actually two pieces of code that do load balancing at the NUMA level in the kernel. There's the scheduler load balancer itself, which is what we talked about before and what was fixed in the RHEL 8.3 and upstream v5.5 kernels, but there is also automatic NUMA balancing, which specifically balances and moves tasks at the NUMA level. In the v5.5 kernel, and in RHEL 8.3 here, the NUMA balancing code is not using the same logic as the new load balancer, so what we have is the two basically fighting each other: the load balancer tries to balance tasks across the NUMA nodes, the NUMA balancer tries to put them back, and vice versa, and you end up with this saw-blade effect. In about v5.9, and coming to RHEL 8.4 when it comes out, Mel Gorman, who
does a lot of the NUMA balancing work, wrote a patch series, which we took into RHEL 8.4, that makes the logic the same, so the NUMA balancer and the scheduler's load balancer use basically the same criteria to figure out what to move and how to do their balancing. They don't fight each other anymore, and in that case we get something like this, which is a lot better: the tasks are not moving around. They're still moving at startup, because they all sort of start on node zero, as you can see at the very left, but then they just get moved, and from then on everything is pretty even and balanced.

Anyway, I thought those were an interesting set of traces that you can use to see what the load balancer is really doing, and here's a little more detail about how we did it. The first link is to the NAS Parallel Benchmarks, the test suite we were running. The tracepoint itself was added in the v5.9 kernel; as far as RHEL is concerned it only lives in 8.4, but we backported it to the other kernels just to run these tests. It's also a raw tracepoint, not a full trace event, so it requires an external module to turn it into a full trace event that trace-cmd and similar tools can access directly. These are pointers to that module: the top one is my version, which is just a fork of Qais Yousef's version that had a bunch of other tracepoints in the scheduler area using the same mechanism to make them usable trace events. I just added to that, and it's been pushed upstream, so the bottom one also has the trace event mechanism for the nr_running tracking. The graphs here were made by this plot-nr-running tool by Jirka Vozar, a member of our perf QE team. And while I'm here, I just want to give a quick shout-out to
Jirka, who has been a great help with this and actually did the work to give me the tooling and setup for running the tests you saw in the earlier slides.

Okay, we've got just a couple of minutes here, so I'll go fairly quickly through these last pieces and leave some time for questions, if there are any. A couple of future features we're looking at are going to be interesting. There's the latency nice feature; it's probably not going to be called that in the end, because it's not really going to be as parallel to the existing nice functionality as the name suggests, but it's a way to tell the scheduler that your process is latency sensitive. There are things in the wakeup path and elsewhere where, if the scheduler knows a task is latency sensitive, it can play into that and give the task better latency. It will probably end up being more of a switch than a knob with multiple values, but we'll see. I think it'll be interesting for some workloads, and there's a Linux Weekly News article that's pretty up to date on it.

There's also some work being done on the CPU isolation mechanism, which is basically part of the scheduler in a lot of ways. One of the things it can do is turn off the timer tick on a CPU: you can set nohz_full, and you can set a CPU isolated, which will try to move all the other threads off that CPU, so you can run your workload in complete isolation on it. The problem is that this is only boot-time configurable; you can't change it at runtime, and there's a certain amount of interest, especially from the container and virtualization stacks, in being able to change this configuration at runtime. There's some work ongoing there.
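For reference, the boot-time-only configuration being discussed looks something like this kernel command-line fragment; the CPU list 2-5 is just an example:

```shell
# Example kernel command line (e.g. appended to GRUB_CMDLINE_LINUX):
# stop the timer tick on CPUs 2-5 when they run a single task, keep the
# scheduler from placing other work there, and offload RCU callbacks.
nohz_full=2-5 isolcpus=2-5 rcu_nocbs=2-5
```

Because these are parsed once at boot, changing which CPUs are isolated today means a reboot, which is exactly what the runtime-configuration work aims to fix.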
There are pieces, but I haven't seen a comprehensive project to do this, so I don't have a good link to point to for that. There's also another way to do isolation, the task isolation mode; it's a related project that this will interact with.

We have a question here, with three minutes to go: "How do you measure the CPU utilization without interference from the program that is actually measuring, the observer effect, or is it negligible?"

You mean for the tracing, for the pictures we were looking at, I assume. I haven't measured that specifically, but it's fairly negligible. It's the kernel's tracing mechanism, the ftrace events, which are fairly lightweight and don't add a lot. There is definitely some interference; I didn't want to point it out, but if I quickly go back, I think some of these little green dots, where you can see an extra process running, may be the trace-cmd threads waking up to record the data. But the overhead of actually tracing using the tracepoints is very low.

Does anyone have any other questions? Feel free to post them in the Q&A; we still have a few minutes to go, and then we can move to the Discord, session room two, and you can ask there after the talk ends.

All right, sounds good. If there aren't any more questions yet, let me just mention that core scheduling is an interesting feature that I wanted to make sure I got to mention. It's a way to have the scheduler enforce that only tasks that trust each other will run on the siblings of a hyper-threaded core at the same time. It's originally a security feature, and there are some other pieces needed to make it really fully work as a security feature, but
there's definitely some interest there, and it's getting close to being merged. I'm not sure yet when it will be in; there's still some interface work being done. I believe it's in Peter Zijlstra's huge queue at this point, so maybe 5.12 or 5.13 is when it might actually land.

And our time is up, so thank you for your talk, Philip, and see you next time.