So, this is the usual thing. This has been a group effort. There's a large number of people that have made this happen. I've played a reasonable part, but not the major part, by any stretch of the imagination. There are a number of people in this room. I see one there and a few other places. Christoph and probably some other guys I'm missing that contributed to this. The guy that probably put the most work into it is Frederic Weisbecker. And I have to give a special tip of the hat to Thomas Gleixner, who is the guy that convinced Frederic that he should do this for his master's thesis. So, you know, with that, these are the kinds of things I'm going to go through here. And at my age, of course, the first thing you do is go back to the good old days. It's just a natural thing. If you aren't of this age, you'll find out soon enough; don't worry about it. You know, in the 90s, CPUs had no energy-efficiency features whatsoever. In fact, in the first half of the 2000s or so, most didn't either. I mean, there were smartphone and cell-phone CPUs that did worry about it. But on the server-class CPUs, the idle loop was actually the least energy-efficient state, the one that used the most power. And the reason for that was there were no cache misses, and so the ALU was kept fully busy, consuming as much power as it could. And that actually meant that if you had extra scheduling-clock interrupts, your system was more energy efficient, rather than less. But things have changed over the past 10 or 20 years, and right now, idle CPUs are preferably powered-off CPUs. And even the high-end systems tend to be pretty good, although not as good as the battery-powered guys, by the way. I mean, those of us in the server arena, doing the big servers, we have energy efficiency as a first-class requirement. The guys down in the battery-powered embedded range don't merely have it as a first-class requirement. For those guys, it's a fundamentalist religion. Okay? You think I'm making that up?
No, well, we'll get to that in a little while. But what happens now is you want the CPU that's idle to be powered off, and that means you don't want it to get scheduling-clock interrupts unless they're really needed. Okay? And again, especially on battery-powered systems. I've had guys call me up. I mean, LKML wasn't enough for them. They by God called me up on the phone and screamed at me, because some scheduling-clock interrupts that RCU was causing were burning 40% of their battery. And they didn't like it. They were very upset. And they let me know. So, you know, you think I'm making that up? No. Anyway, this is kind of the way things are if you don't control your scheduling-clock interrupts when the system is idle. You'll have your clock eating up your battery, causing a lot of problems. And that's not what we want. I mean, we really, really don't want that these days. And if you don't believe me, put a patch in that does that and see what phone calls you get. Okay? Well, this is kind of what we want instead. If the system's idle, we want it to be asleep. We don't want it to be consuming power. We don't want it to be draining the battery. And we want something like that. And for idle, we've had this for a long time. Before we had dyntick-idle — this is back to the very early 2000s — we'd have these CPUs, with time advancing from left to right, and the vertical red bars are scheduling-clock interrupts. We have CPUs 0 and 1. And if they go idle, well, they kind of get to low power, but we still have these periodic clock interrupts showing up. Now, modern CPUs are often really good about getting into low power really quickly. But even so, you have them spinning up and then spinning down. And that does consume energy. And if you have that happening, your battery won't last very long. And it's actually kind of obnoxious if you think about it. I mean, would you like it if every jiffy, while you're trying to sleep, somebody woke you up? I know I wouldn't like that.
So this is the way things work today, if you have CONFIG_NO_HZ enabled. You have your CPU. It goes idle. And there are no more scheduling-clock interrupts. That allows the CPU to get into the lowest possible power state and stay there. And that means that you have very little energy consumption, at least from the CPU. If you have a display, if you've got mass storage, other things, the network, well, you've got to worry about those separately. But I'm worried about the CPU, and this really helps with that. So this is great for energy efficiency. But more recently, there were cases where extra scheduling-clock interrupts during user-space execution were causing problems. What happens is, if you have a real-time application or a high-performance computing application and you have extra scheduling-clock interrupts, you can actually really degrade the performance. For real-time, of course, you get that extra scheduling-clock interrupt's worst-case duration added to your worst-case latency. And for HPC, if you have iterative computations, you can end up multiplying the effects, because one CPU gets slowed down and everybody else has to wait for it before you can start the next iteration. So there are cases where those cause problems. And the other thing is, if you're running that kind of application, this heavy-duty HPC computational application, you have one runnable task per CPU. I mean, these guys go to great lengths to make sure that the worker thread is the only thing on that CPU — interrupts, other threads, whatever, all moved off. And so that CPU has no job whatsoever aside from running that one thread. And that means that if we interrupt it, all we're doing is slowing things down. I mean, it's the only thread there. There's nothing else for the CPU to do, so what's the point? So what we want to do instead is, if some other task shows up and we have two, then we start interrupting it.
But if we're back to the point where we have just one task running on that CPU, we don't want any scheduling-clock interrupts. And also if you're busy, as well as if you're asleep, getting interrupted frequently isn't helpful. So we want to avoid that. Alright, so what we can do is get rid of these things. And an intern of mine, Josh Triplett — he's now off doing Chromebook stuff — prototyped this in 2009. And Anton Blanchard actually did some benchmarking on some HPC-style benchmarks. Without that fix, that red stuff shows up. In other words, we're losing about 3% performance. And that may not sound like much. But if you have people that are trying to get the most out of their system, that's a big deal. With Josh's patch, we end up with that. And we'll just go back and forth here. This is without his patch. This is with it. So this is really a worthwhile change. Except that there were some problems. I mean, this was just a prototype. For one, a user application could monopolize the CPU. If the user application started running and decided it didn't want to stop running, that was all that CPU ever did. And you can't have that. There was no process accounting. And for a lot of applications, who cares, but there are places where you do care, where you want to know. And it's also useful in some cases for diagnostic purposes. So we do need that. And closer to home for me personally is that RCU grace periods could go on forever, and eventually the system runs out of memory. That's not a good thing either. So Frederic took on the task of fixing this for x86_64 and also for the core kernel code. We had some people porting it to different architectures. There are probably some more by now. And it's actually in mainline now and you can run it. And it actually works fairly well. The top diagram is without it, with just NO_HZ. And you see there aren't any scheduling-clock interrupts — no vertical red lines — above the idle part.
But then you get them if the CPU is doing anything, whether it be kernel or user mode. With NO_HZ_FULL, nothing happens until you get a second task awakening. Alright. And we do have a residual 1 Hz tick. There's a patch to get rid of that. It's kind of a security blanket, I guess, because we aren't sure we've taken care of everything. And so if you want to get rid of it, you can. But you're taking on the responsibility of dealing with whatever happens once you've gotten rid of it, okay. But this works fairly well. And as mentioned before, it's in 3.10. Which has consequences. One is that this is enabled by default in RHEL 7. And when I first heard about that, I was like, yeah, this is really cool, you know. Despite having more than 40 years of software development experience, that was my reaction. I mean, you'd think I'd know better by now, but no. Ooh, alright, you know. And that means it's used by everybody. Not just by high-performance computing and real-time people. And that means that at that point the real validation began. Because I had it in my head when I started this process that the only people using this would be building their own kernels, and therefore that it would only be used for serious HPC and real-time workloads. So we didn't really bother checking it out for anything else, right? Yeah. Anyway, Rik van Riel helped get me out of that state of mind. He sends me an email saying, hey, you know, I've got this system. We've got more than 40% of a CPU showing up on this task called rcu_sched. Is that something to do with you, by any chance? And why is it doing that? Well, okay. What's going on here? First off, these things only do a little bit of work at the end of each grace period. So 40%? What? So, okay. I ask him what he's doing. And he's got an 80-CPU system. That's not that big. I mean, it's big, but there are people that run 4,096 CPUs under Linux, and maybe more by now.
The biggest system I've ever gotten a bug report from is 4,096 CPUs. Okay. So 80 CPUs — what's the problem here? Well, the next piece was that he had a context-switch-heavy workload. He was running some kind of a Java application that was just context-switching the whole system silly. And that was not what we designed NO_HZ_FULL for. Okay. That just wasn't where we were focusing our effort. Well, okay. Sometimes you want to just go with a knee-jerk reaction, right? Sometimes you just say, let's try this, right? And "let's try this" said, alright, well, maybe the grace periods are just going through really, really quickly with that kind of workload. And so maybe if we just artificially slow the grace periods down, we can get the CPU consumption to go down. And if that works, we can figure out what we really want to do. Unfortunately, that had no effect. Which, of course, meant I actually had to really analyze the problem and figure out what was going on. Yeah, it's really tough sometimes, you know. So to kind of see what's going on, this is kind of a cartoony view of a little piece of RCU. What we have in the middle there is a combining tree. And this is a scalable way of gathering data about all the CPUs and recognizing when they've all gotten to a safe state, so we can say, hey, everybody, the old stuff is done and we can let the grace period end and get on with life. So each CPU feeds in — there are actually 16 CPUs by default for each of the leaf rcu_node structures. I'm only showing four because I'm not that great an artist. And those run up the tree, and when you get to the top and everybody has checked in, you say, hey, we're done. And then the grace-period kthread says, okay, great. It takes care of things, sets up for the next grace period, and goes again if we need to. And if you run that way, even with the nohz_full thing set up, even with the recent kernels, everything's fine. Very low CPU consumption on that grace-period kthread. So that's great.
But the thing is, if you actually have the nohz_full stuff enabled and are running it all the way, what we do is we offload the RCU callbacks. Without this, each CPU does its own callbacks. So they queue a callback — and we'll look at what those are a little bit later — and they do their own thing. Otherwise, if you're running the way that Rik was running his system, they get handed off to offload kthreads, one per CPU. The offload kthread waits for a grace period and then invokes the callbacks. If you do it that way: 40% CPU consumption on an 80-CPU system. So what are these callbacks and why do we care? I mean, if callbacks are causing problems, let's just get rid of the callbacks. Well, we could go into a lot of detail, but for this presentation, think of them as just a way of delaying work. And so we have a per-CPU data structure, and each of those things has a linked list hanging off of it. And we've got these rcu_head structures. The head structures are very simple. They've got a next pointer to link the list together, and they've got a function pointer. So we record work by putting the function in one of these things and saying, hey, do this sometime later. And then sometime later we scan the list and call the function, passing the address of the list element as a parameter to it. So it's just a way of doing procrastination. You know, when it's safe to do this, do this thing, and it happens later. And the advantage is that having this functionality enables very, very fast read-side access to the data structures. That's why we go through all this stuff. So how does this work normally, if we aren't offloading the callbacks? What's going to happen is we have our scheduling-clock interrupts and the CPU is going along doing things. And at some point we queue a callback. Okay, do this sometime later. And the scheduling-clock interrupt notices, oh look, everything's safe right now.
So it invokes softirq — that's the yellow bar — and the softirq invokes the callbacks. Alright? So again, we're just delaying work. We couldn't do it here because it wasn't safe. It became safe sometime later, and at that later time we did it. So what we're doing here is tapping the awesome power of procrastination. It's a great thing. There's one problem with procrastination, though. Sooner or later, you're going to have to do the work. This isn't like timer interrupts. I mean, a timer — if you post a timer, somebody might cancel it. You know, if you procrastinate, maybe you won't have to do the work at all. But with these things, once you post a callback, that work has to be done at some time in the future. And it will be done. You can't not do it. You just have to choose your time to do it. And because we do them by default in this high-priority softirq environment, what happens is we've posted the callback, RCU eventually decides that everything's okay and invokes it, and it interrupts some poor user thread that was really important for something — when, you know, all we're probably doing is freeing memory. We could have delayed that. We didn't need to interrupt that thing. But we did. Okay? And this is why we offload the callbacks. If we have a CPU we know is always going to be doing something important — if it's so important that we shut off the scheduling-clock interrupt for its user-mode code — we don't want it doing RCU callbacks. Okay? Or we don't want it doing them at high priority, anyway. So that's the reason for offloading. So this is without offloading: we end up interrupting ourselves in the future, possibly at a really bad time. If you're just doing throughput computing, you don't care. There is no really bad time. If you're doing high-performance computing or real-time, there can be really bad times. So what we do for offloading — this is something that Jim Houston and Joe Korty did a prototype of in their own version of Linux.
And it took me several years to figure out how the heck to incorporate that into mainline, which I eventually did. So it's their idea to begin with. The idea is, instead of having each CPU do its own callbacks, you have these kthreads, one for each CPU — rcuo, where the "o" stands for offload. And rather than having a CPU interrupt itself, it just hands its callbacks off to that kthread, and that kthread invokes them. And you can do any number of things. The rcuo kthreads by default run at normal priority, so if you have a real-time workload, your important work will preempt the callback processing and make it wait until later. That's one approach. Another approach that's fairly popular is to reserve a few CPUs as kind of housekeeping CPUs — as sacrificial lambs, if you will. And what you do is you just gather all those rcuo kthreads and force them to run on those housekeeping CPUs. And that way you're guaranteed that that stuff is not going to interrupt anybody on your important worker CPUs. So however it works for your workload, it's something that the system administrator, or whoever is setting up the system, can make their choice about. RCU doesn't care. So this is wonderful. We can eliminate disruption — but 40% of a CPU is kind of a high cost for that. And furthermore, if you didn't care about disruption, 40% of a CPU is unconscionable. If you wanted those CPUs to be doing some work, grabbing 40% of one may be really, really bad. And this is kind of unfortunate, because I was hoping to be able to replace the old-style callback processing with the offloading: everybody's offloaded all the time. But obviously that's not going to fly. At least not yet. Not unless we can figure out some better way of avoiding consuming so much CPU. Alright, so we've got this bug coming in. It's kind of late in the process. So the first thing is to stop the bleeding for the innocent bystanders. In other words, figure out some way of working out who cares about disruption and who doesn't.
And make this not hurt the people that don't care about disruption. And this initial bandage was fairly straightforward. We used to have it set up so that if we saw any sign that somebody wanted NO_HZ_FULL, we offloaded everybody. Just sort of, oh, they might worry about this, we'll offload and not have to worry about it. So the change instead was to offload only if the nohz_full boot parameter was set. In other words, not only was the kernel built to do NO_HZ_FULL, but this specific CPU had been put in the mode where it wouldn't get scheduling-clock interrupts — this particular CPU had been designated as an important CPU. Then and only then do we offload from it. And there's a git commit down there at the bottom that takes care of that. Except for one slight thing. It's kind of embarrassing, but industry experience is that out of every six fixes, one will introduce a new bug. And that's been pretty consistent for several decades. Linus may have better, more recent data, but who knows. In any case, this was the one of the six. What happened was a surprise to me. It turns out RCU is used earlier in boot than I thought. And what happens is, somebody registers a callback, and at that point RCU doesn't know whether the CPU is special or not. And so it puts it on the non-offloaded list. And then later on RCU says, oh, this CPU is special, so we'll start using the offloaded list. And then that poor callback that was put on the non-offloaded list never gets executed. Which is a problem if somebody was going to wait on it. I mean, if you're just freeing memory, so what? You've got some memory that never gets freed up. But, you know, if somebody ends up waiting, and that callback was going to wake them up, you've got a system hang. It didn't happen to me, because I didn't have that set up. And it didn't happen to Rik, so his test came out just fine. But it happened to Amit Shah.
I think it was. Anyway, so here's a simple fix. What we did before is we waited until it was legal to create kthreads, and then and only then we made our decision — and callbacks were posted before then. So instead, we just make the decision earlier in boot. You know, if somebody is posting callbacks that early in boot, what are they doing, right? So we make that decision early, and that fix takes care of that. And that means we've stopped the bleeding caused by the previous bandage that stopped the bleeding. Now it's time to fix the bug. Because we still have these people that want to do offloading. And I think 40% of a CPU on 80 CPUs is excessive. If you don't believe me, keep in mind, as I said earlier, that I've gotten bug reports on RCU from systems with 4,096 CPUs. And perhaps some of us get into the, you know, give-110% thing. But if you do the math, if you have 4,096 CPUs, that's 2,048% of a CPU consumed by that thread to keep up. And, you know, 110% gung-ho is one thing, but 2,048% is just stupid. And what's really going to happen if you try that is the grace periods are just going to lag way, way behind, and you'll possibly run the machine out of memory. It's not a good thing. So one solution is, alright, let's at least try to spread the pain out. We have this one thread that's waking up all these other threads — and that's what Rik determined, by doing some profiling, to be the problem, the thing that was contributing the excess CPU. So what we should do instead is just wake up some of the threads, and then have them wake up the rest. If you can't solve the problem, try to sweep it under the rug a little bit, right? And that's what this commit does. The other thing that's good about it, though, is that it parallelizes the wake-ups. And we'll see that in a moment.
Just to focus on where we are on that previous diagram: everything's the same, and what we're worrying about is the interface between the grace-period kthread and the offload kthreads. So we're going to change the wake-up logic there to make things run more in parallel, and also to move CPU time from the top there down to the bottom, spreading it out over more threads. Alright, so this is how things looked beforehand. I couldn't fit 80 CPUs on the page, so we've only got 20 — you'll use your imagination for the other 60. So we've got this RCU grace-period kthread, and it does a series of 20 wake-ups, one after another. Okay, so that means it takes a while. A wake-up takes a couple of microseconds, maybe; it depends on the system. And so if you have 20 of them, that's maybe 40 microseconds. If you have 1,000 of them, that's 2 milliseconds. Alright, so this isn't really very scalable in some sense. So a hierarchy is actually kind of a good thing. And what we do is we just take the square root of the number of CPUs — and there happens to be an integer square-root routine already in the kernel, so that was convenient. So in our case, rounding down, we take our 20, we get 4. So we wake up 0, 5, 10, and 15 with the grace-period kthread, and they wake up their own groups. That means this guy is only doing 4 wake-ups instead of 20. That means if we have a huge machine, the number of wake-ups rises by the square root. So it's a reasonable number even with thousands of CPUs. And it means these guys run in parallel. So we can get everybody woken up in parallel and have reasonable response time at the end of the grace period. But interestingly enough — I mean, all I was trying to do was sweep stuff under the rug, and by accident I got rid of some of it. Sometimes you get lucky, you know? Sometimes it works.
If we have a busy system — and that's what Rik had: this system that was doing Java, constantly switching itself to death and queuing callbacks all over the place — then each CPU is going to post an RCU callback every grace period. So RCU is going to have something to do on each CPU all the time. And the old way of doing that, what happened is that each of the RCU offload kthreads gets woken up when it gets its first callback. The CPU does a call_rcu(): here's a callback. And RCU says, oh, is this guy running? No, he isn't. Okay, wake him up. So each of these offload kthreads gets one wake-up per grace period, when it gets its first callback for that grace period. And then what happens is that they all say, great, I need a grace period. So RCU does its thing, and at the end of the grace period it goes and wakes them all up. So each of these kthreads gets woken twice. So we have a total of 40 wake-ups for our 20 kthreads. It would be 160 for the 80 that Rik had. And those are two per kthread, and 20 of those — half of them — are done by the grace-period kthread. Okay, if we do it the new way, what happens is that 0, 5, 10, and 15 are woken when their first callback is posted. And that's not just their own callback, it's theirs or any of their followers'. So if CPU 3 suddenly says, hey, I've got a callback, it'll wake up the CPU 0 kthread. And then each of the kthreads is woken as the grace period ends. Those four leaders are woken by the grace-period kthread, and the remainder are woken by their corresponding leader kthread. And that means we only have 24 wake-ups, because we didn't have to wake up the other 16 offload kthreads when their callbacks showed up — the leader got woken on their behalf. And that means we had a 40% reduction in total wake-ups and an 80% reduction in the number of wake-ups that the grace-period kthread did.
It went down from 20 of them to 4, and we went down from 40 total to 24 total. Alright, so you can see, going back to this diagram, what's happening here. The top guys, the top four, are awakened when their callback is posted. The bottom 16 are not, because the top one gets awakened on their behalf. And then everybody gets awakened when the grace period ends. The grace-period kthread wakes the top four, and each of those top four wakes the four underneath them. Alright, so we spread things out, and by luck we reduced it a little bit. And if you add more CPUs, it gets better. This is a plot: the percent reduction in wake-ups, against the number of CPUs on the bottom. It goes out to 512. And you can see the number done by the grace-period kthread is reduced, by the time you get to 80 CPUs, by somewhere between 80 and 90%. So it's a pretty large reduction. And the total is reduced by over 40%, and as you get larger, the reduction approaches one half — a 50% reduction. So this is actually saving a fair amount. There is one dark side to this, of course. And that is that the offload kthreads at the top, the ones that have to do all the work and wake up their followers, are going to see an increase in wake-ups. And if we extrapolate from Rik's setup, we'd expect that at 512 CPUs we'd have a little over 10% CPU utilization on each of those. And there'd be, what, 9 of them in his case — or 8 of them, excuse me. And they'd be seeing about 10 or 11% CPU utilization. But only if you had a really seriously context-switch-heavy workload, and you had designated all the CPUs as worker CPUs that were not supposed to be interrupted. If you just booted normally and didn't have any of that, you'd be using the old method of callback invocation and you wouldn't see any of this overhead.
If you did designate a bunch of CPUs as worker CPUs and used them the way we intended — which is, you have a lot of CPU-bound stuff, or you have occasional real-time workloads — then again you wouldn't see much traffic in RCU, and your CPU utilization would be quite low. But even if you're not using them as intended, it's not too bad, even with large numbers of CPUs. So at some point, if we have really big systems and people for some reason run lots of really heavy-duty context-switch workloads in this configuration — and that might happen. I mean, you might have some workload where you do some heavy-duty HPC for 10 seconds, then you spend a second doing some stupid thing that does a lot of context switches, and then you go back to doing heavy-duty HPC. I don't know why anybody would do that, but I've been around for a long time, and there have been a lot of things that have happened that I would never have imagined. So there's some chance this will happen, and if it does, we add another level of hierarchy. Instead of having leaders and followers, we'll have leaders and followers and follower-followers, or something. But you know something? I bet that a lot of other problems in the kernel will show up first if we do something that does huge amounts of context switches on 500 CPUs. Anyway, so this looks pretty good. Not only did we spread out the workload, we reduced it. Sometimes you get lucky again. And the systems with lots of CPUs aren't likely — at least they don't now — to run these heavy-duty context-switch workloads. So what's not to like? Well, yeah. You guys know the drill. I said it was in mainline; I didn't say it was any good. That's right. Overachieving. And actually — I'll cover this a little later — there's a lesson that hangs off of that interchange, and we'll hit it in a later slide. Okay, anyway, what happened is that Amit Shah was reporting system hangs during boot. I wasn't seeing any. Rik wasn't seeing any.
Anyway, a guy named Pranith Kumar actually chased it down. Really good effort, got things going — and he actually produced the fix as well. And the problem was, in certain configurations, you would start the leader kthread, and if it happened to have a callback and just the right stuff was set up, it would just never process it. You'd have to give it another callback before it would start. And if that first callback was something the whole system was waiting to get done, it just would never happen. Okay, so you could sometimes, if things lined up just right, get a system hang. And the fix is that when the kthread first starts up, it just pretends it was told it has a callback, even if it doesn't, and that gets things started and life is good. Well, you guys get the idea. So Paul Gortmaker comes along on IRC and says, you know, why have I got all these RCU kthreads? He had — I can't remember whether it was 8 or 16 CPUs, but whichever it was, he had like 240 of these things, which is a bit excessive given that you're only supposed to have 2 or 3 per CPU. So even if he had 16, that'd be like 48 at most. Well, it seems that, as with all sorts of things in computers, people like to lie, and there's firmware that likes to lie about the number of CPUs on the system. It likes to exaggerate. So his firmware was saying, yeah, I've got like 80 CPUs, and RCU was stupid enough to believe it. I've got 80 CPUs? Okay, 240 kthreads, no problem. And of course, these things just sat idle, which isn't too bad, but they were cluttering up the ps output — if you do a ps, you see all these things. And if you're on a tight-memory machine (although if you have a tight-memory machine, maybe you should pick your firmware more carefully, but whatever), they'd be using memory that could be put to better use. But this is pretty easy to deal with. You just don't create the kthreads until the CPU comes online, right?
So if the firmware says there are 80 CPUs, you say, okay, I'll take your word for it for right now, and then when a CPU actually comes online, then you create its kthreads. And that means that if you have some bit of firmware that says it's got 80 CPUs and it's really only three, you'll only see the three come online and say, okay, nine kthreads, you're done. We had to do a little work here. The first commit kind of refactors the kthread spawning a little bit, and then I was able to do something that creates them only for online CPUs. And this worked great — worked great for Paul, worked great for me. But I think you guys know the drill by now. The only reason it passed the tests was that neither Paul nor I had modules enabled. And besides, even if I had modules enabled, my firmware doesn't lie about the number of CPUs. And it turns out that module removal often uses something called rcu_barrier(). So what the heck is rcu_barrier()? Well, suppose you have this kernel module, and the kernel module uses call_rcu(). So it's saying, okay, here's this function I want you to call sometime later. It's one of my functions in my module, and when it's safe, call it. And what that means is this callback, my_func(), is going to be called sometime later — and if RCU is really, really busy on the CPU that happens to have that callback, it might be a long time later, because a grace period takes a certain amount of time, and if you've got 10,000 callbacks — which, by the way, is not impossible; I have special handling in RCU for when a CPU suddenly gets 10,000 callbacks, and it does happen sometimes — then it may be a good long time after that grace period before your callback finally gets invoked. And in fact, it can be long enough that somebody might have removed the module in the meantime. If that happens, your my_func() is no longer in memory, and when RCU tries to call it, nothing good is going to happen. This can be an embarrassing, and sometimes fatal, surprise. This hazard has been around for a long time.
rcu_barrier() went in, I don't know, 10 years ago or something like that, because of things like this. What it does to prevent the problem is wait for all previously queued callbacks to be invoked. So the kernel module calls call_rcu(), and that callback might be invoked a long time later — we don't know when. When we go to unload the kernel module, its exit code calls rcu_barrier(), and that rcu_barrier() waits for all the callbacks that have previously been registered to get invoked, including the callback that's going to call my_func(). Then, and only then, do we unload the module, and that's OK, because all the callbacks are done and there's no problem. There are two things that make this work. One is that rcu_barrier() posts a callback on every CPU that has callbacks and waits for all of them to be invoked. The other is that RCU very carefully makes sure it invokes the callbacks from a given CPU in order. Because of that, you can just post those barrier callbacks, wait for all of them to get done, and then you're fine. Here's some pseudocode for how this happens. You call get_online_cpus(), because if CPUs come and go while you're doing this, you can get really confused. You set a counter to one, because you want to avoid some races, and for each online CPU, if that CPU has callbacks, you post another callback to it. If it doesn't have callbacks, you skip it — no problem, because you only have to wait for the callbacks that exist. What the barrier callback does is atomically decrement that counter we initialized to one, and if the counter hits zero, it wakes us up. Then we call put_online_cpus() — we've got everybody set up, so if CPUs come and go after that, no problem. We atomically decrement the counter ourselves, which closes the race against the callbacks and against CPUs coming online, and then we just wait for that counter to reach zero.
The problem is that with callback offloading we have these kthreads, and those callbacks might be around for a long time after the CPU goes offline. What's happened is we've said: OK, you, kthread, here's a callback. That kthread might be blocked behind higher-priority work, or just be busy, or whatever, and the CPU can go offline while that callback is still there. We still have to wait for it. So if we have an offloaded CPU, we have to give it a barrier callback anyway, even if it's offline, because it might still have delayed callbacks queued — just because the CPU is offline doesn't mean its kthread is gone. Unfortunately, in this new world where we try to deal with firmware lying to us, CPUs that have never been online don't have offload kthreads. And that means that if we post a barrier callback to one of them because it's marked offloaded, that callback is never going to be executed, which means our rcu_barrier() is never going to return. Which is not what we want. Anyway, this is a fairly easy change: we add to the condition that the offloaded CPU must also already have a kthread, and only then do we post a barrier callback to it. So that got another commit to take care of it. And this next one I actually found by inspection. I was getting a little paranoid by this time. It's one of those things where you have a reflex action: there's a bug, OK, I can fix it here — and then you forget about the consequences over there, like in rcu_barrier(). By the time I got to this point, it was: OK, fine, I'm going to re-review this code, given that bugs keep popping up at regular intervals. And by inspection I found this one. What happens is that the leader/follower lists work just great if the CPUs come online in order. If they come online out of order, you can lose followers, and those followers' callbacks never get processed.
So at that point I said, all right, fine, we're taking this code into user mode and making an exhaustive test of the whole thing — which I should have done to start with, by the way. Anyway, coming back to some of the lessons learned. One of the things you want to do is limit the scope of changes. We should have stopped to consider that NO_HZ_FULL would be used on any workload, at any time, by anybody, rather than assuming that if you want to use this, you recompile your kernel and enable this Kconfig option and boot parameter — which is where we were naively sitting. You want to make sure that you don't put innocent bystanders at risk: if somebody isn't using your feature, make sure it's disabled or otherwise rendered harmless. The other thing is that Linux runs a rather amazing range of workloads and hardware, and there's no way you're going to test everything, because there's just too much to test — which puts even more emphasis on the first point. Because you don't know what's going on everywhere, make sure you affect only the things you need to affect. Do no harm where you're not needed. Of course, as you may have noticed throughout this presentation, fixes can generate additional bugs. In fact, there's a guy named Murphy who says they will generate additional bugs, and there's another guy who says Murphy was an optimist — and that's the guy I agree with. Now, sometimes you get lucky, as I did on the one fix where I saved CPU time instead of just shoveling it into a corner. But the big thing is that none of these original bugs was that big a deal. I mean, 40% CPU consumption is embarrassing, but it doesn't crash the system. Having 240 RCU offload kthreads is kind of embarrassing and kind of annoying, but it doesn't crash the system. So if you have bugs that are minor or very restricted in nature, you need to be all the more careful about validating the fixes.
I mean, if you've got a deterministic boot-time panic, there's not much you can do to make that worse. Sure, you could put in something that bricks the system too, just to be on the safe side — no matter how bad things are, they can always be worse — but the odds of that are somewhat lower. If things are really bad, a random change is about as likely to make them better as to make them worse; if things are only slightly bad, a random change is probably going to make them worse. So what I should have done is be much more measured about fixing the bugs. It was kind of — I felt really bad. This stuff was out in the distros, and I had screwed it up, so I wanted to fix it right now. Sometimes you need to be a little more measured. And this is why we have lots of rules about what sort of stuff is accepted when, in case those rules have been irritating you at various times. People probably trust me more than they should, but this experience may well have fixed that problem. The other thing is that it's not enough to check your assumptions. You see, if you have an assumption that you've held for a long time — like, if somebody is using this feature, they built their own kernel — you build these towers of logic on top of it. Then you've probably adopted processes based on that logic, and then you've probably formed habits based on those processes. If you invalidate the assumption, you've got to work your way back through all of that, and of course the hardest part is changing the habits that are wrong because of the bad assumption behind them. It's not enough to say, oh, that assumption doesn't matter anymore; there's a lot of work beyond that to actually avoid implicitly making the assumption even once you know it's wrong. As was the case with the build-their-own-kernel thing.
So anyway, if I had been thinking that way when I saw that this was going to roll out into the distros, I would have panicked, started heavy validation, and screamed at other people to validate, and we probably would have at least found the problems earlier. Eventually we did find the problems, and it wasn't that big of a deal — not as bad as it could have been, though that's not saying much — but you do have to be careful. The next slides cover some additional things going on with bare metal, more information if you're using this, configuration cheat sheets — these slides are posted, so you can look at them later — boot parameters, and so on, and a summary about helping to make Linux work better for more extreme computing. And of course the obligatory slide sponsored by IBM Legal. I don't know if we have time for a question or two. OK, if people have questions, I'd be happy to take them.

Audience: A slightly double-barreled one. What happened to the user-land test? Does that go into a test suite somewhere? And why didn't user-land test suites catch this? Sorry — the new one that you wrote, what happened to it?

It's sitting there on my laptop. It's not a regression test. You have to know RCU, rip pieces of it out, drop them into this file, run the thing, and see what comes out. So it's a design test, or rather an implementation test, not a regression test. If this code keeps causing problems over and over again, I might script it — put some comments in, script the extraction and insertion — and run it as part of my regression test suite, but it'll have to break a few more times before I take that on.

Audience: You're young. You need the exercise. You can handle it. Paul, on slide 45 you say other linear scalability issues will strike first. Can you describe a few of them?

This one? Yes. So the question was: which scalability issues did I mean? Well, I don't know for sure, but I'm betting that some part of the kernel will strike first.
If you have 512 CPUs and a workload that context-switches every 200 microseconds on each CPU, I'm pretty sure some parts of Linux will protest rather vigorously about that treatment. I don't know what they are yet, but, you know, I might be wrong. It could be that I get hammered and nobody else does, in which case, well, I'll fix the bug, right? If there are no more questions, thank you very much for your time and attention. Hopefully this helps some people out, and have a great rest of the conference.