So, yes, I'm going to be talking about context switching in PowerPC Linux. Thanks very much for not going to Rusty's talk. What we're going to be talking about today: we'll start off with a quick overview of what PowerPC is, then a little bit of an overview of the architecture, mostly as it relates to context switching. Then I'll talk about what we currently do for context switching in Linux on PowerPC, what we've done to change it, the new version, and then some results and some to-dos for what we could do in the future.

So, PowerPC, what is it? These are the nice big fat machines that IBM makes that I get to work on. In the middle there is the, as IBM affectionately calls it, the 795, which is a very large 1,024-thread machine with up to eight terabytes of memory. Very nice machine to play on. You can partition it up into lots of little machines, or you can have one great big Linux instance running on 1,024 threads. We also go down to very small machines; you can have a little fork or blade if you want, and these are all 64-bit PowerPC machines. Another PowerPC machine is the Blue Gene range of supercomputers. These guys have dominated the Top 500 supercomputer list for the last few years. They're not on top at the moment, but they have done very well in the last few years. These are based on 32-bit PowerPC cores. PowerPC is also used in embedded; the TiVo uses a PowerPC. And obviously we completely own the games market. The PlayStation has a Cell processor, which is a 64-bit PowerPC processor with two threads and eight vector processing units closely coupled to the CPU. The Xbox 360 is a 3-core 64-bit PowerPC, and the Wii is also a 32-bit PowerPC box. Of course, the big pink elephant in the room is these guys. The big pink elephant is hiding, apparently. Apologies, mate. There we go, there's the big pink elephant finally. So Apple was running PowerPC, but obviously they're not now.

So, PowerPC architecture. I'm just going to give you a quick overview of what it means for context switching. PowerPC is a very simple RISC architecture. We have 32 64-bit general-purpose registers. These GPRs are used on almost every instruction by every task in the system, and they're always context-switched in and out. We also have a bunch of extended state. We have floating-point registers: 32 64-bit registers, so that's about 256 bytes worth of data, a couple of cache lines on PowerPC. We also have vector registers, the AltiVec registers. There's another 32 of those, but they're 128 bits wide; these are similar to MMX or SSE on an Intel processor. And on Power7 we added the vector scalar registers, the VSX registers, which are 64 128-bit registers. So we're talking here about another 1,024 bytes worth of extended state that would need to be context-switched for every task. It actually turns out that VSX overlaps with the floating-point and AltiVec registers, so there's no additional state on top of FP and AltiVec; it overlaps. We also have miscellaneous control and status registers that say, if you divide two floating-point numbers, whether the result is zero or not-a-number or infinity or whatever. And, from an operating system's perspective, we can turn off these individual parts of the CPU. So you can say: turn off floating point.
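Just to make those sizes concrete, here is a rough sketch of the per-task register state being described. These are not the real kernel structures, just the arithmetic; the struct and field names are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical sketch of the per-task extended register state. */
struct fp_state {
	uint64_t fpr[32];     /* 32 x 64-bit FP registers   = 256 bytes */
	uint64_t fpscr;       /* FP status and control register         */
};

struct altivec_state {
	uint8_t  vr[32][16];  /* 32 x 128-bit AltiVec registers = 512 bytes */
	uint32_t vscr;        /* AltiVec status and control register        */
};

/*
 * VSX is viewed as 64 x 128-bit registers, about 1,024 bytes, but as
 * described above it overlaps the FP and AltiVec register files, so it
 * does not add a separate chunk of state on top of these two structures.
 */
```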
So if a user space program tries to run floating-point instructions while floating point is turned off, it will trap into the kernel, and the kernel can decide what it wants to do: whether it restores the state and lets that user space program run the instruction or not. We have that for FP, for AltiVec and for VSX, and these are quite important, which I'll talk about soon.

So, context switching. What do we currently do? Just a quick little operating systems 101 on what I'm talking about. If you're running two tasks on a system, Linux will switch between them. If you've got task A and task B, we'll switch between them many times a second to make it look like they're both running at exactly the same time. That's context switching. In reality it looks a little more like this: we have A running, it ends its time slice, the kernel preempts A at the end of its time slice and goes, I'll save its state away into memory. Then it calls into the scheduler, and the scheduler says, I want to run task B. So it gets task B's state from memory, pulls that into the CPU, and then starts task B. Task B then runs for its time slice, the kernel preempts it, and we wash, rinse, repeat with A and then B.

So at the moment we do lazy floating point restore. Now, I'm going to talk about floating point for most of the rest of this talk. I will, at the end, talk about AltiVec and VSX and how they differ, but for the majority of the talk I'm going to be talking about floating point. For floating point, we always save: at the end of a time slice, we always save the floating point state. But we restore on demand. What does that mean? Well, the kernel decides it wants to run A, and A starts with the floating point unit off. In this diagram we have A as a floating point intensive task and B as a non-floating point intensive task. So A runs along with the floating point unit off. Then it wants to run a floating point instruction, but because the floating point unit is off, it traps into the kernel, and the kernel goes, ah, okay, you want to run floating point: I'll restore your state from memory into the floating point unit, turn the floating point unit on, and continue. Then A can continue, rerunning that instruction, until it ends its time slice. At the end of the time slice the kernel goes, I'd better save that state away. It knows that A used the floating point unit, so it always saves that state. Then it calls into the scheduler again, the scheduler decides it wants to run B, and we restore B, but here we restore B with the floating point unit off. And this is great. It means that with this mechanism, if A and B are not floating point tasks, we never have to actually save and restore the floating point state. And it's on a per-time-slice basis. Just extending that a little longer, you can see that we have the original A, then B, then A runs again, and we have to go and trap again. So every time A starts its time slice, we have to trap into the kernel to restore its state. The next slide elaborates on that a little more: we split it out into the CPU's task at the top and the floating point unit's task at the bottom. Sorry, it's dropped off the edge of the screen there; that's the floating point unit's task on the left-hand side. It shows up OK here, so sorry about that.
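Before walking through the diagram, here is a rough C-level sketch of that existing scheme: always save at the end of the time slice, restore on demand via the trap. It is illustrative only; the real PowerPC code is hand-coded assembler in the entry paths, and the task fields and helper functions here are made-up stand-ins.

```c
/* Made-up stand-ins for this sketch, not real kernel symbols. */
struct fp_state { unsigned long fpr[32]; unsigned long fpscr; };  /* ~256 bytes */
struct task {
	struct fp_state fp_state;   /* in-memory copy of the FP registers */
	int used_fp_this_slice;     /* did the FP unit get turned on?     */
};
extern void save_fp_registers(struct fp_state *s);     /* FP regs -> memory */
extern void restore_fp_registers(struct fp_state *s);  /* memory -> FP regs */
extern void enable_fp_unit(void);    /* set the FP-available bit            */
extern void disable_fp_unit(void);   /* clear it, so FP use traps           */

/* Switching a task out at the end of its time slice: always save. */
static void switch_out_fp(struct task *prev)
{
	if (prev->used_fp_this_slice) {
		save_fp_registers(&prev->fp_state);
		prev->used_fp_this_slice = 0;
	}
	disable_fp_unit();   /* the next task starts with the FP unit off */
}

/* "FP unavailable" trap: a task touched FP with the unit off. */
static void fp_unavailable_trap(struct task *cur)
{
	restore_fp_registers(&cur->fp_state);   /* restore on demand */
	enable_fp_unit();
	cur->used_fp_this_slice = 1;
	/* return to user space and re-execute the trapping instruction */
}
```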
The CPU's task at the top, the floating point unit's task at the bottom. You can see that initially the floating point unit isn't really associated with anything. Then when A runs that first floating point instruction, the floating point unit gets restored with A's state. Then, before B runs, that floating point state gets flushed out to memory, so the floating point unit again has no state associated with it. You can see that while B is running there's no floating point state, up until A traps again a second time. So currently there are three possible places where the state could be for a particular task. If a task is not running, we always save its state away at the end of its time slice, so it's always in memory. If it is running, the floating point state can be in the floating point unit, if we have used floating point in this time slice, or it could still be lazily out in memory. I'm going to expand on this as it gets more complex.

So what do we do that's new? We were restoring on demand, and we've kept that, but we're now going to save on demand as well, rather than saving at the end of the time slice, and we're going to try to reuse the floating point state if possible. What does this mean? Again we have A and B: A is floating point intensive and B is not. You can see A running at the start with the floating point unit off, and the floating point unit owns the A task, or is associated with the A task. Then the kernel starts task B, and it knows B is not going to use the floating point unit, so B comes in with the floating point unit off and we don't change the floating point unit's task. You can see "my FPU now", that's great. And it doesn't change the FPU's task. So when A starts running again, the kernel knows that the floating point state is already there, and it can just start A with the floating point unit on and not have to bother restoring those 256 bytes of state. So this is great. If C comes along and C is a floating point intensive task, we still need to save and restore, obviously. You can see here C comes in with the floating point unit off. The floating point unit is associated with task A; then, when C traps into the kernel, the floating point unit changes over to C: the kernel saves A's state, restores C's, and then C can run with the floating point unit on. So we still have to take this trap if we're running two floating point intensive tasks. Going back to the possible places where the state could be: if a task is not running, we now have an extra possibility, because the floating point state could still be sitting in a floating point unit somewhere, not yet lazily saved away; it could also be in memory. If it is running, it could be in the floating point unit or in memory, the same as before.

So this is all very simple. This is great. Hopefully we've saved some time. But we're missing a big piece here, and the big piece is SMP, multiprocessor. What happens if we start migrating tasks from one CPU to another? What do we mean by this? So now we have CPU0, or PU0 apparently on the slide, at the top and CPU1 at the bottom; no floating point units in this diagram yet. CPU0 starts off running task A and then switches to some other task, doesn't matter what it is. Then CPU1 comes along running some other task, and it wants to switch over to A. But possibly we have our floating point state still on CPU0.
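Here is the same idea as a rough sketch for the single-CPU case, again with made-up names rather than the real kernel symbols: keep a note of whose registers are live in the FP unit, and only save or restore when somebody else actually needs it.

```c
/* Made-up stand-ins for this sketch, not real kernel symbols. */
struct fp_state { unsigned long fpr[32]; unsigned long fpscr; };
struct task { struct fp_state fp_state; };
extern void save_fp_registers(struct fp_state *s);
extern void restore_fp_registers(struct fp_state *s);
extern void enable_fp_unit(void);
extern void disable_fp_unit(void);

static struct task *fpu_owner;   /* whose registers are live in the FP unit */

/* Switching a task in. */
static void switch_in_fp(struct task *next)
{
	if (fpu_owner == next) {
		/* The FP unit still holds next's registers from its last
		 * time slice: no 256-byte restore, just turn the unit on. */
		enable_fp_unit();
	} else {
		/* Start with the unit off; if next touches FP it traps and
		 * we lazily save the old owner and restore next then. */
		disable_fp_unit();
	}
}

/* "FP unavailable" trap: the running task touched FP with the unit off. */
static void fp_unavailable_trap(struct task *cur)
{
	if (fpu_owner && fpu_owner != cur)
		save_fp_registers(&fpu_owner->fp_state);   /* save on demand */
	restore_fp_registers(&cur->fp_state);
	fpu_owner = cur;
	enable_fp_unit();
}
```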
So we've now put the floating point units back into the diagram. We've got CPU0 and floating point unit 0 at the top, and CPU1 and floating point unit 1 at the bottom. Somehow CPU1 has to get that state from CPU0. Just going back to where our floating point state could be for a particular task: the top part is the same, if we're not running it's either in an FPU or in memory. If we are running, we have an additional possibility: it could be in our current FPU, if we've used floating point in this time slice, or it could be in somebody else's floating point unit, a different CPU's floating point unit. So we've now got five possible places where our floating point state could be.

So how do we do this CPU migration? CPU1, at the bottom, decides it wants to run A. It sends an IPI, an inter-processor interrupt, over to CPU0 and says, please go and dump that state you have. So whatever task is running on CPU0 stops, gets preempted, and the kernel takes the state it has in the floating point unit and dumps it out to memory. It then communicates back to CPU1 to say, that's all done, you can start A now. So that's great. The problem here is that the kernel on CPU1 is just sitting around waiting for CPU0 to do this, and that could possibly take a long time, and it's pure kernel overhead. We'd really prefer not to do that. This is the slow case of CPU migration, so let's try to make it faster.

What we decided to do was start the IPI early, but then only wait for the result when we actually need it. Essentially we're reusing our lazy restore here. So this is the same again: CPU1 sends an IPI to CPU0, but instead of waiting in the kernel, it just starts task A straight away, with the floating point unit off, because it knows it doesn't have the state yet. CPU0 goes and dumps the state, as we saw a couple of slides back, then reports back to CPU1 to tell it the state is there. And fortunately, in this case, it does that before A runs its first floating point instruction. Then A does run its first floating point instruction, it traps, the kernel goes, cool, my state's in memory ready to go, it's not on somebody else's CPU, I'll just restore A, and away we go with the floating point unit on. So this is the fast case. This is good.

But CPU0 could take a long time. This is exactly the same diagram as the last one, except we've changed the timing slightly. Here CPU1 IPIs CPU0, but CPU0 takes a long time for whatever reason: it could be running with interrupts off so it doesn't receive the IPI for a while, it could have been sleeping, whatever. So it takes a long time to dump that state out, and task A runs a floating point instruction very early, before the state has come back. So we now have to actually wait in the kernel. Traditionally these floating point traps are handled in hand-coded assembler and they're taken with interrupts off, and waiting like this with interrupts off is generally a very bad thing to do in the kernel; it's a nice recipe for deadlocks if you don't have any timeouts. So what do we do? We first check, in the kernel, with interrupts off, whether the state is there. If it is, great, it came back in time, and we can just restore A into the floating point unit. If it's not there, we have to turn interrupts on. So we turn interrupts on and we sit there and wait for the state to arrive.
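Roughly, the slow path looks like the sketch below. As before, this is illustrative only: local_irq_enable(), local_irq_disable() and cpu_relax() are real kernel primitives, but the other names are made-up stand-ins, and the real code lives in the interrupt entry assembler rather than plain C.

```c
/* Hypothetical helpers for this sketch. */
struct task;
extern int  fp_state_in_memory(struct task *t);    /* has the dump landed?      */
extern struct task *this_cpu_fpu_owner(void);      /* whose state is in our FPU */
extern void flush_this_cpu_fpu(void);              /* dump it out to memory     */
extern void restore_fp_state(struct task *t);      /* memory -> FP registers    */
extern void set_this_cpu_fpu_owner(struct task *t);
extern void enable_fp_unit(void);

/* "FP unavailable" trap on SMP when our state may still be on another CPU. */
static void fp_unavailable_trap_smp(struct task *cur)
{
	/* We arrive here with interrupts off, as in the real trap path. */
	if (!fp_state_in_memory(cur)) {
		/* Spinning with interrupts off invites deadlock, so turn
		 * them on while waiting for the other CPU to dump it. */
		local_irq_enable();
		while (!fp_state_in_memory(cur))
			cpu_relax();
		local_irq_disable();
	}

	/* While interrupts were on we may have been preempted, so somebody
	 * else's state may now be in this FP unit: flush it before we
	 * restore our own. */
	if (this_cpu_fpu_owner() && this_cpu_fpu_owner() != cur)
		flush_this_cpu_fpu();

	restore_fp_state(cur);
	set_this_cpu_fpu_owner(cur);
	enable_fp_unit();
}
```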
When it does arrive, when it eventually arrives, we turn interrupts back off again and we check to make sure we didn't pick up somebody else's state. While interrupts were on we may have been preempted off somewhere else and potentially picked up somebody else's floating point state, so we need to dump that before we restore A.

So how do we implement this? We've got two new pointers. The first is a per-CPU pointer: if you want to use the floating point unit associated with this CPU, you need to be able to dump out whatever state is currently sitting in it, so this pointer tells you whose state you currently own, so that you can dump it. We've also created a per-task pointer: if you want to context switch a task in, you need to know where to get its state from, which CPU it's on. So this tells you which CPU your state is on, or that it's in memory.

I promised I'd mention AltiVec. AltiVec is exactly the same, wash, rinse, repeat. The control logic around it is identical; the actual save and restore instructions are different because they're different registers, but other than that it's all pretty much the same. VSX: I said VSX overlaps with the floating point and AltiVec state, and in actual fact there's no additional state that VSX adds on top of FP and AltiVec. We've got a question up here. Maybe you'll get to this, but can you actually prove that this is more efficient than just saving the state, when you're doing SMP and extra things like the FPU and AltiVec are in use? Well, I'm going to talk about some results later on, so maybe I'll answer it then. We also have the additional nice case where, if A is a VSX intensive task and B is not, and we know that the FP and AltiVec state can both be reused, then the VSX state can be reused as well and we don't need to save and restore that either.

A few other issues. We have a merged 32- and 64-bit kernel. I mentioned that these floating point traps are all hand-coded assembler, so getting them to work on both 32- and 64-bit, with different MMUs, and in paths that run with the MMU on and off, and making sure all of that worked, was somewhat non-trivial. Thank you, Ben, for helping. Power save: there are a couple of issues here. One is that if you're going to go into power save, you want to get rid of any state you're holding so that you don't get woken up later to go and dump it. So that's one reason why, before you go into power save, you want to flush the state you have. The second reason is that on bare metal systems you may power down the package completely, or power down the core, and the state may be corrupted, so you really need to dump it in that case. Hot-plug CPU is very similar to power save: you need to make sure that if a CPU is going to leave the system, you've already dumped its state, otherwise it won't be accessible any more. Different config options: this was a fairly minor issue. There are about six or seven patches in this series, and making sure they all compiled, especially in 32-bit land where people actually care about the different combinations of SMP, FPU, AltiVec and VSX, and that the entire patch series was bisectable with all these different config options, was again non-trivial. Signals and ptrace: because we've now got these additional places where our state can be, if someone is ptracing you and wants to know your FPU state, you need to make sure it's dumped out to memory.
Ptrace and signals will read it from memory; they won't go and read it directly from the floating point unit. So you need to make sure it's dumped out before you actually go and read it. The same when you're generating a signal context for a process: you need to make sure it's dumped. So, to hopefully answer your question a little, this is the patch series, the complete patches. It's actually not that invasive. It's mostly in PowerPC, and if you look at the bottom line here, 635 insertions and 315 deletions, we're only adding about 320 lines of code. I had a quick look at the patches the other day, and about 150 lines of that is documentation. I probably removed a little documentation too, but it's not a horribly invasive patch series. It does add some complexity, but we're pretty smart in PowerPC land. Well, at least we think we are.

So, testing: what did I do to test this? I actually coded up the AltiVec version first, because AltiVec isn't used nearly as much as floating point, and that made it easier to get a system booting and running so I could run tests. Once AltiVec was somewhat stable, I started implementing the floating point version. Once I'd done that, just booting a distro kernel was a test in itself, because glibc, especially in 64-bit, will use VSX in its memcpy, so you notice pretty quickly if you start corrupting state. So just booting a distro up to user space with all its daemons was a reasonable test at that point. Then I wrote a custom floating point test. I don't care what the computation is at this point; I just care that I'm not corrupting state in user space. And, to come back to what you were saying, it is reasonably complex, and if you screw up, the consequences are quite bad: you corrupt user space state, and that's a really, really bad thing to do. So with this test I was trying to make sure we're not corrupting state at all. It's a very simple test; the meaty bit is in the middle. a and b get set equal to some constant value, and then we just check that they're still equal. We sit there and spin on that, and if it ever pops out into that printf, we know we've corrupted our state. A couple of interesting things to note about this test. If you just have while (a == b), without the b = c, and a and b are non-volatile, GCC just compiles that out to a branch-to-self. If you make a or b volatile, it won't compile it out, but when you do the printf, a and b will be reloaded from memory with values that weren't actually corrupted. So with the combination we've got up there, where c is volatile and we reload from it, you still get to see the corrupted state, but you don't get GCC getting in your way.

So, some results. I started off writing a very simple benchmark. Again, I don't care what the actual floating point computation is; I just care that more time is given to user space than to the kernel, that the kernel overhead is reduced. So all I did was write a simple test that started at zero, counted up to a maximum, and timed how long it takes. If I do this: no performance improvement. But there were no regressions either, so that's good. But why is that? Go back to our original diagram. It looked like this: you see A, then a bit of kernel, then B, then a bit of kernel, then A. Unfortunately, it doesn't really look like this.
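To make that corruption test concrete, here is roughly what it looks like as code, reconstructed from the description above rather than copied from the slide, so the constant and the exact shape are illustrative. The idea is that c is volatile, so GCC cannot optimize the loop away, while (with optimization on) a stays in a floating point register for the life of the loop, so a corrupted register actually shows up in the printf.

```c
#include <stdio.h>

int main(void)
{
	volatile double c = 1234.5678;  /* some constant; volatile so GCC
	                                 * has to keep reloading it        */
	double a, b;

	a = b = c;
	while (a == b)
		b = c;                  /* reload b from the volatile      */

	/* Only reachable if a floating point register got corrupted
	 * underneath us, for example by a broken context switch. */
	printf("state corrupted: a = %f, b = %f\n", a, b);
	return 0;
}
```

You run a few of these spinning away, bounce them around the machine, and see whether any of them ever print. Anyway, back to the simple counting benchmark and why it showed no improvement: that tidy diagram of A, a bit of kernel, B, a bit of kernel is not what really happens.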
It actually looks a lot more like this: A, then a tiny little sliver of kernel, then B, then another tiny sliver of kernel. In fact, if you run just two processes on PowerPC that don't do any operating system calls, you only end up getting about 40 or 50 context switches per second, so the kernel really doesn't contribute much to this test. So I thought about how we could exaggerate the kernel aspect, and I started talking to Anton. Anton Blanchard has a little context-switching micro-benchmark, you can find it there, which seems to be loosely based on bw_pipe in lmbench. And this is the core of the test: create two pipes, fork, and one side goes off into a while loop that does a read then a write, and the other side does a write then a read. Task A writes, task B reads it: two closely coupled tasks. So hopefully, what was looking like those tiny little slivers of kernel will look more like this, with gigantic chunks of kernel. Oh, sorry, I forgot to mention that when we run this test we bind both tasks to the same CPU, so they don't end up on different CPUs. If we do this, we get about 40,000 to 50,000 context switches per second, so a lot more kernel overhead. So if we run this, what happens? No performance improvement. But no regressions again.

But we've completely forgotten about floating point. So what happens if we stick a little bit of floating point inside A and leave B without any floating point? Then we hopefully get that nice case we saw before, where A just owns the floating point unit and we never actually have to context switch those 256 bytes. If we do that, what happens? We get a 4% improvement in context switching rates with floating point. With AltiVec we can get another 4%, and with VSX we can get 8%. It doesn't seem like much, but user space gets this for free, they don't have to do anything, and we're just reducing kernel overhead, which is always a good thing. Or generally a good thing, sorry. If we put floating point in both sides, sorry, did you want to? If we put floating point in both sides, what happens? In this case we're still going to take the trap and we still have to save and restore. What happens? No performance improvements, but no regressions either. lmbench: I decided I should actually run an industry standard benchmark, and lmbench does quite a few operating-system-level tests. No regressions with lmbench, but no significant performance improvements either.

So, I promised a bit of a to-do list. Signals and ptrace: I'd like to do more testing around signals and ptrace. I haven't done a lot of testing in that area yet, and it has historically been a very bug-rich environment on PowerPC. More benchmarking: you can always do more benchmarking. Lazy allocation of the save area: this is some more coding we could do inside PowerPC. For VSX specifically, we have 1,024 bytes of save area allocated for every task; even if a task never uses VSX, it always gets statically allocated. So if you have, say, 1,000 tasks, 1,000 times 1,024 bytes is about a megabyte of kernel memory, which is somewhat significant. Unfortunately, because glibc uses VSX in its memcpy, it's probably going to be allocated for almost every task anyway, so this may not actually help at all in reality.
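One quick aside before the last to-do item: for reference, the core of that pipe ping-pong test looks roughly like the sketch below. This is written from the description above, not Anton's actual source, so treat the details as illustrative.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

/* Simplified sketch of a pipe-based context switch ping-pong: two pipes,
 * fork, and the two processes bounce one byte back and forth.  Pinned to a
 * single CPU, every bounce forces a pair of context switches. */
int main(void)
{
	int fd1[2], fd2[2];
	char buf = 'x';
	long i, iterations = 100000;

	if (pipe(fd1) || pipe(fd2)) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {
		/* Child: read then write, until the pipe is closed. */
		while (read(fd1[0], &buf, 1) == 1)
			if (write(fd2[1], &buf, 1) != 1)
				break;
		_exit(0);
	}

	/* Parent: write then read; each loop is one round trip. */
	for (i = 0; i < iterations; i++) {
		if (write(fd1[1], &buf, 1) != 1 ||
		    read(fd2[0], &buf, 1) != 1)
			break;
	}
	close(fd1[1]);   /* let the child's read() see end-of-file and exit */
	wait(NULL);
	return 0;
}
```

As in the test described above, you would pin both ends to the one CPU, for example with taskset, and add a little floating point, AltiVec or VSX work to one or both sides to exercise the different cases.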
The other one that might be of more use, and I think this might be somewhat what you were alluding to, is that we could just auto-restore. If we think a task is going to use floating point, then instead of doing it lazily we could restore it as we do the context switch. We'd still have to actually copy that 1,024 bytes, about four cache lines, but we'd avoid the trap. The trap shouldn't cost us too much on PowerPC, but it will cost us something, so this could potentially buy something. It's on my to-do list. And that's it, so thanks very much. Have we got some time for questions? We do.

Yeah, me again. So this kind of sounds like a solution looking for a problem. What was your use case that made you say, oh, we need this? I work with embedded, so for me it would be, well, we can get a smaller kernel footprint or something in memory, which probably isn't the case, because I think you've added about 150 lines of complexity to your kernel port, which isn't always good. But what about power saving? When you turn the FPU off, can you power it down or something to save power? In which case maybe there is a case for just always saving the state out and always powering it down. And another question: at what point does the use of the FPU and the extra state make it worthwhile interrupting the other processor to get it there, versus just leaving it there? I'm just curious whether you're going to look at those things. I think we've got three questions there. Yeah.

My motivation was probably somewhat disappointing. Being part of the IBM team during Power7, I coded up the VSX side, started to get familiar with this code, and thought this seemed like a cool idea, so let's give it a go. Yeah, and Ben's just pointing out something I didn't mention here: we did actually do lazy save and restore when we didn't have SMP enabled. In actual fact, I think that made the code more complicated, because there were a lot of ifdefs in there; this work actually cleaned up a lot of that, because the embedded guys do care about this, they actually care about compiling non-SMP. So my main motivation at the start was that this is kind of cool. One of the things I probably should have done first is run that benchmark with the closely coupled A and B and see where this showed up in perf; it turns out to be about the fifth or sixth item, at about 4%. So I probably should have done that first, but maybe I'm not as smart as us PowerPC guys think we are. Power saving, if we turn the FPU off: I guess there's potential there. I'd have to talk to some of our hardware designers about whether that's realistic. If you're talking about a multi-threaded core, you've got other threads, at least on Power7 we have four threads per core, and you're only turning it off on one thread, so you'd have to coordinate with the other threads somehow, or hope that the other threads turn off their floating point units as well, because architecturally, behind the scenes, they're actually sharing the same FPU. So I guess there is a potential for some power saving there. Something to be a little bit careful about as well is that on those processors the architectural state of the FPU is preserved when it's turned off. We have something in the architecture for embedded; I don't know if we kept it or kicked it out.
I don't remember, Paul might. We had some wording in there that said that some embedded processors were allowed to trash the FPU state when the FPU was turned off. I don't think that was ever actually implemented in Power, but if that were to come back, especially in the embedded field, we would have to make this an optional thing at runtime. But we could potentially turn off the ALUs inside the FPU and keep the register files, because that's what we really care about. And your last question has escaped me, sorry. Who came to you and said, oh, we're noticing a lot of context switching overhead? Well, we have had some customers who've been trying to optimize these loops a bit, so it has had some attention in the past, so it was partly that as well. The glibc guys: Peter Bergner, who is one of the glibc guys inside IBM who does PowerPC, was also wanting to see some performance improvements, and we had a bit of a discussion about it on our external list. So there was some push beyond just my "this seems like a cool thing to code up". I think Ben wants to comment; Ben is the PowerPC maintainer, by the way.

Keep in mind that otherwise we actually have to save and restore one K of state on every context switch, which can be quite a significant latency considering the speed of the CPU versus the memory and the caches nowadays. Also, by the time you context switch, the cache lines for that state have probably been pushed out, so you're going to take cache misses. With 1,024 bytes you've got four cache lines, plus the FP and miscellaneous state, plus the VSX, so you've got probably six cache lines. So essentially, on SMP, you would always have that cost, even if your task is not migrating and is not being switched with some other task. The main point is to be able to maintain the state across multiple context switches of the same task when no other task gets in between, and the corollary is that to be able to do that on SMP, you need to bring in the whole shebang with the interrupt and everything. But we don't necessarily care about the migration case, because it's a rare case; we just want to enable the non-migration case, and we had to solve the SMP problem to be able to do that.

There was a talk here last year by some guys from IBM LTC in India about optimizing the scheduler to place tasks based on what resources they share. They were looking at caches, so you try to split apart cache intensive or memory intensive tasks. I think you could possibly do a similar thing with this, where you try to split apart FPU intensive tasks, or keep them on different CPUs, so that you get more of this good case that we have. Did you try benchmarking the ideal case, where you've got a single FPU intensive task on a single CPU with affinity set? Like your ideal "I want to run a floating point monster, but it's my machine, that's all I'm doing" case, to see the context switch saving? Well, that was the first benchmark I ran. But didn't you have four processors, while you still have the context switch with another process? So the diagram and what I actually did were a little bit misaligned there. The test that I ran was actually on a single processor, but I did run it with one and with two. But again, there was no performance improvement.
But that's just because you don't get many context switches with one process. So I'm curious, you said that glibc uses the VSX stuff for everything? Not everything, it uses it for memcpy. Okay. 64-bit glibc will use it in memcpy. But that means that every context switch will have to save the VSX state. Would there be a win in patching glibc to stop doing that, or is there more of a win from glibc using it? Well, this is a per-time-slice thing: if you're not calling glibc's memcpy in a given time slice, then you won't use VSX in that time slice, so it won't have to be swapped in. Yep, and kernel threads as well. Because when a task context switches out we don't know what else is going to come in, so previously we had to save the state regardless, even if what comes in behind it is a kernel thread. And, for example, with a heavy network benchmark you can have quite a lot of softirq daemon work coming in and things like that, and those don't have any FP state, so that limits the impact they have on the underlying floating point task as well. Yeah, we hardly ever use floating point inside the kernel. I think David... I have a question. Did you gather metrics? Did I gather metrics on how often this worked? In my ideal case, almost every time; it actually did work. That case back where I talked about the slow IPI, where we had to actually wait on CPU1 for the data to come back from CPU0, was actually quite rare, even on a heavily loaded system with tasks bouncing around. I forgot to mention that with that state corruption test, I ran a number of them in my testing and bounced them around the machine to make sure it was doing what I hoped and wasn't corrupting state as they bounced around. It wasn't that often that we hit these really slow cases. In the kernel we send the IPI very early, and it takes us a while to actually get back to A even with the floating point unit off, so there's actually quite a bit of time for that IPI to get through. I guess it would also depend on what drivers you're running and how long interrupts are off and things like that, so it would depend a little bit on what the workload is, and maybe I just didn't hit the bad case. But the lazy case worked: if I had A and B tightly coupled like that, it hit it almost every time. I could see in perf, and in my little stats, that the number of lazy reuses was pretty much one for one with the number of context switches.

So why didn't it give you a bigger performance improvement? You're talking about the 4%; why not the whole thing? Because I don't think, in reality, it's that much overhead compared to the rest of the kernel. Looking at the perf output, there are a lot of other things going on, and I actually had to disable some things that the distros have, like auditing, and a couple of other things, just to get that 4%. So let's keep that between you and me. So yeah, it just isn't actually a huge part. But it's kind of cool. Any more questions? We do have time for another question or two. Could you all put your hands together for our speaker? Thank you.