I'm Ankur, and today I'll be talking about dynamic paravirt lock ops. I'll start off with the motivation for this feature, then the state machine for queued spinlocks, then what switching PV lock ops involves — mostly the requirements and what we need for safety — and then the patching mechanism, which is breakpoint based, along with the design and implementation of V1 and the design of V2.

This whole thing started with wanting guests to be more dynamic, specifically in the sense that a KVM host advertises a hint to guests saying whether the vCPUs for the guest are oversubscribed or not. If they're not, then in quite a lot of respects it's fairly similar to bare metal, where at least you know that a different process will not schedule out your vCPU. So if the hint is true, a KVM guest would basically use native spinlocks; if it isn't — meaning it is possible that your vCPU threads can get scheduled out — you use paravirt lock ops. And that's good, except that this hint can become untrue at some point in the lifetime of the guest: the host could move to an oversubscribed mode for whatever reason, and the guest has no control over that. Then, typically, the guest would see soft lockups. The recommended fix for this is that a host should only advertise this particular hint if it can guarantee it for the lifetime of the guest. Now, that seems kind of reasonable, but it is unsatisfying, which is why most of this presentation is about switching paravirt locks, which are based on queued spinlocks.

Let's go over some of the interfaces involved. There are five interfaces for this, and you can see all of them outlined in blue.
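As a rough sketch of what those switchable interfaces look like grouped together — the names below follow the talk, not the exact kernel definitions (the kernel keeps these in its paravirt ops structures, e.g. `pv_lock_ops` in older kernels), and the function bodies here are illustrative placeholders:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the switchable lock interfaces; names follow the talk,
 * not the exact kernel struct layout. */
struct lock_ops {
    void (*queued_spin_lock_slowpath)(int *lock, int val);
    void (*queued_spin_unlock)(int *lock);
    void (*wait)(int *ptr, int val);    /* native: no-op; KVM: halts in the host */
    void (*kick)(int cpu);              /* native: no-op; KVM: hypercall */
    bool (*vcpu_is_preempted)(int cpu); /* the "somewhat special" one */
};

/* Native backends: wait and kick have nothing to do on bare metal. */
static void native_slowpath(int *lock, int val) { (void)val; *lock = 1; /* placeholder */ }
static void native_unlock(int *lock) { *lock = 0; /* literally just a store */ }
static void native_wait(int *ptr, int val) { (void)ptr; (void)val; }
static void native_kick(int cpu) { (void)cpu; }
static bool native_preempted(int cpu) { (void)cpu; return false; }

static const struct lock_ops native_ops = {
    native_slowpath, native_unlock, native_wait, native_kick, native_preempted,
};
```

Switching dynamically means atomically replacing every patched call site that was generated from a table like this, not just swapping the function pointers.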
Interestingly, the one interface which is not a PV lock op is queued_spin_lock itself, and that fact causes a fair amount of trouble when you're dynamically switching these, because now you have no way of gating entry into the state machine: you can gate the PV lock ops, but you cannot gate queued_spin_lock. For one thing, you don't even know all the call sites that queued_spin_lock is called from.

The other notable thing in this state machine is the squiggly lines. The interfaces on the other side of those are basically wait and kick (vcpu_is_preempted is somewhat special; I'll come to it). Wait and kick are no-ops for the native case, which of course makes sense. In KVM, wait is kvm_wait — it basically goes and does a halt in the host — and kick does a KVM hypercall and does what you would expect.

Now, having seen the state machine, let's go a little into why we don't just use paravirt locks all the time. You can see that the unlock fast path is different for the two. This slide doesn't say which is which, but it's easy to guess: in the native case it's literally just a move; in the paravirt case it's a locked compare-exchange, and if that comparison fails, you take the slow path, find the next node, and kick it. Clearly they are optimized for different use cases, but you would pay the extra cost of the compare-exchange even when you don't need it. Another example: queued_spin_lock_slowpath is pessimistic by default for paravirt.
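The two unlock fast paths just described could be sketched in C with GCC/Clang atomic builtins — this is a user-space sketch, not the kernel code, and the "slow path" sentinel value and helper name here are my own stand-ins:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the real PV unlock slow path, which finds the next MCS
 * node and kicks its vCPU; here it just records that it ran. */
static int slowpath_taken;
static void pv_unlock_slowpath(uint8_t *locked) { (void)locked; slowpath_taken = 1; }

/* Native unlock fast path: literally a store of 0 to the locked byte. */
static void native_unlock(uint8_t *locked)
{
    __atomic_store_n(locked, 0, __ATOMIC_RELEASE);
}

/* Paravirt unlock fast path: a locked compare-exchange. If the locked
 * byte is still a plain 1, clear it; if another CPU went through the
 * slow path and marked it (sketched here as the value 3), fall into
 * the slow path to find and kick the next waiter. */
static void pv_unlock(uint8_t *locked)
{
    uint8_t expected = 1;
    if (!__atomic_compare_exchange_n(locked, &expected, 0, 0,
                                     __ATOMIC_RELEASE, __ATOMIC_RELAXED))
        pv_unlock_slowpath(locked);
}
```

The point of the comparison: every native unlock is one plain store, while every PV unlock pays for a locked compare-exchange even when no waiter ever went to sleep.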
In the native case, queued_spin_lock_slowpath will spin for a little bit and then queue up if it still hasn't gotten the lock; in the paravirt case, the first thing it does is get an MCS node, in preparation to queue up.

All right, now that I've hopefully convinced you why we need this feature, what does it involve? Fundamentally, you have five interfaces, spread all over the kernel and its modules, and you need to switch all of them atomically. What does actually switching them involve? Here's the example of spin unlock: you have to transform the opcodes from one sequence to the other, and of course you might be doing it multiple times. The native queued_spin_unlock, as I showed on a previous slide, is basically a three-byte move plus a four-byte NOP. The PV queued_spin_unlock is a call to the actual function that does the dequeueing, followed by a two-byte NOP.

The call is interesting precisely because it is a call. Imagine a different CPU is doing the patching while one CPU is inside the PV queued_spin_unlock call — it has called into whatever the callee might be — and when it returns from the call, it returns to address 5 and expects to execute 66 90, the two-byte NOP. However, if you have managed to finish the switching by that point, the contents at address 5 are now the tail bytes of the native sequence, and as you can imagine, that will probably not go very well. To belabor the point a little more, you might have other ops which similarly go from a call plus NOP to just a NOP7, or back.

Now, one nice thing is that spinlocks cannot sleep.
Spinlocks can be held by active threads, but they will not be held by sleeping threads, and that's great. So what are the possible active users of spinlocks while we are doing this patching? It's really the usual suspects: tasks, softirqs, interrupt handlers, NMI handlers. The only context which might not be taking spinlocks is user-thread context.

Interrupt handlers especially are interesting, because the mechanism — which I'll describe more on a later slide — is text_poke_bp, which uses breakpoints and is essentially a three-phase patching process for each call site. So you write an opcode byte, and after you've finished writing it, you need to synchronize: you need to ensure that all CPU pipelines are synchronized. The caches are synchronized by default; what you have to worry about is pipeline synchronization, because it's possible that a pipeline has already prefetched some of these instructions, maybe decoded them into micro-ops and so on. So an IPI is sent out to synchronize the pipelines: when a remote CPU receives the IPI, it will eventually execute a serializing instruction to make sure its pipeline refetches the latest version of the instructions written. But sending IPIs also takes spinlocks, so we need to be prepared for that, or handle it in some way.

All right, now let's talk a little bit about the mechanism we'll use for this.
The mechanism is the fairly standard INT3 mechanism, which we use for cross-modifying code. The problem it solves is patching code while potentially executing the code that is being patched. The great thing about INT3 is that it's a single-byte instruction, so you know you can always atomically both write it and execute it. The way you use it is by replacing the first byte of the sequence you're modifying with the INT3 opcode, 0xcc; it serves as a barrier to entry. The one thing you do need is that the instruction sequence you're in the process of rewriting has just one entry point: it should only be entered via the first byte, the one you just replaced with 0xcc. If this barrier gets hit, control flow shifts to the INT3 handler, and in the INT3 handler, based on the address where it was hit, you know what sequence was supposed to be there — you know it was a PV lock op or what have you — so you know what to emulate. And PV lock ops start their life, when the kernel boots, as indirect calls, so fundamentally you are just executing those indirect calls, which are functionally equivalent to the opcodes that were stored at the site.

All right, now let's talk about what we actually do in V1. One way of sidestepping a lot of the difficulties I outlined earlier is to use stop_machine(). How we use it is: there's a patching CPU — the patcher — and a bunch of secondary CPUs, and all of them work in a lock-step state machine. Interrupts are disabled, and you don't need IPIs for sync_core().
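The three phases can be simulated on an ordinary byte buffer — this is a sketch, not the kernel's text_poke_bp(): `sync_all_cpus()` stands in for the sync_core() IPIs (a no-op here), and the instruction bytes used in the usage example are illustrative encodings of the two unlock sequences:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define INT3 0xcc

/* Stand-in for IPI-ing every CPU to serialize its pipeline; in this
 * single-threaded sketch there is nothing to do. */
static void sync_all_cpus(void) { }

/* Three-phase breakpoint patch of one call site, simulated on a plain
 * buffer. While the INT3 is in place, any CPU entering the sequence
 * would trap into the INT3 handler, where the op is emulated. */
static void poke_bp(uint8_t *site, const uint8_t *insn, size_t len)
{
    site[0] = INT3;                      /* phase 1: barrier to entry  */
    sync_all_cpus();
    memcpy(site + 1, insn + 1, len - 1); /* phase 2: write the tail    */
    sync_all_cpus();
    site[0] = insn[0];                   /* phase 3: write first byte  */
    sync_all_cpus();
}
```

The ordering is the whole trick: no CPU can ever observe a half-written sequence, because from phase 1 until phase 3 the only byte reachable at the entry point is the breakpoint.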
You also know that, given you are essentially hogging all the CPUs, no PV lock ops are on any stack, because all other threads are scheduled out and you're not executing any PV lock ops yourself. So it's a cowardly way of getting rid of a lot of the difficulties. The only remaining risk is NMIs, because NMIs can still come in — on the primary, on the secondaries, or both — and the NMI handler can then take a spinlock. If you're modifying that particular site right then, that spinlock would end up going into the INT3 handler, and you could have a deadlock there. To avoid that, the INT3 handler also needs to implement a subset of the state machine: because all CPUs are participating in the state machine, you have to ensure the state machine keeps moving whichever context you are in, whether thread context or INT3 context. With that, you can make forward progress. If you have multiple NMIs, that complicates matters somewhat, but I won't go into that right now.

All right, so this is an example of what the state machine really looks like. Normally you would use IPIs to do some of this; here, as you can see, CPU X essentially does an smp_cond_load_acquire() to progress to the next stage. The commented section walks through how the opcodes are arranged at each stage. The first step is that the patcher just patches in the INT3: it replaces the first byte with INT3, does a local sync_core(), and then changes the state with release semantics.
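The release publish and acquire wait just described could be sketched with C11 atomics — the stage names below are mine, and the real kernel primitive is smp_cond_load_acquire(); this is its moral equivalent:

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative stage names; the real series defines its own. */
enum { PATCH_IDLE, PATCH_INT3_WRITTEN, PATCH_TAIL_WRITTEN, PATCH_DONE };

static atomic_int patch_state = PATCH_IDLE;

/* Patcher side: publish the next stage with release semantics, so the
 * opcode bytes stored before this are visible to any CPU that
 * observes the new stage. */
static void publish_state(int s)
{
    atomic_store_explicit(&patch_state, s, memory_order_release);
}

/* Secondary side: the moral equivalent of smp_cond_load_acquire() --
 * spin until the state reaches at least 's', with acquire semantics. */
static int wait_for_state(int s)
{
    int v;
    while ((v = atomic_load_explicit(&patch_state, memory_order_acquire)) < s)
        ;
    return v;
}
```

The acquire/release pairing is what lets a secondary, after the wait returns, safely execute sync_core() knowing the patcher's byte writes for that stage have completed.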
All the secondary CPUs basically do an acquire wait for that state — say, INT3-written. Once they see that the INT3 has been written, they do their sync. Then the patcher CPU can go and safely write the rest of the opcode bytes; anybody trying to execute the sequence at this point would end up in the INT3 handler. So the patcher writes the rest of the bytes, the secondaries confirm it has been written and do their sync, and then all that remains is for the patching CPU to write the actual first byte, and the iteration is complete.

It works. The only problem is that this is stop_machine(), which kind of sucks. When I sent this upstream, there was a review comment — which in hindsight I think was pretty understated — that called it bong-hits-crazy code.

So, now V2. You want to patch multiple sites atomically, and your other CPUs could be executing arbitrary code, including spinlock code. In any case, patching even a single site is not atomic: there are multiple steps, and each step can itself get interrupted by an NMI. So the first step is to introduce a site-local barrier everywhere. What this allows you to do is control what executes: until the site-local barrier is in place everywhere, you just emulate the old code. So that's step one.

Step two: you need to introduce a global barrier. The idea behind the global barrier is that before this barrier you only execute old PV lock ops, and after it you execute only new ops. The important condition for this barrier is that there should be no spinlocks held — and thus no executing PV lock ops — in the system.
And once you have transitioned to the new PV lock ops, you're basically done: all that remains is to stop emulating, so you go back and replace the INT3 that you prefixed onto the code with the real bytes.

Most of the work is really in the global barrier. How do we actually get a global barrier when you have multiple CPUs? Fundamentally, you need to count all spinlocks under execution, and the counting needs to happen in the INT3 handler, which is good — that's the single point where everything converges. Of course, there is no real way of counting all spinlocks: queued_spin_lock is not a PV lock op, so you don't even know where all of its call sites are in the kernel or in modules. All you have is what you can get via the INT3 handler, which is the rest of the five interfaces — not queued_spin_lock.

So what property does this global barrier hold, essentially? The barrier itself is pretty simple: via RCU or a workqueue or something similar, you execute this patch barrier on all CPUs — let's say this is in workqueue context. At the point where a CPU is executing the barrier, you know that it does not actually hold any spinlocks in that context. So at that point you can switch that CPU to a state that says "barrier executed" — you can essentially ack that the CPU has executed the barrier, and that from this point on the CPU counts active lock ops. When the active-lock-ops count falls to zero, it means there are no ongoing spinlocks — PV lock ops, really — in the system, and at that point it's safe to switch. That's the property on the next line: all CPUs have executed this barrier, and there are no active lock ops in the system.
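That switch condition could be sketched with a pair of counters — a toy model, with an invented CPU count and function names of my own; the real series would track this per CPU and drive the counting from the INT3 handler:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4   /* illustrative */

static atomic_int cpus_past_barrier;  /* CPUs that have run the barrier  */
static atomic_int active_lock_ops;    /* slow-path sections in flight    */

/* Runs once on each CPU, e.g. as a workqueue item: in that context the
 * CPU holds no spinlocks, so from here on its slow-path entries and
 * exits are counted (via the INT3 handler in the real design). */
static void patch_barrier(void)
{
    atomic_fetch_add(&cpus_past_barrier, 1);
}

static void slowpath_enter(void) { atomic_fetch_add(&active_lock_ops, 1); }
static void slowpath_exit(void)  { atomic_fetch_sub(&active_lock_ops, 1); }

/* The condition from the talk: every CPU has executed the barrier,
 * and no counted lock ops are in flight anywhere. */
static bool safe_to_switch(void)
{
    return atomic_load(&cpus_past_barrier) == NR_CPUS &&
           atomic_load(&active_lock_ops) == 0;
}
```

Until every CPU has passed the barrier, the active count is incomplete, so the first clause gates the second: only once both hold is every lock op in the system provably accounted for.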
If the first clause is untrue — the number of CPUs past the barrier is less than the number of online CPUs — then you could have active lock ops on a bunch of CPUs, but only some of them are actually counting those active lock ops. Once the condition is true, you know that all active lock ops in the system are being counted. Now, there are issues here: spinlock is a really hot path, so you might have to wait a long time for this condition to become true, and until then you would somewhat slow down the system. If the load is too high, you can just abort, and you can do things like that. But this condition is sufficient to transition to the new stage.

A little more on what we are counting and how. First of all, notice what we cannot count: the fast path. We have no control over queued_spin_lock — it does not go through the breakpoint handler — so we cannot count invocations of the fast path. There can be spinlocks executing in the system that you only see when they call queued_spin_unlock. What you can count is the slow path, and the great thing is that the slow path also protects the queueing data structure, because that data structure only gets accessed in the slow path — or in the unlock, if a different CPU has gone through the slow path for the same spinlock. So you really only need to be able to tell the two kinds of queued_spin_unlock apart: one gets executed after a fast-path lock, so you should not be dropping a reference there; but you take a reference in queued_spin_lock_slowpath, which you should be dropping in the corresponding queued_spin_unlock. The one part of the data structure that gets accessed in both cases is the bit representation which marks whether the lock is taken or not.
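The reference pairing just described could be sketched like this — a toy stand-in for one CPU's per-CPU tracking state, with my own names and a small fixed-size table instead of a real bitmap:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one CPU's per-CPU state: remember which locks this
 * CPU entered the slow path for, so the emulated unlock knows whether
 * it pairs with a slow-path reference or with a fast-path lock. */
#define MAX_TRACKED 8
static const void *tracked[MAX_TRACKED];
static int active_refs;

/* Called from the emulated queued_spin_lock_slowpath: take a reference. */
static void slowpath_take_ref(const void *lock)
{
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (!tracked[i]) {
            tracked[i] = lock;
            active_refs++;
            return;
        }
    }
}

/* Called from the emulated queued_spin_unlock: drop the reference only
 * if this unlock pairs with a slow-path entry; returns whether it did. */
static bool unlock_drop_ref(const void *lock)
{
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] == lock) {
            tracked[i] = NULL;
            active_refs--;
            return true;
        }
    }
    return false;   /* fast-path lock: nothing to drop */
}
```

An unlock for a lock the CPU never slow-pathed simply finds no entry and drops nothing, which is exactly the fast-path case the talk says must not be counted.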
And that bit representation is constrained to be compatible for both spinlock types, because queued_spin_lock is the same for both, and so queued_spin_unlock has no choice but to use the same bit representation. So the only remaining problem is being able to tell the queued_spin_unlocks apart in the two cases, and that you do by just keeping some sort of per-CPU bitmap. And that's all there should be to V2.

You can find the code on GitHub: V2 is mostly design documents; for V1 you can get the code on GitHub and the patches on LKML. Thanks for attending the talk. If you have any questions, I'll take them now; if not, I'm happy to receive them over email — or if you want to collaborate, just drop me an email. All right, thank you very much.