Hello, everyone, and good afternoon. My name is Oliver Upton. I'm a software engineer at Google, and I've been working on KVM for the past few years in support of Google Cloud. Today I'd like to share some of the work my team has done on scaling KVM on ARM.

First, a quick introduction. Google Cloud recently announced the T2A family of virtual machines, which is its first offering built on the ARM architecture. The product is currently in public preview, and it's built on the Ampere Altra SoC. Similar to our x86-based offerings, we're using a KVM-based virtualization stack. One decision we made was to use a new kernel, the so-called Icebreaker kernel, which was presented last year by some of my colleagues at Google. It's intended to be a close-to-upstream kernel for Google production workloads, and we chose it for T2A so we could stay as close as possible to upstream and focus the majority of our engineering effort there.

So, on to today's topic. In preparation for running virtual machines on ARM, we spent some time trying to scale the ARM port of KVM, and it turns out it actually runs quite well in steady state. But we found a rather familiar problem: dirty tracking. We noticed a lot of MMU lock contention when we enabled dirty logging for pre-copy. On ARM we have to use write protection for dirty logging, since there's no alternative such as Intel's PML. As a result, we were taking a very high rate of stage-2 aborts at the beginning of pre-copy, and dirty state is always tracked at PTE granularity, which will be relevant in a little bit.

To characterize the worst-case scenario, we built a test workload. We chose our largest VM, a 48-vCPU VM with 192 GB of RAM, backed by 2M HugeTLB pages. Inside it we ran a userspace that spins up a thread on every core, each doing as many writes to memory as possible, striding at page-size granularity. We first let it pre-populate memory, and once it reaches steady state we enable dirty logging and measure how quickly the guest can continue to dirty memory.

Running on the initial kernel, we saw something that wasn't quite so "live" for live migration: a rather significant degradation in performance when dirty logging was enabled, on the order of 99%. That was unfortunate enough, but even worse, the blackout period (well, you couldn't really call it a blackout, more of a brownout period) was around 30 seconds. Effectively, we killed the VM.

When we looked under the hood, we saw a familiar problem. Much like on x86, we saw contention on the MMU lock. Initially on ARM the MMU was protected by a spinlock, and when dirty logging is enabled we split huge pages lazily: rather than doing an eager page split, where we shatter all of guest memory down to 4K at the beginning of dirty logging, we do it in the vCPU fault path. So naturally, we went about fixing a familiar problem with a familiar solution and took a stab at protecting the ARM MMU with a read-write lock. In 5.18 we made the switch to a read-write lock for KVM on ARM.
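To make the locking change concrete, here is a minimal sketch of the kind of read/write split involved, assuming kvm->mmu_lock is an rwlock_t as it is after the 5.18 change. The function names and bodies are illustrative, not the actual kvm/arm64 code:

```c
#include <linux/kvm_host.h>	/* struct kvm, kvm->mmu_lock */

/*
 * Illustrative only: stage-2 fault handling can run concurrently on many
 * vCPUs, so it takes the MMU lock shared.
 */
static void stage2_handle_fault(struct kvm *kvm, phys_addr_t ipa)
{
	read_lock(&kvm->mmu_lock);
	/* ... walk the stage-2 tables, install or adjust a PTE ... */
	read_unlock(&kvm->mmu_lock);
}

/*
 * Illustrative only: operations that tear down mappings still need the
 * lock exclusively.
 */
static void stage2_unmap_range(struct kvm *kvm, phys_addr_t ipa, size_t size)
{
	write_lock(&kvm->mmu_lock);
	/* ... unmap the range and invalidate TLBs ... */
	write_unlock(&kvm->mmu_lock);
}
```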
However, the only path we put on the read side was the write-unprotect path, which is where we take a write fault on a write-protected page, make it writable again, and mark the page dirty. That didn't do anything for the page-split path. Then, in a series that's on the mailing list right now, we moved the rest of the stage-2 abort handling to the read side as well, so that covers mapping memory and remapping, i.e. the page-split and page-collapse cases. That series looks rather similar to the TDP MMU on x86: we protect the page tables with RCU and only free them after an RCU grace period. Unfortunately, because of the architecture of KVM on ARM, the page table code is actually used twice, both in the kernel and in the nVHE hypervisor code at EL2, so a lot of care had to go into isolating the RCU references to the kernel side, because RCU is not available in nVHE.

With that, we re-ran our test workload and actually saw some signs of life. Where we originally had a 30-second degradation of guest performance, we were now seeing around three to six seconds. However, that massive cliff is still unacceptable. As a heads-up for these graphs, the purple side is the source machine and the green side is the target machine. So we were still seeing a lot of degradation at the beginning of dirty logging, and it wasn't entirely clear why. After poking around in the kernel and looking at some traces, it looked like we were spending an inordinate amount of time doing TLB invalidations. So we started picking away at this and trying to understand what was going on: there didn't appear to be any software locking, so what the heck was happening? It turns out we have to do these invalidations in the middle of a page split because of the break-before-make requirements of the ARM architecture. I'll go over that now, because it's a little different from x86.

Break-before-make: the ARM architecture is extremely prescriptive about how software can manipulate the page tables, unlike x86, which is a bit more free about what we can do. Under certain conditions, software must first invalidate a PTE, clearing it and making sure the cleared value is visible to hardware, before installing a new valid PTE. So in the case of a page split, we can't jump directly from a huge page to a table; we have to go through an intermediate step of zeroing the entry and invalidating the TLBs. This prevents TLB conflicts, because at no point in time are two different values for the PTE visible to hardware in the system. And as I said, it's required for huge page splitting.

So what does that sequence look like? Let's assume we took a fault on a 2M block, or huge page, and we're getting ready to split it down to 4K for dirty logging. The first step is to zero the PTE. Then we use a DSB, a serialization barrier, to ensure that after that instruction retires, the zeroed PTE is visible within the inner-shareable domain of the system. After that, we flush stage 2: we do an IPA-based TLB invalidation to flush any stage-2 entries that might exist in a TLB.
We also use a DSB here to make sure that has completed throughout the system before moving on to the next part, which is another annoying caveat: we also have to explicitly flush the combined stage-1 plus stage-2 TLB entries. Even though the stage-2 entries have been invalidated, it's possible they were already used to fill combined TLB entries, so if we resumed immediately after the first invalidation, we could still see TLB conflicts. Then, finally, we have to serialize the instruction pipeline and write the new value: we write in the table entry, install the pages, wire up the guest, and get it running again. (A rough code sketch of this whole sequence follows below.)

So what are the side effects we notice from break-before-make? The TLB invalidations are broadcast to the inner-shareable domain; that's just what we do in Linux whenever we manipulate the page tables, we don't use outer-shareable or system-wide TLB invalidations. And the DSB instruction is where we feel a lot of the pain: a DSB waits for all in-flight invalidations in a given shareability domain, so in the inner-shareable case we may be waiting on other vCPU threads that are doing TLB invalidations of their own. The observation is that on a fully loaded system, say a 48-vCPU VM where all 48 vCPUs are faulting, doing TLB invalidations, and finally resuming the guest, that sequence can take upwards of several milliseconds for a single vCPU to complete. The result is absolutely unacceptable vCPU latency.

So then the next question comes up: well, what happens if I just skip it? It's implementation-specific what could happen, and none of the possibilities are fun. The first case is that the hardware is polite and informs you that it found a TLB conflict by delivering a TLB conflict abort. In the base architecture it's not entirely clear whether that goes to EL1 or EL2, EL1 being the guest and EL2 being the host, and in this case there's not a whole lot we can do other than flush all the TLBs. More terrifying is what happens if the TLB returns one of the two valid mappings, or possibly an amalgamation of the two TLB entries. And if that's not enough to scare you, the ARM ARM goes a bit further and informs you that it's open season for all kinds of fun things to happen, such as breaking the architectural guarantees of coherency, single-copy atomicity, and the ordering rules of the memory model.

So if we can't get rid of these invalidations, we have to find a way to minimize the pain for the guest. The first step is pretty easy: we looked for ways to get rid of unnecessary broadcast TLB invalidations. One place where KVM was doing broadcast invalidations was when write-unprotecting, that is, relaxing the permissions on a PTE. A broadcast isn't necessary there; a local TLB flush is enough. We applied a patch for this and saw a pretty decent uplift. Beyond that, we had to find some way to live with the unavoidable TLB invalidation: for every huge page we will have to pay the cost of a break-before-make, we'd just like to pay it somewhere other than the vCPU fault path.
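Here is the rough code sketch of the break-before-make split sequence promised above. This is a simplified, kernel-style illustration: in real KVM/arm64 the invalidations are issued with the guest's VMID loaded via the hyp code, and the helper name and argument encoding here are approximations rather than the actual kvm_pgtable implementation:

```c
#include <linux/compiler.h>	/* WRITE_ONCE() */
#include <asm/barrier.h>	/* dsb(), isb() */
#include <asm/tlbflush.h>	/* __tlbi() */

/* Illustrative break-before-make for splitting a stage-2 block mapping. */
static void stage2_split_block(u64 *ptep, u64 new_table_pte, u64 ipa)
{
	/* 1. "Break": zero the block PTE... */
	WRITE_ONCE(*ptep, 0);
	/* ...and make the zeroed entry visible to all observers in the
	 * inner-shareable domain before invalidating. */
	dsb(ishst);

	/* 2. IPA-based invalidation of stage-2 entries for this address
	 * (encoding simplified; real code goes through a hyp helper). */
	__tlbi(ipas2e1is, ipa >> 12);
	dsb(ish);

	/* 3. Also flush combined stage-1+2 entries that may have been
	 * allocated from the old mapping. */
	__tlbi(vmalle1is);
	dsb(ish);
	isb();

	/* 4. "Make": install the new table entry pointing at 4K PTEs. */
	WRITE_ONCE(*ptep, new_table_pte);
}
```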
So a first implementation might look something like eager page splitting, like what we do on x86, where as soon as dirty logging is enabled on a memslot, we shatter all of the huge pages down to 4K granularity. However, that has a similar problem: once again we saturate the system with a bunch of TLB invalidations and effectively stall the guest until we're done. So we had to look for a way to spread the cost out, and what we did was take advantage of manual dirty log protection. We use the KVM_CLEAR_DIRTY_LOG ioctl to let userspace rate-limit the page splits: when userspace issues KVM_CLEAR_DIRTY_LOG, we only split that range of pages, and it's up to userspace to decide how much time to wait in between. (A rough userspace sketch of this appears just before the questions.)

We experimented quite a bit with that and came up with something that looked a lot more like this. With userspace throttling the clears, we were able to minimize the break-before-make overhead by spreading the TLB invalidations out over a period of time, and we see a much more gradual and steady degradation of guest performance: no huge cliffs, and certainly nowhere near as significant a drop.

So what's next? There are a few things coming in the ARM architecture that could be promising. Really, we would like hardware to fix this; it's something we don't want to be dealing with in software. On the system-integration side, an interconnect could implement TLB snoop filters to help complete TLB invalidations more quickly. On top of that, there are some useful extensions coming to the core architecture as well. TLBI range instructions are a feature of the later ARM architecture that allow software to invalidate a range of addresses instead of a single page or translation granule, so we can batch up TLB invalidations without doing a global flush. And then there's what's called level 2 break-before-make support: later revisions of the architecture relax some of the break-before-make requirements and allow software to do more without the invalidations in between. Most importantly for KVM, level 2 break-before-make means we don't have to do an invalidation in the middle of a huge page split or a table collapse. There is one snag, which is that software then has to deal with TLB conflict aborts, something we don't currently do in KVM, and when one happens our only real option is to do a global flush anyway. So there will need to be some care in exactly how we apply that, and in when we might still want to do TLB invalidations in between to reduce the likelihood of a TLB conflict abort.

With that, I'd like to give some acknowledgements to folks on my team; as I mentioned, this was a team effort. First and foremost, thank you to Ricardo Koller for implementing the eager page split mechanism, including the KVM_CLEAR_DIRTY_LOG-driven implementation. Thank you, David Matlack, who we'll be hearing from in the next talk, for the suggestion of using manual dirty log protection to rate-limit the page splits. Thank you, Marc Zyngier, for the patch we stole from you for non-shareable TLB invalidations; turns out it works, thank you. And thank you as well, Jing Zhang, also from Google, for upstreaming the parallel write-unprotect path.
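For reference, here is the userspace sketch of the KVM_CLEAR_DIRTY_LOG rate limiting mentioned above. It assumes the KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 capability has been enabled and that `bitmap` holds the dirty bitmap previously fetched with KVM_GET_DIRTY_LOG; the chunk size, delay, and helper name are illustrative, and error handling is omitted:

```c
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define CHUNK_PAGES	(64ULL * 512)	/* must stay a multiple of 64 pages */

/*
 * Clear (write-protect) the dirty log for one memslot in chunks, sleeping
 * between chunks so the kernel-side page splitting is spread out over time.
 */
static void clear_dirty_log_throttled(int vm_fd, __u32 slot,
				      __u64 slot_pages, __u64 *bitmap)
{
	for (__u64 first = 0; first < slot_pages; first += CHUNK_PAGES) {
		__u64 n = slot_pages - first;

		if (n > CHUNK_PAGES)
			n = CHUNK_PAGES;

		struct kvm_clear_dirty_log clear = {
			.slot		= slot,
			.num_pages	= n,
			.first_page	= first,
			/* Bit 0 of the bitmap corresponds to first_page. */
			.dirty_bitmap	= bitmap + first / 64,
		};

		/* KVM only write-protects (and, with eager splitting,
		 * shatters) this range of pages. */
		ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);

		/* Throttle: give vCPUs time to make progress. */
		usleep(10 * 1000);
	}
}
```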
And then with that, are there any questions? Marc, I think I'm going to go ahead and repeat the questions.

[Marc Zyngier asks, while the slides are navigated back to the future-work slide, about the TLBI range instructions: they would solve something at stage 2, but you still need a full invalidation for stage 1, so do you actually gain something there?]

Oh yeah, I'm sorry, yes. So Marc asked whether the TLBI range instructions actually give us anything to reduce the pain we see from TLB invalidations. I think the answer is yes, but it has to come from the combination of break-before-make level 2 and TLBI range: we can do a batch of operations and then issue a TLBI range in between, so we flush just the batch we did. Yes, we still have to flush the combined mappings, but there's a high likelihood we left behind some stage-2 entries in the TLB as well.

[Second question from Marc, partially inaudible: has this feedback gone back to Arm or to implementers, since it may be less that the architecture is doing the wrong thing and more that implementations aren't keeping up?]

So Marc's second question was whether any feedback has been going to Arm, or to implementers of the ARM architecture, with regard to this problem, and whether there are any considerations for fixing it. As far as I know, I would hope so, but I can't really speak definitively for what folks are saying; this is something we really should be noisy about in the hopes of getting some answers.

Yes. So the next question was: if we had something like Intel's PML, would that improve the situation? Again, I think so. The reason we have to do these TLB invalidations, at least for write-unprotection, is that we'd prefer not to take the fault at all, and there's still some cost associated with even a non-broadcast TLB invalidation. It doesn't solve the problem of break-before-make; that in particular has to come from better implementations. But yeah, we'll see, because again the problem comes down to integration, and hopefully with all three of these you see something that works, but who knows.

Yes. Right, so Will is asking if we've considered using VMIDs, with a separate VMID for the dirty-logging phase where we map at 4K and another one for the larger granularity. We hadn't considered that yet, but it certainly sounds like a good option to at least research.

Any other questions? So Dave is asking whether the architecture clarifies who gets the TLB conflict abort with level 2 break-before-make, and it is EL2, so KVM will get to handle the TLB conflict. Anything else? Awesome, thank you.