My name is David Matlack. I'm from Google, and I'm going to be talking about exploring an architecture-neutral MMU for KVM.

So first, what do I mean by memory management in KVM? Stepping back to about 10,000 feet: we've got userspace calling into KVM, configuring the guest address layout and how it's mapped to virtual addresses; that's basically memslots. That informs how KVM sets up page tables to actually map host memory into the guest. Meanwhile, you've got VCPUs running concurrently, accessing memory and faulting on memory, which has to be resolved in the page tables. And then there are MMU notifiers coming from the host to reflect changes in the host page tables into KVM, such as unmapping a page from KVM's page tables when it gets swapped out. So the thing to take away here is that core to managing memory in KVM is page table management, which is how memory actually gets mapped into a running virtual machine, and that there are a lot of inputs to it, and a lot of concurrent inputs at that.

So the KVM MMU is a very critical piece of KVM in cloud, and the most important property is the scalability of the KVM MMU. In cloud, we have large VMs with hundreds of VCPUs and terabytes of RAM; within Google Cloud specifically, we've got up to 400-VCPU VMs and 12 terabytes of RAM. We have a broad range of customers with many different workloads and performance sensitivities, and live migration is a critical part of our host maintenance strategy.

VMs in cloud primarily use, at least in Google Cloud, two-dimensional paging, and what I mean by that is a second stage of paging that translates guest physical addresses to host physical addresses. Within KVM x86, we call that TDP. KVM x86 can do shadow paging, but it's only really required on ancient CPUs. ARM, on the other hand, has always required TDP support, which is called stage 2 there. The one exception, which I'll talk a bit more about later, is nested virtualization, which does use some amount of shadow paging.

A lot of development has gone into making TDP in KVM x86 very scalable and performant for the cloud use cases I described earlier. I've got some years here for the recent developments, but a lot of these features date back to 2015, and some to 2012, in terms of how long we've been using them within Google Cloud. In 2020, we upstreamed an entirely new MMU in KVM x86 called the TDP MMU, focused on just the TDP paging use case, so it doesn't do any shadow paging. Within it, we added support for parallel fault handling, so VCPUs can fault and populate non-present entries in the page tables in parallel; handling of write-protection faults for dirty logging without taking the MMU lock, something that dates back to almost 2012 in the x86 shadow MMU; and support for eager page splitting, where at the beginning of dirty logging the page tables are split down to 4K entries eagerly in the background, rather than lazily at fault time. And this isn't the end of the road; more development is underway. We're working on NUMA-aware page table allocation, D-bit-based dirty logging, so not requiring any faults at all and just letting dirty bits be populated by hardware, and support for the multi-generational LRU, which is a new way of doing access tracking in the core MM that requires some integration with KVM.
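To make the eager page splitting idea above concrete, here is a minimal sketch of what the control flow could look like. Everything in it is illustrative: for_each_huge_pte(), alloc_child_table(), populate_child_table(), and install_table_pte() are hypothetical helpers, not KVM's actual API.

```c
/*
 * Minimal sketch of eager page splitting (hypothetical helpers, not
 * KVM's real API): before dirty logging starts, walk the range and
 * replace each huge-page PTE with a pointer to a child table that
 * maps the same range with 4K entries, so VCPUs never have to take
 * a fault just to get a huge page split.
 */
static void eager_split_range(struct kvm *kvm, gfn_t start, gfn_t end)
{
	struct tdp_iter iter;

	for_each_huge_pte(&iter, kvm, start, end) {
		u64 *child = alloc_child_table(kvm);

		/* Fill the child table with 4K mappings covering the
		 * same range, with the same permissions, as the huge
		 * page being split. */
		populate_child_table(child, iter.old_pte);

		/* Atomically replace the huge PTE with a table PTE.
		 * On x86 this can be done in place; ARM needs
		 * break-before-make, discussed later in the talk. */
		install_table_pte(&iter, child);
	}
}
```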
So as Oliver mentioned in the last talk, Google Cloud recently announced our first ARM-based VM offering, T2A VMs, which support up to 48 VCPUs with four gigs of RAM per VCPU, based on the Ampere Altra ARM processors. We faced a lot of challenges scaling the KVM ARM MMU for these VMs. Oliver talked about these in the last talk, but if you weren't here: there's interconnect scalability, in the actual hardware, to handle broadcast TLB invalidations and cache maintenance; the ARM architecture requires that certain page table changes use break-before-make, which makes them very expensive; and a few other issues as well. Many of the improvements to address these were ARM-specific; for example, one of them was using a local TLB flush instead of a broadcast TLB flush after resolving write-protection faults. But at the same time, we're adopting many of the same techniques we learned on the x86 side in the TDP MMU. For example, Oliver Upton from Google recently posted to the mailing list parallel fault handling that uses the same sort of reader-writer lock and RCU techniques that the TDP MMU uses to parallelize page table management, and eager page splitting to avoid page-splitting faults.

So if you look at the features we've got on x86 and what's upcoming, there's a lot of overlap between what we have on x86 and what we want to do on ARM: parallel fault handling and eager page splitting, already mentioned; lockless write-protection fault handling, which we have on x86, will probably be a useful improvement on ARM until ARM has something like PML; and the same goes for D-bit-based dirty logging and multi-gen LRU support. And if other architectures get traction in cloud and need to support live migration, RISC-V, for example, is a new architecture that's been getting a lot of attention, we expect to see a lot of the same bottlenecks that require very similar software solutions. So we may end up having to maintain N copies of these features. Are there ways we can share code instead?

The common theme among all these features is that they're all about page table management: specific modifications to the page tables, plus ways to synchronize those changes. Parallel fault handling is all about how we map a guest physical address to a host physical address at some level in the page table hierarchy with certain permissions. Eager page splitting takes a range of guest memory and splits all the huge pages down to a lower level, et cetera.

Okay, so to start, we can look at: what if we made the TDP MMU in x86 architecture-neutral and moved the code from the x86 directory into an architecture-neutral directory? When we do this, really the only thing we need to delegate to architecture-specific code is the low-level, architecture-specific details, like the layout of the page table entries and how to implement an actual TLB flush. Once we've done that, we can expose an API for common operations: for fault handling, mapping a guest address to a physical address at a given level; relaxing the write permissions for a given guest address; et cetera. And we can expose the TDP MMU iterator, which is the basic data structure that all the TDP MMU page table operations are implemented with, and make it available for architecture-specific page table operations. I've worked on this internally; the RFC patches will probably be posted in a couple of weeks, but it is possible.
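As a rough illustration, an architecture-neutral interface could be shaped something like the sketch below. The tdp_mmu_arch_ops split and all of the names here are my guesses for illustration purposes, not the actual RFC.

```c
/*
 * Hypothetical shape of an architecture-neutral TDP MMU interface.
 * Common code implements the operations generically and delegates
 * PTE layout and TLB invalidation to per-architecture callbacks.
 */
struct tdp_mmu_arch_ops {
	/* Construct a leaf PTE mapping pfn at level with permissions prot. */
	u64 (*make_leaf_pte)(kvm_pfn_t pfn, int level, unsigned int prot);
	/* Construct a non-leaf PTE pointing at a child page table. */
	u64 (*make_table_pte)(void *child_table);
	/* Invalidate TLB entries covering a GFN range. */
	void (*flush_tlb_range)(struct kvm *kvm, gfn_t start, gfn_t end);
};

/* Common operations the generic code could expose: */
int tdp_mmu_map(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn,
		int level, unsigned int prot);
void tdp_mmu_wrprot_range(struct kvm *kvm, gfn_t start, gfn_t end);
void tdp_mmu_split_huge_pages(struct kvm *kvm, gfn_t start, gfn_t end);
```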
And we go from having about 2,300 lines of code in the x86 directory to about 91% of that being able to move to the architecture-neutral directory, virt/kvm.

Okay, so we can make the x86 TDP MMU architecture-neutral, but can it actually support ARM, or is it just a bunch of architecture-neutral code that only gets used on x86? x86 and ARM both use TDP: they both use a second stage of translation for VM memory, and on both, that second stage of translation uses a page table data structure. The page tables are page-sized, page table entries are 64 bits, and each page table entry can point to a lower-level page table, a huge page, a page, or nothing. So it's a pretty standard page table data structure. And of course, page table entries control read, write, execute, and other attributes. But there are some major differences between x86 and ARM, so I'll go through the ones that are relevant for doing TDP paging, talk about how solvable they are, and what kind of changes we'd need to support them.

The first is the memory model. x86 uses a total-store-ordering memory model, whereas ARM is weakly ordered. This is very solvable; it's just a software problem. For one, PTE writes in the TDP MMU would need to use smp_store_release() instead of just WRITE_ONCE(); there's a sketch of this below. And there are probably some other minor changes to the low-level concurrency code in the TDP MMU that would need to use the appropriate barriers for ARM.

Next up: on x86, pages are always four kilobytes, but on ARM, pages can be four kilobytes, 16 kilobytes, or 64 kilobytes. Is this solvable? Yes. KVM ARM's stage 2 page size always follows the host PAGE_SIZE, so the TDP MMU, instead of assuming that page tables and guest pages are four kilobytes, would need to key everything off of the page size, for example when calculating how many PTEs there are per table. And when dealing with the different levels in the hierarchy, the TDP MMU today uses very 4K-focused page level names, like the 4K level, the 2M level, and the 1G level, so we'd need to abstract those out to more architecture-neutral names.

Okay, next: the root page tables can be slightly different on ARM. On x86, the root page table is always just one page, but on ARM you can concatenate pages together to avoid one level of lookup. This is useful for supporting larger physical addresses: instead of needing a fifth level of page tables that only uses the first 16 entries, you can have 16 concatenated level-four page tables, and the page table walker can avoid a level of lookup. This would require some changes to the TDP MMU. The root page table allocator in the TDP MMU would have to know how to allocate contiguous page tables depending on what the VM needs, and the TDP MMU iterator, which is responsible for actually walking through the page tables, would have to know that roots can be contiguous and how to walk through contiguous roots. I'll note that I thought this would only be required for performance parity with KVM ARM, not for correctness, but the comment from Mark is that it is in fact required for correctness: stage 2 really doesn't work if you don't support concatenated page tables.
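Here's the sketch I mentioned for the memory-model point: a minimal illustration of a PTE-publish helper. tdp_mmu_write_pte() is a hypothetical name, not KVM's actual code.

```c
/*
 * Hypothetical helper for publishing a PTE. On weakly ordered ARM,
 * release semantics ensure the contents of a newly initialized child
 * page table are visible before the PTE that points to it becomes
 * visible. On x86, total store ordering already orders the stores,
 * so a plain atomic store suffices.
 */
static void tdp_mmu_write_pte(u64 *ptep, u64 new_pte)
{
	if (IS_ENABLED(CONFIG_ARM64))
		smp_store_release(ptep, new_pte);
	else
		WRITE_ONCE(*ptep, new_pte);
}
```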
Okay, so next up is huge page splitting, which Oliver talked a bit about in the last talk. On x86, huge pages can be split in place, and what I mean by that is the huge-page PTE can be directly replaced with a PTE that points to a lower-level page table populated with the identical mapping. But on ARM, a break-before-make is required to split a huge page, which means the PTE must be marked invalid first, then flushed from all CPUs that are using it, and only then can it be replaced with a PTE pointing to a lower-level page table. The caveat here is that newer versions of the ARM architecture support break-before-make level 2, which relaxes this requirement. So in terms of supporting this in the TDP MMU, we could add break-before-make support to the eager page splitting path, keyed behind a static key that checks whether the break is actually required, or we could only use the TDP MMU on CPUs with break-before-make level 2. One thing I want to note about break-before-make level 2 is that it's not a free lunch: it can result in TLB conflict aborts, which do require additional software to handle, and may be expensive to handle, so this needs to be explored further. And break-before-make level 2 I don't think is even supported on the KVM ARM side yet.
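To make the break-before-make sequence concrete, here is a minimal sketch, again with hypothetical helper names (flush_tlb_gfn(), make_table_pte()), not KVM's actual code.

```c
/*
 * Hypothetical break-before-make split of a huge page on ARM.
 * Contrast with x86, where the huge PTE can simply be replaced
 * with the table PTE in a single atomic store.
 */
static void split_huge_page_bbm(struct kvm *kvm, u64 *ptep,
				gfn_t gfn, u64 *child_table)
{
	/* 1. Break: invalidate the PTE so no new TLB entries form. */
	WRITE_ONCE(*ptep, 0);

	/* 2. Flush the huge-page translation from all CPUs using it. */
	flush_tlb_gfn(kvm, gfn);	/* hypothetical broadcast TLBI */

	/* 3. Make: install the PTE pointing at the new child table. */
	smp_store_release(ptep, make_table_pte(child_table));
}
```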
All right, some other notable differences. ARM requires break-before-make for other PTE changes as well, but I went through the TDP MMU and I don't think any are relevant to the types of PTE changes the TDP MMU makes: most of the time when we're modifying PTEs, we just unmap the range and let VCPUs fault it back in to pick up whatever changes we need to make. Eager page splitting is the only case where we're actually changing PTEs in place in a way that requires break-before-make in the ARM architecture. ARM also requires cache maintenance operations after certain PTE changes, so we would have to audit the different PTE changes we're making in the TDP MMU and make sure there are appropriate hooks for ARM to actually do those cache maintenance operations; this needs to be explored further.

Next, ARM does not guarantee that permission faults evict, or avoid creating, TLB entries. On the x86 side, on Intel for example, if you get an EPT violation, you're guaranteed that if an entry in the TLB caused that fault, it's been evicted, so you don't have to do a manual flush; you can just repair the PTE and resume the VM. But on the ARM side there are cases, for example taking a write-protection fault during dirty logging, where you actually have to do a TLB flush, so we'd have to make sure we do that in the TDP MMU.
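A sketch of how that difference might look in the write-protection fault path. PTE_WRITABLE and flush_tlb_gfn() are illustrative, and a real lockless path would use a compare-and-exchange rather than a plain store.

```c
/*
 * Hypothetical write-protection fault handler. On x86, hardware
 * guarantees the faulting TLB entry was evicted, so fixing the PTE
 * is enough. On ARM, a stale read-only TLB entry can survive the
 * permission fault, so the PTE fix must be followed by a TLB flush.
 */
static int handle_wrprot_fault(struct kvm_vcpu *vcpu, gfn_t gfn, u64 *ptep)
{
	WRITE_ONCE(*ptep, READ_ONCE(*ptep) | PTE_WRITABLE);

	if (IS_ENABLED(CONFIG_ARM64))
		flush_tlb_gfn(vcpu->kvm, gfn);	/* hypothetical helper */

	return 0;	/* resume the guest */
}
```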
The last difference is contiguous PTEs. ARM allows creating essentially intermediate huge page sizes, where up to 16 contiguous PTEs can be combined together into one huge page, kind of an intermediate huge page. Linux supports contiguous PTEs in the core MM, but KVM ARM doesn't use contiguous PTEs for stage 2 today, so for parity, no contiguous-PTE support is needed to support KVM ARM in the TDP MMU. But this might need to be revisited in the future: if 16-kilobyte or 64-kilobyte granules gain traction for cloud and virtualization, then contiguous PTEs become very useful, because they provide much more useful huge page sizes, like 2-meg and 1-gig huge pages. So that might be the point where we have to actually support contiguous PTEs.

Okay, the next thing I want to bring up is pKVM. This isn't an architectural difference, but it matters for software integration if we actually had this architecture-neutral MMU. The TDP MMU as I've described it is not compatible with pKVM today, the reason being that the TDP MMU calls out into Linux to use RCU, do rescheduling, do locking, and allocate memory, but pKVM does the stage 2 page table management outside of KVM, in this thing called the hyp, which runs at a separate exception level and doesn't have access to Linux routines. So we wouldn't be able to just build the TDP MMU into pKVM, into the hyp, and use it there. But that's not to say the TDP MMU couldn't evolve to support pKVM. We could do the same sort of refactor that was done on the KVM ARM side to split out the pure page table manipulation code from the higher-level operations like the locking and allocation. RCU is a little bit trickier to support: RCU is used in the TDP MMU for page table freeing, but it's possible to come up with a solution that doesn't actually depend on RCU. And this could be an opportunity to deploy a pKVM-like solution to other architectures in a common way, because the same idea of pulling the stage 2 page table management out of KVM, and making guest memory inaccessible to KVM and Linux, could be done on x86 as well. Alternatively, pKVM could continue using the KVM ARM stage 2 code; Android and cloud are very different use cases. But that would be an increase in test and maintenance complexity, since we'd be managing two different stage 2 page table managers.

All right, nested virtualization. As I mentioned earlier, it uses shadow paging, and the TDP MMU does not do shadow paging. On the x86 side, we have our legacy MMU that does support shadow paging, and that's what we use for nested. So if we adopted the TDP MMU for KVM ARM, we'd need to do the same: implement a separate shadow paging infrastructure. Could we do architecture-neutral shadow paging for nested virtualization? That would be pretty difficult, because shadow paging is inherently architecture-specific: you're shadowing whatever the guest is doing, and the guest could be using any architectural feature, whereas for everything I've been talking about in this talk, we can pick and choose which features we want to use within KVM. That being said, paravirtualization could be a path towards architecture-neutral nested support: with some sort of paravirtual interface for doing nested virtualization, we could build into it some architecture-neutral memory management for L2 VMs. But I do want to note that the TDP MMU does interoperate with shadow paging; this is what we do on x86. We use the TDP MMU whenever L1 is running, and whenever L2 is running, it uses the shadow paging code, and there's some interoperation between the two, because, for example, when L1 is running we want to write-protect the hypervisor's page tables.

All right, so in conclusion: about 90% of the existing TDP MMU can be made architecture-neutral. Using the TDP MMU for ARM stage 2 is feasible, but would require some changes to the TDP MMU and comes with some caveats. As I mentioned, pKVM would not be supported initially, and for CPUs without break-before-make level 2, we'd either not support them or have to add break-before-make support, which honestly wouldn't be too bad; it's just a code complexity trade-off. I'll be sharing the RFC patches to refactor the TDP MMU out pretty soon, probably within the next month, and I'd be curious to hear whether any other architectures would be interested in this and what other changes would be needed, and then also, on the ARM side, what we can do about the pKVM caveat and what the feedback on that is. So that's all I had. Any questions? I think there's a mic, so we'll pass that around.
You mentioned that CMOs were needed for PTE updates. I'm quite surprised, because the page table walker is coherent with the CPU; that's one of the many requirements. I'm not sure what CMOs you're referring to.

Okay, yeah, that might have just been a misreading of the ARM ARM on my side, if we don't need CMOs for page table changes.

No, no, we need CMOs for the data that is pointed to, basically the actual page of data, in some cases, but not for the PTEs.

Not for the PTEs, yeah, sorry, that might have just been a typo then. So we need CMOs in certain situations.

For example, if you're unmapping a page, you don't know whether the guest has mapped it cacheable or not cacheable, so you don't know at what level of the cache hierarchy the data is, and if your interconnect is not that great, or the device you'll be using to swap the page out is not coherent, you need to flush the data to the point of coherency so that the page can be truly reused. But that's the data, not the PTEs.

Okay, yeah, thanks.

I had basically the same question, but on the other comment, about maintaining pKVM separately: I strongly recommend not doing that.

Okay. A quick question: does the TDP MMU place any restrictions on the input address size and the output address size for the page tables? Like, does it require them to be the same as the host kernel's stage 1, for example?

Offhand, I'm not sure; I'm not aware of any.

Because you listed things like page size and stuff, but we allow the IPA size to be specified for the guest, so we would still need that flexibility with the TDP MMU if we were to use it.

Okay, cool. Any more questions?

So for the pKVM thing, would it make sense to have an architecture-neutral, kind of dumbed-down TDP MMU that we could also use on x86, just for testing, so that at least it doesn't break? So we'd have two MMUs, but at least they'd both be architecture-neutral, which is better than having, like... We'd reuse the same hooks, just with two different common code implementations, like Tree RCU and Tiny RCU. Maybe that could be a way to get it done for pKVM as well.

Yeah, it would be nice if we could... perhaps we could instantiate a TDP MMU and pass in flags to say whether or not you can use these features, and you get a different flavor.

I think one of the things that pKVM uses quite heavily, which I think normal KVM does not, is that we use the page table as a source of tracking page ownership. So we have a lot of invalid PTEs encoding lots of data, and they might become valid, and then we have to use software bits to track transitions between pages. So we would need some way to get at those bits as well, in the TDP MMU.

Yeah. So the way I've been thinking about it is that the architecture code would own the PTEs, as Paolo said. When you're instantiating a new mapping, the architecture code would construct the entire PTE and put in whatever it wants. All the TDP MMU would do is take care of doing the walk down to the right table, allocating along the way, and then making sure that when you actually want to do the PTE modification, it's done safely: it uses a compare-exchange, and it does the appropriate walking and everything.
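As a tiny sketch of that division of labor, the common walker could install an architecture-constructed PTE with a compare-and-exchange, retrying the walk if it loses a race. Again, the names here are illustrative, not KVM's actual code.

```c
/*
 * Hypothetical: architecture code builds new_pte; the common TDP MMU
 * installs it atomically. A failed cmpxchg means another CPU changed
 * the entry first, and the caller re-walks and retries.
 */
static bool tdp_mmu_set_pte_atomic(u64 *ptep, u64 old_pte, u64 new_pte)
{
	return cmpxchg64(ptep, old_pte, new_pte) == old_pte;
}
```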
Chris?

Have you looked at whether this is beneficial, or worse even, for things like TDX or SEV?

Yeah, so for TDX, the direction they're going is to keep using the TDP MMU as is. So there end up being two parallel sets of tables: KVM manages the TDP page tables like it does today, but anytime it updates a PTE, it makes a corresponding call into the TDX module to update the actual page tables that are used. So the changes to the TDP MMU are pretty minor. For SEV, I don't believe there are many changes needed in the actual TDP MMU. Paolo, do you know?

Yeah.

Yeah. But actually, the TDX approach is kind of interesting; I was curious if something like that could work for pKVM or...

There was a hand up in the back before that question; is that person still here? Question? I think they might have walked out.

Thank you, thank you. Okay, thank you.