Hi everyone. I'm Michael Roth. I'm from AMD and I'm here to provide a development status update on SEV-SNP support for Linux. So, a quick overview of SEV-SNP and the features that it adds. This was introduced with the Zen 3 CPU architecture. Previously, before SEV-SNP, we had SEV, Secure Encrypted Virtualization, which provided guest data confidentiality through encrypted guest memory. And we also had SEV-ES, which extended guest data confidentiality by encrypting the guest CPU register state. With SEV-SNP, we build on SEV and SEV-ES to also provide guest data integrity, and that's through the Secure Nested Paging feature, which is the namesake of SEV-SNP. But there are also some additional security features that fall under the SNP umbrella. These mostly relate to control flow security, so things like CPUID security. When you issue CPUID instructions in the guest, those are generally emulated by the hypervisor, and in some cases guest code might rely on well-defined behavior based on how hardware implements particular CPUID leaves. In an emulated context, some of those assumptions could potentially be broken, so there are features to guard against that in the case of CPUID, and similarly with interrupt security and Secure TSC. So here's a quick overview of where we're at with the overall upstreaming status of these features. One big milestone that we recently hit is that the guest SNP support, for both the main Secure Nested Paging feature as well as the CPUID security, is upstream as of kernel 5.19. And the CPUID security feature doesn't rely on any hypervisor support, so technically that's done as well. But the core focus at the moment is on the Secure Nested Paging hypervisor support. The latest patch set was version 6, and we'll be following that up with version 7 fairly soon. Most of this talk will center around where we're at with implementing that, and some of the recent developments upstream related to getting that code merged. Before we get into how Secure Nested Paging works, here's a quick overview of how regular nested paging works. With regular nested paging, we have a guest page table that maps a guest virtual address to a guest physical address, and a nested page table, which then takes that guest physical address and maps it to a host physical address. In the case of SEV and SEV-ES, there's a C-bit that can be set in the guest page table entry to indicate whether that guest physical address should be treated as a private page or a shared page. In the case of private pages, we use encryption to provide data confidentiality for accesses to that page. So whenever the guest writes to a GPA with the C-bit set, the memory controller will encrypt that data using the guest's encryption key, and the same happens when the guest reads from that encrypted GPA. So the mechanism for controlling whether a page is shared or private lives inside the guest page table; as far as the host is concerned, the nested page table and accesses to host memory look pretty much like they do for regular guest memory, except that if the host tries to write to a private page, it just ends up writing garbage: when the guest reads it, it will try to decrypt it using its encryption key, and since the data wasn't originally written with the guest's encryption key, the guest ends up reading garbage.
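To make that shared-versus-private distinction concrete from the guest's side, here's a minimal sketch of a SEV guest making a page host-visible by clearing the C-bit in its page tables. It uses the existing x86 kernel helpers (get_zeroed_page(), set_memory_decrypted()); the surrounding function is purely illustrative, not code from the patches discussed in this talk.

```c
#include <linux/gfp.h>
#include <asm/set_memory.h>

/*
 * Illustrative sketch: allocate a page and make it shared with the host
 * by clearing the C-bit in the guest page tables. set_memory_decrypted()
 * is the existing x86 helper SEV guests use for this; it also flushes so
 * stale encrypted contents aren't misread by the host.
 */
static void *alloc_shared_page(void)
{
	unsigned long vaddr = get_zeroed_page(GFP_KERNEL);

	if (!vaddr)
		return NULL;

	if (set_memory_decrypted(vaddr, 1)) {
		free_page(vaddr);
		return NULL;
	}

	return (void *)vaddr;	/* the host can now read/write this page */
}
```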
So that provides data confidentiality, because you can't get the contents of those guest pages due to encryption, but you're still susceptible to things like remap and replay attacks, because the host still has full control over the nested page table and host memory. It can do things like save off an old copy of a particular guest page, and even though it's encrypted, based on where in the boot process that page was saved, an attacker might have a good idea of what's supposed to be in it. If the attacker then writes that page back into the guest by writing directly to host memory later, he could potentially manipulate the guest that way, or he could just straight up write into host memory to corrupt the guest data, and the guest won't be aware. In reality there are some mitigations in place that guard against most of the more common attacks of that sort, but with enough experimentation you could potentially get the guest to do something it's not supposed to. So that's where Secure Nested Paging comes into play. With Secure Nested Paging, we still have the guest page table that maps the GVA to the GPA and the nested page table that maps the GPA to an HPA, but now there's also a reverse map table (RMP), which maps the host physical address back to the guest physical address and also carries some additional metadata that's used to enforce integrity checks on accesses to that page in host physical memory. To get a better idea of how this works, let's take a closer look at the data that's actually in those RMP entries. For instance, in this case guest A has a private page at GPA 0x2000, and the nested page table maps that guest physical address to host physical address 0x7000 (I just realized my cursor doesn't show up there). Host physical address 0x7000 then has a corresponding entry in the RMP table with some additional metadata that the hardware uses to enforce certain restrictions on accesses to that page. Here you have the Assigned bit set, which tells the hardware that host physical address 0x7000 is assigned to a guest, with ASID 2 in this case, and that the guest physical address it should be mapped to by the nested page table is 0x2000. One important detail here is the Validated bit. When the host sets up the RMP entries for the guest, the host can use the RMPUPDATE instruction, which can manipulate pretty much all of the fields in an RMP entry, but the real security comes from the fact that it can't set the Validated bit. In order for the Validated bit to be set, the guest, prior to using the page, has to issue a PVALIDATE instruction, and when it issues that PVALIDATE instruction, that's when the Validated bit gets set. From that point forward, if the host tries to manipulate the RMP table so that it can manipulate the guest data, the Validated bit will be unset, and if the guest tries to access that address later on, it'll get a #VC (VMM Communication) exception, and generally the guest will terminate itself in that case. So let's take a closer look at what's involved in implementing this Secure Nested Paging support in KVM. One thing we need to do is set up the memory that the hardware will use to store those RMP entries.
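To make the guest's side of that Validated-bit handshake concrete, here's a minimal sketch of a PVALIDATE wrapper, modeled loosely on the helper in the upstream Linux SEV guest code; the return-value handling is simplified (the real helper also checks the carry flag to see whether the RMP entry actually changed).

```c
#include <stdbool.h>

/* Documented PVALIDATE failure code for a 2MB/4KB size mismatch
 * (this case comes up again in the nested-page-fault discussion later). */
#define PVALIDATE_FAIL_SIZEMISMATCH	6

/*
 * Ask the hardware to set (or clear) the Validated bit in the RMP entry
 * backing this guest virtual address. 'huge' selects 2MB vs 4KB page size.
 * Sketch only: error handling is reduced to returning the raw result.
 */
static int pvalidate(unsigned long vaddr, bool huge, bool validate)
{
	int rc;

	/* PVALIDATE is encoded as F2 0F 01 FF; inputs in RAX/RCX/RDX. */
	asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFF"
		     : "=a" (rc)
		     : "a" (vaddr), "c" (huge), "d" (validate)
		     : "memory", "cc");

	return rc;	/* 0 on success, e.g. PVALIDATE_FAIL_SIZEMISMATCH */
}
```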
Another thing is that when the guest is created and set up, we need to make sure that its pages are pinned. SEV and SEV-ES actually have a similar requirement in this regard, because the encryption algorithm that's used to encrypt those pages is tied to the physical address that backs them. So we already have a KVM ioctl (KVM_MEMORY_ENCRYPT_REG_REGION) that's used in those cases to register the pages that might be used for encrypted guest memory and to pin them. In the case of SNP, we reuse that KVM ioctl, but we also need to add some additional handling to make sure that we update the RMP entries to correspond to the initial state of guest memory. When you initially boot a guest, you'll have the initial guest image, like the OVMF ROM, and that will be stored in guest memory as encrypted pages, so we need to make sure we update the RMP entries to reflect that before booting the guest. Then, after booting, the guest can issue page state changes in the form of GHCB requests to change the state of a page from shared to private or private to shared, so we need handling in KVM to deal with that as well, as sketched below. Another thing we need to deal with is RMP fault handling. That's the mechanism that's used to enforce the restrictions placed on host memory accesses. Those can surface in the form of a host page fault, in the case of host threads that are trying to write to a particular host physical address. There's also the analog for when a guest tries to access guest physical addresses: if those accesses don't align with the state of the corresponding RMP entry, they'll result in nested page faults, which we also need to add handling for. So let's take a closer look at the host page fault handling. In this particular case, we have thread A, a vCPU thread that's running in guest mode. We have the guest page table that maps a 4-kilobyte GPA as a shared page, and the nested page table maps that to address 0x602000 in host memory. We also have, in this case, thread B, which could be a user space thread like the VMM process itself that's running the guest, or it could be a kernel thread. In this case, that thread has a huge page mapping that overlaps with that same address, 0x602000. Here everything's fine, because the page in the guest page table doesn't have the C-bit set; it's just a shared page, so it's treated like any other host page. There's no issue here, but if you switch the page to a private page, then you'll have problems. In the case of thread B being a user space thread, because that huge page mapping overlaps with the private page, if you try to write to any page within that 2-megabyte range, even if you're not touching that particular private page, you'll still trigger an RMP fault, because as far as the hardware is concerned, you're writing to a 2-megabyte region which contains a private guest page. In the case of user space, that's fairly easy to deal with: we just split the mapping in the user space page table, and at that point there's no overlap and the process can continue executing. But in the case of a kernel access, for instance, if the kernel has a completely unrelated thread that's trying to access some subpage in that range using the kernel direct map, because the direct map uses 2-megabyte entries by default, you'll get a host page fault for that kernel access.
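As a rough sketch of the page-state-change handling mentioned above, here's what processing a single entry from a GHCB request might look like in KVM. The rmp_make_private()/rmp_make_shared() names follow helpers proposed in the SNP host patches (thin wrappers around RMPUPDATE), but the surrounding flow here is illustrative, not the exact patch-set code.

```c
/* Page state change operations defined by the GHCB specification. */
#define SNP_PAGE_STATE_PRIVATE	1
#define SNP_PAGE_STATE_SHARED	2

/*
 * Illustrative sketch: convert one guest page between shared and private
 * in response to a GHCB page-state-change request.
 */
static int snp_handle_psc_entry(struct kvm *kvm, gpa_t gpa, int op, int asid)
{
	kvm_pfn_t pfn;

	/* Resolve the GPA the guest wants to convert to a host PFN. */
	pfn = gfn_to_pfn(kvm, gpa_to_gfn(gpa));
	if (is_error_pfn(pfn))
		return -EINVAL;

	if (op == SNP_PAGE_STATE_PRIVATE)
		/* RMPUPDATE: assign this PFN to the guest's ASID at this GPA. */
		return rmp_make_private(pfn, gpa, PG_LEVEL_4K, asid, false);

	/* RMPUPDATE: hand the page back to the hypervisor/shared state. */
	return rmp_make_shared(pfn, PG_LEVEL_4K);
}
```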
And in that case, it's basically a host bug. There's no way to dynamically split the direct map the way we do in the user space case, but we do have interfaces to split the direct map in advance. So that's how things are currently implemented, or at least how things will be implemented in version 7. We used a similar approach in version 6 of the SNP hypervisor patches, but instead of directly splitting the direct map using set_memory_4k(), we actually unmapped the private page from the direct map, and that ends up splitting the direct map in this particular case, because if you remove one entry from the direct map, you have to split it in order to keep all the other entries you didn't remove mapped. So in either case, we basically end up splitting the direct map to avoid this scenario. Another situation we need to deal with is a normal 4K write: if you try to write to that particular private page, you'll get an RMP fault, because the hardware enforces that the host can't arbitrarily write garbage to private guest memory. In the case of user space, if this happens, like the VMM tries to write to a page that the guest has flipped to a private state, we'll basically just signal the process to terminate using SIGBUS. Currently, in version 6 of the hypervisor patches, if the kernel tries to do the same thing, we'll basically just crash the host, because that's considered a host bug, and it's better to crash the host than to silently let it continue. But in version 7 we'll likely be changing that behavior, because there's a situation where maybe the kernel thread that's trying to access that page is accessing it for something like kvmclock or a virtio buffer, where the host thinks it's supposed to be shared, but the guest maliciously switches it to a private page so that the next time the host accesses it, it generates that page fault. We don't want to crash the host in that case. So for version 7 we're exploring an approach where we just flip the page back to shared if the host thinks it's supposed to be a shared page. If we do things that way and a malicious or buggy guest ends up breaking because of it, that's okay, because a guest isn't supposed to be switching pages to private behind the host's back. That's undefined behavior, and it's okay if the guest breaks in that case. But we do need to watch out for the case where the host does this in error: it flips the page to shared because it thinks it's supposed to be shared, but it's actually a bug in the host. If that happens, we don't want to silently corrupt the guest's memory. Because of the way SNP handles things, if we flip the page to shared and the guest is expecting it to be private, that will unset the Validated bit in the RMP table, so the next time the guest tries to access that page, it'll see that the Validated bit was unset, it'll get the #VC exception, and it can terminate itself. So that's how we handle things for host page faults. For nested page faults, there are some similar checks in place, but things are a little bit different. In this situation we have thread A, which has a 2-megabyte mapping for GPAs 0x200000 through 0x400000, and the nested page table in turn maps that as a huge page to host physical addresses 0x600000 through 0x800000.
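For the "split in advance" interface mentioned above, the basic idea can be sketched with the existing x86 set_memory_4k() helper; the exact call site and surrounding function here are illustrative, not the actual patch code.

```c
#include <linux/mm.h>
#include <asm/set_memory.h>

/*
 * Sketch only: before a host page becomes guest-private, make sure the
 * kernel direct map covers it with 4K entries, so an unrelated kernel
 * access to a neighboring page in the same 2MB region can't trip an
 * RMP fault through a shared huge-page mapping.
 */
static int split_direct_map_for_private(unsigned long pfn, int npages)
{
	unsigned long vaddr = (unsigned long)pfn_to_kaddr(pfn);

	/* x86 helper that rewrites the direct map range as 4K pages. */
	return set_memory_4k(vaddr, npages);
}
```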
Then, in the RMP table, for the RMP entry corresponding to 0x600000, we set a bit that tells the hardware that the data in that RMP entry applies to that entire 2-megabyte range. That helps speed up the RMP lookups for the optimal case. But in order for this to work, when the guest validates that GPA range, it needs to issue the PVALIDATE instruction with the huge page size set, to tell the hardware that it does have it mapped as a 2-megabyte page. If it does, everything works fine. But if the guest tries to validate that GPA range as anything other than a huge page, for instance as a 4K page, maybe because it's just not an optimized guest, or because it actually has that GPA mapped as a 4K page in its guest page table instead of a 2-meg page, then we'll get a nested page fault. To handle that, we need to split the nested page table mapping so that it no longer maps that GPA range as a huge page and instead maps it as individual 4K pages, and we also need to issue a PSMASH instruction to split that 2-megabyte RMP entry into individual 4K RMP entries, at which point the guest can retry the PVALIDATE instruction, and in this case it'll succeed. There's another set of checks here involving the C-bit. The guest can do an implicit page state change: instead of issuing a GHCB request to tell the host that it wants a particular page to be shared or private, it might just update the C-bit in its page table. If the state of the C-bit in the guest page table doesn't match the state of the page in the RMP table, that will also result in a nested page fault. In that case, the nested page fault will have the RMP bit set, as well as a bit to indicate what type of access the guest was trying to make, and to deal with that we have some handling to update the entry in the RMP table to match what the guest is expecting. So that's how things are implemented for the Secure Nested Paging support in the current version of the SNP hypervisor patches, and the handling there hasn't changed much since version 5. There are some changes that we're looking at for version 7, but things are mostly stable in how we have them implemented. But there's a new development upstream that we've been looking at, and that's called UPM, Unmapped Private Memory. UPM basically refers to some proposed kernel infrastructure that's used to back confidential guests with pages that can't be accessed by user space. The initial implementation of that UPM support is Chao Peng's private memslot patch set, so when I refer to UPM, I'm basically referring to the changes introduced by that patch set. UPM has been proposed by a number of different people for a number of different reasons, but as I understand things, the main driver for UPM is Intel TDX. In the case where user space tries to write to a private guest page, in our case we just kill the guest: we get a page fault, and we signal the process to terminate. But in the case of Intel TDX, that results in a machine check, which would crash the host. So it's important in that case to have some infrastructure in place so that user space can't write to guest private memory.
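Here's a rough sketch of the size-mismatch handling described above. The psmash() name follows the helper in the SNP host patches (a wrapper around the PSMASH instruction) and kvm_zap_gfn_range() is an existing KVM MMU helper, but the overall flow is an illustrative simplification, not the exact fault-handler code.

```c
/*
 * Illustrative sketch: handle an RMP-related nested page fault caused by
 * the guest PVALIDATE'ing at 4KB while the RMP entry and NPT mapping
 * cover the range as 2MB.
 */
static int handle_rmp_size_mismatch(struct kvm *kvm, gpa_t gpa, kvm_pfn_t pfn)
{
	int ret;

	/*
	 * 1. Split the 2MB RMP entry covering this PFN into 512 4KB entries.
	 *    PSMASH operates on the 2MB-aligned PFN.
	 */
	ret = psmash(pfn & ~(PTRS_PER_PMD - 1));
	if (ret)
		return ret;

	/*
	 * 2. Split the corresponding 2MB NPT mapping: zap the huge SPTE and
	 *    let the fault path rebuild the mapping at 4KB granularity.
	 */
	kvm_zap_gfn_range(kvm, gpa_to_gfn(gpa), gpa_to_gfn(gpa) + 1);

	/* 3. Resume the guest; its retried PVALIDATE at 4KB can now succeed. */
	return 0;
}
```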
So we've been looking at how to leverage that for SNP, and there's also, I think, a prototype that was recently implemented for pKVM that makes use of it as well. So here's a quick overview of what the flow of things looks like with UPM. With UPM, you have this new private memslot structure. Before getting into that: in the case of a normal memslot, if you get a nested page fault for a guest and you need to do a GPA-to-HPA lookup to figure out what host physical address to map into the nested page table to handle that fault, you take the GPA, use it to index into the memslot to figure out what HVA is supposed to back that GPA, and then you walk the VMM process's page table to get the host physical address it corresponds to, and that's what you program into the nested page table. That happens the same way both for shared pages, like 0x2000 in this case, and for private pages, like 0x3000; the handling is the same for either. With the private memslot structure, for 0x2000, the shared page, the handling is still the same, but for 0x3000 there's now an xarray that's used by the KVM MMU to check whether the GPA is supposed to be mapped as a normal page or a private page. In the case of a private page, when we look up the corresponding host physical address, we use the GPA to get an offset into this special memfd backend, and that memfd backend can then be used to get the host physical address to program into the nested page table. That special memfd backend has safeguards on it so that user space processes can't read or write those pages, and they also can't use mmap to map them into the process's address space. I mentioned implicit conversions, implicit page state changes, earlier; this is what they look like when we implement them with UPM. In this case, you have GPA 0x3000, which in the xarray was originally tracked as a shared page, but the guest has now flipped the C-bit on so that it's a private page. When that happens, you get a nested page fault, and the KVM MMU, when it gets that page fault, will look in the xarray and see that the current state of GPA 0x3000 is shared. In response to that, there's now a new KVM exit for memory faults, KVM_EXIT_MEMORY_FAULT, which exits to user space, to the VMM. The VMM, in response, will make sure there's a page allocated in the memfd, using an fallocate() syscall, and then the VMM will issue the KVM memory-encrypt register-region ioctl to tell KVM that that GPA is now backed by the private memfd. At that point, if you restart the guest, the next time it generates a page fault the KVM MMU will be able to map it to 0x8000 without having to exit to user space. We also have the case of explicit conversions, which in the case of SNP come in the form of GHCB requests. There, we don't need the KVM MMU to determine whether it needs to call out to user space, because the guest is specifically telling us that it wants that page to be treated as a private page. Normally, in the current SNP patches, that request is handled completely in the kernel, but with UPM we have a new KVM exit, KVM_EXIT_VMGEXIT, which basically forwards that GHCB request to user space, and then user space, in response to getting that, basically does the same thing it does for an implicit conversion. (Audience question, mostly inaudible.)
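Here's a userspace-side sketch of that implicit-conversion flow. The exit handling, ioctl number, and struct layout are stand-ins following the flow described in the talk, not a finalized kernel ABI; only fallocate() and ioctl() themselves are real interfaces here.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/kvm.h>

/* Hypothetical ioctl and argument layout: placeholders for the
 * "register this GPA range as private-memfd-backed" call described
 * above, not an actual UAPI definition. */
struct private_region {
	__u64 gpa;
	__u64 size;
	__u64 private_fd_offset;
};
#define KVM_REGISTER_PRIVATE_REGION _IOW(KVMIO, 0xff, struct private_region)

/*
 * VMM-side sketch of handling a shared->private conversion after the
 * KVM MMU exited to user space because the xarray said "shared" but the
 * guest accessed the GPA as private.
 */
static int handle_private_conversion(int vm_fd, int private_memfd,
				     __u64 gpa, __u64 size, __u64 fd_offset)
{
	struct private_region region = {
		.gpa = gpa,
		.size = size,
		.private_fd_offset = fd_offset,
	};

	/* Make sure backing pages exist in the private memfd. */
	if (fallocate(private_memfd, 0, fd_offset, size) < 0)
		return -1;

	/* Tell KVM the GPA range is now backed by the private memfd,
	 * then re-enter the guest so the fault can be retried. */
	return ioctl(vm_fd, KVM_REGISTER_PRIVATE_REGION, &region);
}
```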
So yeah, the question was: is there any reason not to reuse the KVM_EXIT_MEMORY_FAULT exit instead of introducing this new KVM_EXIT_VMGEXIT? And yeah, that's a good question, and it's something we've been looking at. The main reason not to do that currently is that GHCB requests can do batching, so you could have over 200 individual page state change requests in the GHCB buffer, and by batching them we get a little bit better performance. But yes, something I brought up in the LPC microconference, where we did a talk on UPM, was potentially adding support for batching to the KVM_EXIT_MEMORY_FAULT exit, so that we could basically take that GHCB request and package it up into this memory fault exit using something like a scatter-gather list. If we do things that way, then maybe there could be some common handling between SNP and TDX if we both end up taking that approach. So yeah, that's something to consider there. So that's what the flow of things looks like with UPM. There are a lot of different pros and cons to consider. Most of the cons are in the user space implementation of this, because now you have the normal memory allocations and then you also have this private FD, and if you're not careful to deallocate pages from the private memfd when you're flipping from private to shared, or to deallocate from the normal memory when you're flipping from shared to private, then you could eventually end up using twice as much memory. The flip side of that is that if you do that for every single conversion, there are some cases where allocating and deallocating frequently, because pages are getting flipped frequently, can cause performance issues. There's one case in OVMF where, to handle bounce buffering, it will flip a page to shared, do the DMA, and then flip it back to private, and it will do that hundreds of times, and all of those allocations and deallocations can hurt performance pretty dramatically. So there may be some balance you want to strike on the user space side, where you don't necessarily deallocate and reallocate on demand; maybe you have some threshold where, after a certain point, you decide it's time to discard pages from one backend that are no longer being used. [Moderator: Thank you, with that we're out of time.] Well, that's what we're looking at currently with UPM. In version 7 we won't be utilizing UPM yet, but we're working with the community to see if we can make that workable; otherwise we're happy to stick with the non-UPM solution as well. Sorry for going over time on that; we should have kept better track. If anybody has any questions, feel free to grab me after. [Moderator: If anybody doesn't have a better question, would you give a one-minute summary of the UPM discussions at Plumbers, so that they also get recorded for people who haven't seen the Plumbers sessions yet?] Yeah, so the summary is basically that what we discussed there wasn't necessarily the UPM implementation as it exists today, but we did mention the batching support for the KVM memory fault exit, and I think some folks from Google were interested in that solution, so that we don't have multiple different implementations that need to be done in user space. But there were a couple of questions around things like how to deal with the kernel direct map.
I mentioned that we currently split it for SNP, and I think for TDX there are similar requirements where they may need to split the direct map, so maybe that's something that UPM could potentially address as well. And I think those are the main ones.
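On the earlier point about double allocation with UPM: one way the user space side could release the backend a page is leaving, sketched under the assumption that the private backend behaves like a regular memfd that supports fallocate(), is to punch a hole on private-to-shared conversions and drop the anonymous pages on shared-to-private conversions. The function names are illustrative; fallocate() and madvise() are the real interfaces.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>

/*
 * Sketch: when a page converts private -> shared, release the backing
 * page from the private fd so the same guest page isn't charged twice
 * (once in the private fd, once in normal VMM memory).
 */
static int release_private_backing(int private_fd, off_t offset, off_t len)
{
	/* Drop the pages but keep the file size intact. */
	return fallocate(private_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}

/*
 * Sketch: when a page converts shared -> private, free the normal
 * (anonymous) backing for that range. A shared memfd mapping would want
 * MADV_REMOVE instead of MADV_DONTNEED.
 */
static int release_shared_backing(void *hva, size_t len)
{
	return madvise(hva, len, MADV_DONTNEED);
}
```

Whether to call these on every conversion or only past some threshold is exactly the balance discussed above for hot paths like OVMF's bounce buffering.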