Hello, everyone. My name is Quentin. I work for Google, as part of the Android Virtualization team, and today I want to talk about the work we've done on Protected KVM on arm64. So pKVM, as we call it, has been mentioned several times during the conference, so hopefully by now most of you have at least a rough idea of what pKVM is about. And if you don't: the idea is that pKVM is basically an extension of KVM for arm64 that provides confidential computing features, even on CPUs that don't have additional fancy hardware support for it. So pKVM is to some degree comparable to TDX or SEV, for instance, in terms of what it tries to achieve: it's all about protecting guest memory from the host. But it's quite different from TDX or SEV in how it achieves that, and it's precisely the how that I want to talk about today.

So why talk about pKVM? First of all, I think it's pretty cool. That's a slightly biased opinion, obviously, so I'll let you decide. But it's also the first birthday of the code, so it felt like a good time to look back at all the things we've done over the last year. I'm also hoping that this talk will be inspiring to other architectures; in fact, I have seen that some of the concepts we came up with for pKVM have already inspired other architectures, and I'm hoping this will help even more. And finally, we have a relatively large patch series out there which implements most of the things I'm going to discuss, so I'm hoping this talk may bring some color and help a little with the review of that series, which is quite big. There are also still open discussion points that I would like to discuss.

A quick disclaimer: as opposed to the previous talk, this very much intends to be a technical talk. So here's the cake if you were waiting for it. What this means is that I'm not going to talk much about why we're doing any of this, why Android is interested in confidential computing, or things like that, but about what we actually do and how we make it happen. If you are interested in the why, please feel free to reach out at the end of the talk, and there are also resources listed on the slides. The slides themselves contain quite a bit; I'll try to cover as much material as I can, but hopefully they can be useful as a future reference as well.

So, pKVM. The first thing I said is that pKVM is an extension of KVM for arm64. Before I can explain how we extend it, I think it makes sense to start with how KVM works in the first place. In this picture, I'm showing the KVM setup for the so-called nVHE execution mode of KVM. We have two different modes of execution on arm64; I will focus on the nVHE one today, because it's the one relevant to pKVM. As you can see, the ARM architecture defines multiple exception levels: EL0, which is typically where user space runs; EL1, the kernel layer; and EL2, the hypervisor layer. There is also an EL3 defined in the architecture for the firmware layer, but I'm not going to talk about it. On the left side we have the host user space, and on the right side the guest user space. The one interesting thing to notice with this setup is that the host kernel is running at EL1, not at the hypervisor layer, which means that KVM itself does not have direct access to the virtualization features.
The way we enable virtualization of guests is by having a separate piece of code running at EL2, which we call the KVM world switch code, which, as the name suggests, is meant to context switch between the host context and guest contexts. The KVM world switch code is technically part of Linux: it lives in the Linux source tree, is built into Linux and linked into the image, but placed into a different section of the ELF file, and it is essentially left behind at EL2 during boot. It is responsible, essentially, for context switching to a guest when the host requests it. Another thing to notice here is that when we're running in guest context, the Stage 2 translation is enabled. The Stage 2 MMU is, in the ARM world, how the guest physical address to host physical address translation is done in hardware. And interestingly, the host doesn't have a Stage 2 MMU enabled when it's running, which means that in normal, non-protected KVM the host has access to all of memory by default, including guest memory as well as hypervisor memory.

With pKVM, we're basically using this somewhat unusual setup to our advantage. We're extending the EL2 component that sits at the bottom of the picture, making this hypervisor code a little bit smarter and giving it the ability to enable the Stage 2 translation even when the host is running, which puts the hypervisor in a position to enforce access control restrictions on arbitrary pages in the system. Thanks to this, it can prevent the host kernel from accessing guest memory or hypervisor memory. The reason we've designed pKVM as an extension of the nVHE mode like this is that we can then have the hypervisor and the kernel in the same image, which has a few benefits. First, it's great for updatability. In the Android world, we've put a lot of effort into kernel updatability, deploying updated kernels to devices in the field; we don't want to reinvent the wheel for the hypervisor, so if we can just have the hypervisor packaged with the kernel, we get that for free. Second, since the hypervisor and the kernel are part of the same image, they are, by construction, updated atomically, which means we don't have to keep a stable ABI between the host kernel and the hypervisor. That gives us a great deal of flexibility to invent whatever mechanisms we can come up with without having to fear for backward compatibility. And finally, the hypervisor is open source. It's quite a nice thing to know that the code running at a high privilege level on your system is something you can look at, fix, audit, and improve. We can build all sorts of security features and hardening on top of it, and the community can benefit from all of that.

So this is all great, but how does it actually work? To answer that, I would like to introduce you to your best friend for the next 25 minutes, this little character in the bottom left corner, who will represent our user. What I suggest is that we follow that user as they interact with their device; we'll make a few stops along the way to look at how pKVM gets involved in the user's interaction with the device, and in a few places we'll take small tangents to talk about interesting things the hypervisor does. The first thing our user will do is simply power the device on and boot it. And the pKVM story, I would say, starts pretty early when we boot the device.
We have the expectation that the bootloader is responsible for checking that the boot image loaded on the device is effectively the one the vendor intended to load, using signature checking and all the usual industry-standard things. The kernel must be entered by the bootloader at EL2; that's a requirement for KVM on ARM in general, not just for pKVM. And one of the very, very first things the kernel does when it is entered at EL2 (and by one of the first things, I mean within the first twenty instructions or so) is detect that it is running at EL2, install what we call the stub vectors, the exception vectors at EL2, which are basically dummy exception vectors that do nothing, and then ERET to EL1 and proceed to boot from there. A little bit later, we hit the point where we set up the memory management subsystem in Linux. The hypervisor is going to need some memory for itself to manage a few different things, including its page tables: its own set of stage 1 page tables, as well as the stage 2 page tables of the host. So the current implementation uses the memblock API fairly early on to allocate a pool of memory that will be donated to the hypervisor later on. The amount of memory we have to reserve is a function of how much memory there is in the system, because the size of the page tables depends on that.

Later on, pKVM initializes. I should mention that on ARM, KVM is not modular; we cannot load it as a module, it has to be built into the kernel, so this happens before we reach user space. When we reach the point where KVM tries to initialize, in the context of pKVM we have a few additional steps to do. We expect the host to allocate a temporary stage 1 page table for the hypervisor, which comes in handy when we want to initialize the hypervisor, because we can use that page table to turn the MMU on immediately, which is really, really useful to bootstrap the hypervisor. The host also allocates a few more things, such as the stacks for EL2 and the per-CPU pages, and then it issues a hypercall to EL2, which replaces the dummy stub vectors we installed earlier with the pKVM vectors, and then the hypervisor can start bootstrapping itself. First it recreates its own stage 1 page tables using the pages reserved in the memory pool I mentioned earlier; it also initializes the stage 2 page table for the host, and it unmaps itself from the host stage 2. So it unmaps the text section of the hypervisor, the data sections, as well as that memory pool I mentioned. From that point, the hypervisor returns to the host, which proceeds to boot as if nothing happened. Except that at this point the stage 2 page table of the host is enabled, and in fact the host is no longer able to access certain pages in the system, specifically the hypervisor pages, which makes our little friend here really happy.
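Incidentally, on the point that the reserved pool must scale with memory size: the pool mostly has to cover page-table pages for the hypervisor's stage 1 and the host's stage 2. Here is a rough, hedged sketch of that sizing logic, assuming 4K pages with 512 entries per level; the names and the exact formula are mine, not the upstream calculation.

```c
#include <stdint.h>

#define PAGE_SHIFT	12

/* Page-table pages needed to map nr_pages with 4 levels of
 * translation, 512 entries per 4K table page. */
static uint64_t pgtable_pages_needed(uint64_t nr_pages)
{
	uint64_t total = 0;
	int level;

	for (level = 0; level < 4; level++) {
		nr_pages = (nr_pages + 511) / 512;	/* next level up */
		total += nr_pages;
	}
	return total;
}

/* Invented sizing helper: hyp stage-1 plus host stage-2 tables, plus
 * some slack for stacks, per-CPU pages and so on. */
static uint64_t hyp_reserve_pages(uint64_t ram_bytes)
{
	uint64_t ram_pages = ram_bytes >> PAGE_SHIFT;

	return 2 * pgtable_pages_needed(ram_pages) + 16;
}
```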
Before we get back to our little user and see what they are going to do next, I would like to take a quick detour and focus on something I've just mentioned: the hypervisor maps itself using its stage 1 page table and unmaps itself from the host stage 2. Conceptually, this means that the hypervisor is effectively taking ownership of those memory pages. In the context of pKVM, it's the hypervisor's responsibility to track which pages are owned by which entity in the system at any point in time, and to enforce the required access control restrictions. We have, to some extent, formalized this in the context of pKVM, so I want to spend a few minutes on memory ownership tracking.

pKVM has several possible types of owners for memory: the host may own memory, the hypervisor may own memory, protected guest VMs may own memory. There are potentially other entities; I'm not going to describe them much, but you can think about TrustZone and things like that. And from the point of view of each of these owners, a page in the system can be in one of four states. Either the page is completely owned by that entity, which implies that the owner has exclusive access to it; or the page can be shared, and you can either be the provider of the share or the recipient, the borrower; and finally we have a state indicating that the page is no longer accessible, which is what happens, for instance, when we unmap something from the host. The way we track ownership is implemented at the page table level itself. The hypervisor maintains its own page tables as well as the guest page tables and the host page tables. To encode which pages are owned by whom and shared with whom, we use software bits in the page table entries: there are four bits defined in the architecture that are reserved for software use, and we use them for exactly that. We have restricted sharing to be a peer-to-peer thing for now, for simplicity, so we don't allow N-way sharing, where three or more entities share a page. That's for simplicity reasons, and N-way sharing has its own set of problems, so we've started simple. Another interesting thing to note is that the host stage 2 page table is identity mapped. We never use it for translation, only for access control on the memory mapped into the host stage 2, and that lets us do some interesting things. When a page table entry in the host stage 2 is invalid, that is, when bit 0 is clear, bits 1 to 63 are completely ignored by the hardware, so we reuse them to store metadata about the page. Specifically, we store the owner ID of whoever owns the page corresponding to that mapping. We've defined the host identifier to be zero, so that a PTE in the host stage 2 that is all zeroes encodes ownership by the host.

Now that we've defined the states, we have also formalized the transitions from one state to another. Donations of pages, and all of these transitions, happen in two steps: we first check that certain invariants hold in the page tables of the initiator of the transition and of the completer, and then we apply the changes to the page tables. For a donation, we require that the initiator has exclusive access to the page being donated, and it's implied that the completer doesn't have access to it; then we apply the change. I'd like to emphasize that we don't accept donating a page that is shared; we don't have transitive sharing or anything like that.
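To make this concrete, here is a hedged sketch of the state encoding and of the check-then-apply structure of a donation, in the spirit of what I've just described. The bit positions, helper functions, and names are my assumptions for illustration, not the upstream definitions.

```c
#include <errno.h>
#include <stdint.h>

typedef uint64_t kvm_pte_t;

#define PTE_VALID		(1ULL << 0)
/* The architecture reserves four software bits per descriptor; this
 * sketch picks two of them (the positions are an assumption). */
#define PTE_SW_SHARED		(1ULL << 55)
#define PTE_SW_BORROWED		(1ULL << 56)

enum pkvm_page_state {
	PAGE_OWNED,		/* exclusive access */
	PAGE_SHARED_OWNED,	/* shared, we are the provider */
	PAGE_SHARED_BORROWED,	/* shared, we are the recipient */
	PAGE_NOT_PRESENT,	/* no access (e.g. donated away) */
};

#define PKVM_ID_HOST	0	/* an all-zero PTE means "host-owned" */

static enum pkvm_page_state pte_state(kvm_pte_t pte)
{
	if (!(pte & PTE_VALID))
		return PAGE_NOT_PRESENT;
	if (pte & PTE_SW_BORROWED)
		return PAGE_SHARED_BORROWED;
	if (pte & PTE_SW_SHARED)
		return PAGE_SHARED_OWNED;
	return PAGE_OWNED;
}

/* Owner ID stored in an invalid PTE: bits [63:1] are ignored by hw. */
static uint32_t invalid_pte_owner(kvm_pte_t pte)
{
	return (uint32_t)(pte >> 1);
}

/* Invented page-table helpers for the transition below. */
extern kvm_pte_t walk(uint32_t id, uint64_t pfn);
extern void set_owned(uint32_t id, uint64_t pfn);
extern void set_no_access(uint32_t id, uint64_t pfn, uint32_t new_owner);

/* Donation: step 1 checks invariants on both sides, step 2 applies. */
static int do_donate(uint32_t from, uint32_t to, uint64_t pfn)
{
	/* Initiator must have exclusive access: shared pages refused. */
	if (pte_state(walk(from, pfn)) != PAGE_OWNED)
		return -EPERM;
	if (pte_state(walk(to, pfn)) != PAGE_NOT_PRESENT)
		return -EPERM;

	/* Apply: initiator loses access, completer becomes the owner. */
	set_no_access(from, pfn, to);
	set_owned(to, pfn);
	return 0;
}
```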
Before a page can be donated to someone, it has to be unshared first, to get back exclusive ownership. For sharing a page, we have the same kind of initial checks, and then we go and map the page in both page tables, with the appropriate state recorded in the software bits. And unsharing is obviously the reverse operation.

All right. So now that you know everything there is to know about memory ownership tracking in pKVM, we can go one step further. Our user has been able to boot the device and now says: great, I have a pKVM-enabled device, I should try to create a guest. So that's what we'll be doing. When creating a guest with pKVM, the host has a few extra steps to do. First, it needs to allocate more memory, and you will see that this is actually a common theme: every time we need the hypervisor to do something, one of the first things we have to do is allocate memory that we're going to donate to it. We run the hypervisor with as little memory as we possibly can; we donate memory to it dynamically and reclaim it when possible. So the host will allocate memory and then issue a hypercall to tell the hypervisor: please, I would like to create a guest, and here are the pages you can use to store the metadata for that VM. The hypervisor will convert all of those allocated pages to hypervisor-private memory with a host-to-hypervisor donation, as I've just described; then it will allocate what we call a shadow handle, initialize all of the EL2-private data structures, which include a struct kvm and a struct kvm_vcpu for each vCPU as well as the root of the guest stage 2 page table, and then return that handle back to the host. The handle the hypervisor has allocated is what the host then uses to refer to this VM when talking to the hypervisor, so it's comparable to the VM fd that user space gets when it creates a VM with an ioctl.

Maybe one point to make: a lot of the initial patch series we posted used the notion of shadow data structures. We are in the process of renaming those data structures to not be called shadow anymore, so this might even be out of date as of this morning; the shadow terminology is consistent with the series as of the time I wrote the slides. So this is basically what it looks like. When the host creates a guest, it first allocates its own struct kvm, struct kvm_vcpu and all of that, and one of the members of struct kvm, in the arch-specific part, is the shadow handle. Once EL1 has asked the hypervisor to create those things, the hypervisor will have its own private instances (not copies) of similar data structures, with what we call the KVM shadow VM, which includes a struct kvm plus a few other things, and the same for the vCPUs. One of the things the hypervisor also keeps is back-pointers to the host data structures.
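As a rough illustration of that handshake, here is a minimal sketch of what the "create guest" hypercall handler could look like at EL2. Every name in it is invented, and it glosses over most of the real checks.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_VMS		64

struct hyp_shadow_vm {
	uint32_t handle;
	void *host_kvm;		/* back-pointer to the host's struct kvm */
	/* EL2-private struct kvm, vCPU state, stage-2 root, ... */
};

static struct hyp_shadow_vm *vm_table[MAX_VMS];

extern int donate_to_hyp(uint64_t pfn);	/* host -> hyp transition */
extern struct hyp_shadow_vm *init_shadow_vm(uint64_t *pfns, unsigned int n,
					    void *host_kvm);

/* Handler for the (invented) "create guest" hypercall. */
static long handle_create_vm(uint64_t *pfns, unsigned int nr_pages,
			     void *host_kvm)
{
	struct hyp_shadow_vm *vm;
	unsigned int i;

	/* Take ownership of the metadata pages before trusting them. */
	for (i = 0; i < nr_pages; i++) {
		if (donate_to_hyp(pfns[i]))
			return -EPERM;
	}

	vm = init_shadow_vm(pfns, nr_pages, host_kvm);
	if (!vm)
		return -ENOMEM;

	for (i = 0; i < MAX_VMS; i++) {
		if (!vm_table[i]) {
			vm->handle = i;
			vm_table[i] = vm;
			return i;	/* the handle, much like a VM fd */
		}
	}
	return -ENOMEM;
}
```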
As I've explained, we run the hypervisor with as little memory as we possibly can, and in fact the whole hypervisor runs at EL2 in a fairly constrained and limited environment. We can't just run kernel code there; it's a separate exception level, and EL2 is a fairly restricted place to run code. But over the last year, essentially, we've been adding infrastructure to the hypervisor to make our lives a little bit easier, and most of the things we've added are concepts borrowed from the kernel. The idea is really that if you're a kernel hacker, you should feel somewhat at home when working in the EL2 code. So we have spinlocks. We don't have mutexes, because the hypervisor is completely non-preemptible; that's a hard constraint. We have debug features which allow us to assert that certain locks are held in certain places. We have a page allocator; it's quite limited in how it can be used, just because we have very tight memory constraints. We have per-CPU infrastructure, we have a vmemmap, and we have a struct hyp_page, which is really, really small but quite convenient for some things. And we're working on interfacing with the host tracing subsystem as well, to have the hypervisor emit trace events into the host trace buffers, all that sort of thing. So there's a lot we can already do at EL2.

All right, so we have booted the device and created the guest; now it's time to run it. KVM has the notion of vCPU load and vCPU put; the idea is to make a vCPU essentially resident on a physical CPU, as a way to optimize the vCPU run loop. pKVM uses a similar model: we have hypercalls for vCPU load and vCPU put, but with a little bit of added semantics on top. When the host issues a vCPU load hypercall, the hypervisor has to do a number of security and sanity checks before it allows that to happen. For instance, it must make sure that this vCPU is not currently loaded somewhere else, and that we don't already have a vCPU loaded on this physical CPU, these kinds of things. Subsequent hypercalls, such as vCPU run, require that the vCPU has been loaded before; that is a strong, hard requirement. And vCPU put, when we release the reference we have on the vCPU, will sync what we call the vCPU shadow state back to the host for a non-protected VM, drop the reference, and then clear the per-CPU tracking we have at EL2 for loaded vCPUs. To quickly explain the terms sync and flush: these names have traditionally been used in KVM/arm64, and they may not be clear to everyone. The idea is that a sync goes from the EL2 state back to the EL1 copy, which typically happens after we've run a vCPU, when we copy part of the state back to the host so it can handle a fault, for instance; a flush is the other way around. So, okay, we've loaded the vCPU; now it's time to run it. The vCPU run hypercall in the context of pKVM takes essentially no parameters, because vCPU run expects that the vCPU we want to run is already loaded on the physical CPU, so whatever is passed as a parameter will be ignored. In this case, the hypervisor will flush part of the vCPU state, meaning it will cherry-pick a very small part of the vCPU state coming from the host, the part it believes can be trusted or at least is not going to cause security problems, then context switch to the guest using the EL2 copy of the vCPU, and ERET into the guest.
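Here is a minimal sketch of the bookkeeping behind those load/run checks; the per-CPU table, the helpers, and the handler names are invented for illustration.

```c
#include <errno.h>
#include <stddef.h>

#define NR_CPUS	8

struct hyp_vcpu {
	int loaded_on;			/* physical CPU, or -1 */
	/* ... EL2-private register state, etc ... */
};

static struct hyp_vcpu *loaded_vcpu[NR_CPUS];	/* per physical CPU */

extern unsigned int this_cpu(void);			/* invented */
extern struct hyp_vcpu *lookup_vcpu(unsigned int handle,
				    unsigned int idx);	/* invented */

static int handle_vcpu_load(unsigned int handle, unsigned int idx)
{
	struct hyp_vcpu *vcpu = lookup_vcpu(handle, idx);
	unsigned int cpu = this_cpu();

	if (!vcpu)
		return -ENOENT;
	if (vcpu->loaded_on != -1)	/* already loaded elsewhere */
		return -EBUSY;
	if (loaded_vcpu[cpu])		/* this pCPU is already taken */
		return -EBUSY;

	vcpu->loaded_on = cpu;
	loaded_vcpu[cpu] = vcpu;
	return 0;
}

static int handle_vcpu_run(void)
{
	/* vCPU run takes no parameter: it runs whatever is loaded here. */
	struct hyp_vcpu *vcpu = loaded_vcpu[this_cpu()];

	if (!vcpu)
		return -EINVAL;	/* a prior load is a hard requirement */
	/* flush the trusted part of the host state, switch to guest ... */
	return 0;
}
```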
When we exit back, the host will sync part of the state, meaning some of the state is copied from the shadow vCPU back to the host copy. When the guest exits, there are some things we can handle at EL2 directly; for instance, the FP state is switched between host and guest lazily, and that's the kind of thing we can do there. There is some other logic that can be handled at EL2 as well, but it's more the exception than the rule: most exits have to be handled by the host. And in fact, because the host still has its own struct kvm and vCPU structs, and we populate the state we care about into them, most of the host-side handling is effectively unmodified; it's just very similar to normal KVM.

One notable exception, however, is memory aborts. Let's imagine we're at EL2 with a guest that keeps exiting, and we manage to handle every exit at EL2 and re-enter the guest. At some point, we'll get a data abort or an instruction abort coming from the guest. What the hypervisor does in this case is copy the ESR, the exception syndrome register, and the GPA back into the host vCPU struct, and then return to EL1 saying: please handle this fault for me. As with traditional KVM, the host will look up the memslot and convert the GPA into an HVA. And in the current implementation, instead of just doing a GUP and mapping things into the guest, we have KVM take a long-term GUP pin on the page backing that HVA. The reason we take a long-term GUP pin is to prevent swap and page migration, because we are about to donate that page to the guest and lose access to it, so we don't want KSM or whatever to mess with it. I'll talk a bit more about that later. Once that is done, we have KVM top up a per-vCPU memcache, similar to what we already do with the MMU cache, and then issue a pKVM guest-map hypercall. At that point, the hypervisor will try to top up its own vCPU memcache from the memcache the host has provided. Every time we take a page out of the host memcache and into the hypervisor memcache, we go through a full host-to-hypervisor donation, to make sure we have the right invariants in place. Once the hypervisor has enough memory in its memcache to create the guest mappings, it goes and modifies the guest stage 2 page table. For protected guests, the pages are donated entirely to the guest, so they are also unmapped from the host; for non-protected guests, that is, traditional KVM guests, we just do a share and retain the mapping in the host stage 2 as well. Once that is done, we return to the host, which proceeds to issue the vCPU run hypercall again, and rinse and repeat.
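A hedged sketch of that host-side path, in kernel style: resolve the GPA to an HVA through the memslot, take the long-term pin, top up the memcache, then call into EL2. The hypercall symbol and the memcache helper are names I made up; the rest are real kernel APIs, used here from memory.

```c
#include <linux/kvm_host.h>
#include <linux/mm.h>

int topup_hyp_memcache(struct kvm_vcpu *vcpu);	/* invented helper */

static int pkvm_mem_abort(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	unsigned long hva = gfn_to_hva(vcpu->kvm, gfn);	/* memslot lookup */
	struct page *page;
	long pinned;
	int ret;

	/* Long-term pin: no swap, no migration, no KSM merging. */
	pinned = pin_user_pages(hva, 1, FOLL_WRITE | FOLL_LONGTERM, &page);
	if (pinned != 1)
		return pinned < 0 ? pinned : -EFAULT;

	/* Give EL2 the pages it may need to extend the stage-2 tables. */
	ret = topup_hyp_memcache(vcpu);
	if (!ret)	/* donate (protected) or share (non-protected) */
		ret = kvm_call_hyp_nvhe(__pkvm_host_map_guest,	/* invented */
					page_to_pfn(page), gfn);
	if (ret)
		unpin_user_pages(&page, 1);
	return ret;
}
```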
I've mentioned how we handle memory aborts, but that doesn't tell us how we do MMIO exits. There are a few problems there which require some additional work. One is that the hypervisor has no understanding of memslots; we just don't have that concept at all at EL2 right now. And for fairly obvious confidentiality reasons, the guest registers are not copied back to the host kernel when we exit the vCPU, which makes MMIO handling a little bit complicated. So we had to expose from the hypervisor a set of hypercalls to guests, to allow them to declare their MMIO ranges, basically. A guest can issue a hypercall to say: I'm willing for the host to handle MMIO for me in this region of my IPA space, or GPA space. Then, when the hypervisor sees exits in those regions, it will use x0 as a transfer register to the host. In a similar vein, we have also exposed share hypercalls to guests, which essentially allow a protected guest to share pages back with the host, to allow communication, typically virtio and things like that. This gets a little interesting in some cases, because all of guest memory is faulted in lazily, which means the guest may try to share a page that it doesn't actually have yet. In that case, we have the hypervisor fudge the exception syndrome register before it returns to the host, to pretend that this was in fact a data abort from the guest, to trigger the full fault handling path in the host, get the host to donate the page, and then replay the HVC on the guest side at the next vCPU run.

A few notes on how we load things into the guest, because it ties into how memory gets mapped in there. The expectation we currently have is that the device bootloader, which is the Android bootloader for us, has to copy a trusted guest bootloader into a reserved memory region, which is described in the device tree of the host, and that memory region is then unmapped by the hypervisor when it initializes. So when I said earlier that the hypervisor unmaps itself from the host, it also unmaps the guest bootloader. The VMM can then specify, when it creates a guest: I want the bootloader to be loaded at this particular region of the guest IPA space. And the first time a vCPU is run, we force the program counter of the guest into that region, to fault in the pages of that range. As the hypervisor sees those pages being donated by the host, it copies the guest bootloader into the guest, which can then proceed to load the guest payload and attest that whatever is being loaded into the protected guest is what is expected.

So, I've been talking about guests a lot, but the interesting thing with pKVM is that we have to handle not only guest faults, but also host stage 2 faults, because those can happen. In the implementation we have, the host stage 2 mappings are created lazily, just like they are for guests, essentially. And every time we take a fault at EL2 because of a host stage 2 fault, we check the state of the page by walking the host stage 2 page tables. If we find that the PTE corresponding to the address we're trying to access is invalid, we check bits 1 to 63, as I mentioned earlier, to see who owns that page. If the owner happens to be the host, we can just create a valid mapping and return. If the PTE we found is valid, it probably means we raced with someone else who created the mapping for us just before, so we can just re-enter the host and hope for the best. But if the PTE is invalid and the owner ID we found in the remaining bits is not the host's, then we have a problem: the host is accessing private memory, and this is where the fun begins.
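That fault-handling logic fits in a few lines; here is a hedged sketch, reusing the PTE encoding from the earlier sketch, with invented helper names.

```c
#include <errno.h>
#include <stdint.h>

typedef uint64_t kvm_pte_t;

#define PTE_VALID	(1ULL << 0)
#define PKVM_ID_HOST	0

extern kvm_pte_t host_stage2_walk(uint64_t addr);	/* invented */
extern int host_stage2_idmap(uint64_t addr);		/* invented */
extern void inject_fault_into_host(void);		/* invented */

static int handle_host_mem_abort(uint64_t addr)
{
	kvm_pte_t pte = host_stage2_walk(addr);

	/*
	 * Valid PTE: we probably raced with another CPU that created
	 * the mapping already, so just re-enter the host and retry.
	 */
	if (pte & PTE_VALID)
		return 0;

	/* Invalid PTE: bits [63:1] hold the owner ID of the page. */
	if ((pte >> 1) == PKVM_ID_HOST)
		return host_stage2_idmap(addr);	/* lazy identity mapping */

	/* The host touched memory it doesn't own. */
	inject_fault_into_host();
	return -EPERM;
}
```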
So the handling we have implemented in current pKVM is basically to have the hypervisor re-inject the exception back into the host and have the host deal with it. If the fault was taken from EL0, it means user space accessed pages that have been donated to a guest. That's typically going to be the VMM, which still has a mapping of a page that has been donated; it tries to access it, and obviously something has to be done. In that case, the hypervisor will re-inject the exception into the host and set an additional bit in the exception syndrome register, to make sure the host handler can distinguish that exception from a normal EL0 paging fault. When we see that at EL1 in the host kernel, we say: the hypervisor is telling me that this process has been accessing memory it shouldn't, and I really don't know what else to do, so I'll just SEGV the user space process that did it. If the fault was taken from EL1, however, it's a little bit more complicated. In this case, we have the hypervisor repaint the exception syndrome register to pretend that the fault we've taken is a same-level fault, and then re-enter the kernel saying: there you go, you've caused yourself to enter the exception handler, deal with it. There are cases where the kernel can handle same-level faults, but not that many. For instance, let's imagine that we have a compromised or malicious VMM that does a syscall and passes as a parameter to that syscall memory that belongs to a protected guest. In that case, the kernel will access that memory using copy_to_user, copy_from_user and things like that, which means we will be in a position to handle the same-level fault and just fail the syscall. But take the same example and have a process strace this malicious VMM: the strace process will try to inspect whatever memory is being passed to the system call, and it will do that, in this example, using the process_vm_readv syscall, which does get_user_pages_remote, if my memory is correct, and then accesses those pages through the linear map. In that case, the hypervisor will say: you can't access those pages; but the kernel is not in a position to actually handle the fault, and we just have to bring the machine down. This is, in my opinion, the biggest remaining problem with the host-side implementation of pKVM. I think the hypervisor side is now getting pretty solid, but there is still work to be done on this front before we can land this upstream.
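For illustration, here is a sketch of what the host-kernel side of that story could look like. The marker bit and its name are pure assumptions; user_mode(), force_sig() and fixup_exception() are real kernel facilities, used here from memory.

```c
#include <linux/extable.h>
#include <linux/ptrace.h>
#include <linux/sched/signal.h>

#define ESR_PKVM_INJECTED	(1UL << 62)	/* invented marker bit */

static void handle_pkvm_injected_fault(struct pt_regs *regs,
				       unsigned long esr)
{
	if (user_mode(regs)) {
		/* EL0 (e.g. the VMM) touched donated memory: SEGV it. */
		force_sig(SIGSEGV);
		return;
	}

	/*
	 * EL1: survivable only when the access came from a uaccess
	 * routine with an exception-table fixup, e.g. copy_to_user().
	 * An access through the linear map has no fixup: game over.
	 */
	if (fixup_exception(regs))
		return;

	panic("pKVM: host kernel accessed protected guest memory");
}
```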
One promising solution to that problem would be to extend the private fd proposal that came from the TDX developers. It has a lot of good features for what we're trying to do here: it should prevent swap and page migration, it should, if done correctly, prevent the kernel from accessing guest memory through paths like the one I just described, and it potentially also offers a suitable API to implement hypervisor-assisted page migration later. However, we would really need it to support what we call in-place conversion, meaning non-destructive shares, unshares, and donations. Another thing we're currently looking at is whether secretmem would actually be a good option for this, because we don't necessarily mind keeping guest memory mapped in the VMM as long as only the VMM can access those pages, since we can SEGV it cleanly if that happens. The biggest problem is when the kernel itself accesses those pages, so maybe secretmem extended with some of the features of the private fd proposal would be a good option. Another possibility, which is not without compromises, would be to have the hypervisor just kill the guest. That is slightly ugly, but it should work. The idea is that when the hypervisor sees the host access guest memory, it kills the guest, poisons the page, returns it to the host, and just lets the host read or write the poisoned values and essentially proceed rather than fail. One of the pros is that we could keep our long-term GUP pinning approach working, and we wouldn't have to make any changes on the core MM side, essentially. The problem is that it's quite a bit of complexity added at EL2; I think the guest being blamed for something it's not responsible for is not necessarily the cleanest option; and it also means that KVM would learn that the guest has been killed completely asynchronously. It's only the next time you try to do a vCPU run hypercall that you'd see: oh, well, this guest is gone, what happened? And by then it's way too late to know why it was killed, which makes things really hard to debug.

All right, so coming back to our little guy here: the last thing we need to do is tear down the guest. We've created the guest, run the guest, and now we're tearing it down. The teardown procedure is, again, the host issuing a hypercall to say: I no longer want to use this guest. The hypervisor has to do a few things before it can let that happen. One of them is to check that there are no loaded vCPUs while this is happening. Then we need to take all of the pages that have been donated to the guest and return them to the host, so they can be freed back into the memory management system. Obviously, we need to poison the pages as we do that, because otherwise we might risk exposing guest secrets left in them. As a reminder, EL2 is non-preemptible, so we just cannot poison all of guest memory in one go; that could take minutes, non-preemptible, at EL2. It's just not happening. So we had to do it in a slightly more complicated way. Assume we have a system like this, with Android as our host (I'm using Android as an example here, but it's just a host OS) on the left-hand side, and the guest. Our host tries to tear down the guest. What happens is that the guest notionally disappears, but we put its pages into a "pending reclaim" state, and at that point those pages still have not been touched in any way. Then we let the host issue hypercalls to reclaim the pages one by one. When that happens, the hypervisor maps each page using a fixmap, clears it, and returns it to the host, and so on and so forth until everything has been reclaimed. This means we can reschedule between all of those hypercalls and make sure we never have a really long non-preemptible section.
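A minimal sketch of that reclaim path, with invented hypercall and helper names (treat everything here as illustrative, not the posted series):

```c
/* Host side: reclaim, one hypercall per page, rescheduling between. */
static void pkvm_reclaim_guest_pages(unsigned int handle,
				     u64 *pfns, unsigned long nr)
{
	unsigned long i;

	for (i = 0; i < nr; i++) {
		kvm_call_hyp_nvhe(__pkvm_reclaim_page, handle, pfns[i]);
		cond_resched();	/* keep non-preemptible sections short */
	}
}

/* EL2 side: scrub the page through a fixmap, then give it back. */
static int handle_reclaim_page(unsigned int handle, u64 pfn)
{
	void *va;

	if (!page_pending_reclaim(handle, pfn))	/* invented check */
		return -EPERM;

	va = hyp_fixmap_map(pfn);	/* short-lived EL2 mapping */
	memset(va, 0, PAGE_SIZE);	/* poison: never leak guest secrets */
	hyp_fixmap_unmap();

	return assign_page_to_host(pfn);	/* invented ownership update */
}
```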
All right, that's kind of the end of the story here. There are a lot of things that I have not talked about; there is a lot more, but there's just no time. There are considerations around talking to the secure world, which is a big thing in the ARM world; we need DMA protection, so we have IOMMU support, for some IOMMUs, in pKVM; there are timers, power management, TRNG, and probably a lot of other things I just couldn't cover. And to finish, I just wanted to mention a few limitations of the patch series we have right now. We are currently missing features, even for non-protected guests, things such as dirty logging or read-only memslots. We are working on a lot of those; these are things we intend to have fully working, at the very least for non-protected guests, before we can land this stuff. Some of the other things we don't currently support: huge pages, which is also very much on the to-do list, and we don't support kexec or things like device assignment yet. And with that, I will thank you all and open it up for questions.

All right. Could I take the virtual questions first? "What's the plan to support the SMMU in pKVM, and how is page ownership managed with an SMMU?" Yes, so we definitely plan to have SMMU support. There is some work that I know of that's ongoing to get that done; I'm not sure anything has been posted on the list yet, but I expect that to happen eventually. In terms of page ownership, the SMMU case is relatively simple: it's almost the same thing we do for the host. It also depends on whether or not we can handle SMMU faults directly at EL2, but it's not fundamentally different. Every time we have to update the host stage 2 page tables, we would need to go and walk the SMMU's stage 2 page tables as well, and update them accordingly to match what we've done for the host.

"And the other question is: is there any performance implication of the host now running under a stage 2 translation, just from the translation itself, and how noticeable is it?" Yes, so there is, potentially. Theoretically, the main overhead is going to come from two things, which are essentially the same thing: TLB pressure and the cost of TLB misses. By having an additional layer of translation, we use the TLBs more, we put more pressure on them. One thing that's interesting to note on that front, however, is that the host owns the vast majority of memory, and all the host mappings are identity mapped, which means we're in a position where we can very easily use really, really large block mappings to cover everything. We can use gigabyte mappings to cover pretty much all of memory during boot, which makes the overhead of the host stage 2 practically impossible to actually measure.

"Okay, so I have a question. It seems like the teardown process has some overhead, because you need to do the..." Yeah, that's exactly it. "How much of an overhead is that? And have you considered doing it asynchronously?" Well, so far it's not been too big of a problem for us, and I think that's mostly because the main users of this so far are mobile use cases; we don't have extremely large machines with terabytes of memory or something, it's a few gigs, so it's not the end of the world. I think if we wanted to scale this, then yes, doing things asynchronously would probably be a good idea. "That's what we're trying to do on s390 now." Right, yes, I've seen your talk, it was quite interesting. I think the idea of deferring that and having a separate process do the cleanup, to hide the latency to some degree, would be perfectly applicable here. "And just one other thing I'd like to understand."
"From what I understand, if the user, from the QEMU console, tries to dump some memory from a protected guest, QEMU will die?" Yes. "Okay, thanks."

"So, regarding this observation, that's why you're going to look into the private memory thing. And my question to you: could you go, also for the record, a bit more into detail on why you need the in-place conversion?" Okay. So one of the reasons we're doing pKVM is to replace, well, maybe not replace, but to have a better way of doing something we've been doing for many years. Having things that run protected from the host kernel on a phone is actually really old; we've been doing that for fifteen years. It's the TrustZone story. So there's a lot of baggage there, a ton of history of transitioning pages to TrustZone, and of payloads that can in some cases use really large amounts of memory, to deal with, I don't know, video playback, DRM use cases. If you think of 4K video frames, it's potentially not megabytes but hundreds of megabytes of memory being passed around between the host kernel and whatever handles them in that secure island that is TrustZone. If we want some of those TrustZone workloads to move towards VMs, we need an efficient way to have that zero-copy, sharing-like memory transfer happen between the two worlds. And this is why the in-place conversion stuff is necessary.

"Yeah. So for the in-place conversion, I think I have an idea that I'm like 95% sure will work, for mapping guest private memory with the latest proposal. It's a shim around shmem from memfd, and the way that works is that they just bury the fops inside another fops that doesn't wire up mmap. So I think what we can do is have KVM expose, or rather extend, its API for how the user space VMM says: this is shared, this is private, this is whatever. And we can add a third flavor that says: this is user-space mappable, and only allow that when it's not mapped into the guest. And then we can map from inside KVM, so it doesn't require a memslot update, so we avoid any SRCU pain. Then you can do all of the filling from host user space, and then host user space can say: convert in place. And if the underlying hypervisor allows that (TDX and SEV-SNP don't, but pKVM does), then we can do an in-place conversion. We have to zero the memory, and it just naturally gets into the guest. So I'm like 95% certain that'll work." Yeah, that's great to hear. Honestly, I've seen patches floating around; I've had no time to actually look at them, but it's really good to hear. Thanks.

"So, sorry, this is really a very basic question, but since you have the owner ID only in the invalid PTE: are there scenarios where you need to know the owner of a page in a fault handler when its PTE is valid, when it's actually mapped?" So if it's mapped into the host, it means the host has access to it, so we're not going to take a fault if... "I'm thinking about the case where the guest would actually touch that page while it's mapped by the host. Why would that not... is that a scenario? If the guest can hit an invalid PTE, why can't it hit a valid PTE? That's the question I have." Yes. Yeah. So we're playing that game with the ownership ID only in the host stage 2 page table.
Every time we hit an invalid PTE in the guest page tables, we're not going to map just some random pages into the guest; we have the host tell us: this is the page you can map into the guest at that IPA. And then we do the checks that the host is actually allowed to donate that page to the guest, all that stuff. "Oh, I think the key point there is the other thing you said earlier, that sharing is only ever between two entities, never more." Okay.

"You attest the guest... I'm guessing the host you attest through some bootloader stuff in Android, but what do you do about the guest payloads, like the boot firmware, to make sure you have everything running right?" Yeah, so, the bootloader stuff. The idea is that the host doesn't actually see the guest's bootloader; it's the hypervisor that copies a trusted bootloader, one that was signed and so on, into the guest, and the host doesn't get involved in this at all. In our first implementation, we have something that's based on U-Boot, but we'd like to rewrite that in some other language that's a bit stronger. The idea is that we have the hypervisor copy the bootloader into the guest and force the guest to run that bootloader first; that bootloader is then in charge of receiving the guest payload through whatever mechanism it wants (in practice, for us, it's virtio, but it could just read it from guest memory directly), and it can then measure the guest payload. That bootloader also has access to secrets that the host doesn't have access to, in order to derive a signed, unforgeable identity for the guest, which the guest can then use to say: based on what the trusted bootloader has given me, I can prove that I am who I pretend to be.

"What's the root of trust for the trusted bootloader that gets shoved into the guest, then?" Sorry? "How does the hypervisor know that the bootloader it's loading into the guest is actually trusted and correct?" So the bootloader that gets loaded into the guest is placed by the device bootloader in a memory carve-out, which we just take away from the host really early during boot. We have a memory region in physical memory where the device bootloader dumps the kernel image as well as the guest bootloader, and we just unmap that from the host.

"Recursively, though, what's your root of trust? If you're looking at a DRM use case, and the whole point of that is that someone doesn't want you decoding their video, how do you know that someone hasn't rooted your phone and is using pKVM to get at it, having slightly modified things?" Yeah, so, the way these things typically work, and I'm not the best person to answer that question, I'll be honest, is that if you root the device, the keys that the bootloader can use to sign whatever payload are removed; we just wipe the keys when you unlock the device. "Thanks." And that's also how, you know, we would never provide those secrets to a non-protected guest, for instance; you can't just load the right payload in a non-protected guest and get the same identity. The way this works is by just not providing the keys there, and they get wiped when the device is rooted. All right, well, thank you very much.