Hello everyone, this is Yu Zhang from Intel's virtualization team, and my topic today is unmapping guest memory. Before I give this presentation, I'd like to say that many ideas here are based on the work of Kirill Shutemov and Sean Christopherson, so the credit should go to them.

Okay, here is the agenda of this session. First, I'd like to talk about the background; for example, why unmapping guest memory is necessary, especially in Intel's TDX architecture. Next, I will summarize the requirements of guest memory unmapping; for example, why it is possible. Then we will think about the basic targets of guest memory unmapping; I mean, what do we want to achieve? In the end, I will describe the different design options which were discussed in the community. Actually, some proposals are still under discussion while I'm recording this video, and it's very likely there will be more information by the time you watch it. Hopefully we can get some agreement by then.

So, about the background. As we all know, mapping guest memory into host user space is a common practice in current KVM. Such a design provides convenience for the host to perform emulation. For example, KVM needs to access guest memory to walk guest page tables and to decode guest instructions. A device model like QEMU needs to access guest memory to do the emulation sometimes. Guest memory is also shared by QEMU with host user space processes like DPDK and SPDK, so they can perform DMA operations directly into guest memory. Guest memory that needs to be accessible by the host's I/O or emulation logic can be considered as shared; otherwise, it can be regarded as private. However, since the host has no idea which part of guest memory will function as shared, the most convenient way for KVM is to just map the entire guest memory into its user space, for example, QEMU's virtual address space. Also, coupling HVA with GPA in KVM's memslots facilitates the GPA to HPA translation: the KVM MMU page fault handler can just call get_user_pages() to get the HPA once the HVA of the GPA is identified.

Well, with the advent of security features like Intel TDX and AMD SEV, KVM is no longer inside the TCB, and the host shall not be able to access guest private memory. Okay, let's take a look at Intel TDX. Intel TDX uses an enhanced MKTME engine to enable memory encryption and meanwhile provide memory integrity protection. With this enhanced MKTME engine, we can prevent ciphertext analysis by untrusted software: for example, when the host is trying to read any guest private data with its shared KeyID, zero data instead of the ciphertext will be returned. It can also prevent data modification without detection. That means we need to ensure the TD guest will read back the same data which was last written to its private memory. So, if the host accidentally writes to any guest private page using the shared KeyID, a corrupted message authentication code will be generated.

So what is the message authentication code, and what will happen if the TD consumes some data with a corrupted MAC? Okay, the message authentication code, MAC, is a 28-bit integrity check value associated with each cache line in memory. When the cache line data is written to memory, the MKTME engine will encrypt the data with the appropriate encryption key, and then it will compute a MAC value over the ciphertext, which will be stored in ECC memory as the MAC data. In addition to the 28-bit MAC, there is a TD owner bit, which is also used when calculating the MAC value, and it is also stored separately along with the MAC.
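To make that layout concrete, here is a tiny conceptual model of the per-cache-line metadata I just described. This is illustrative only: the real metadata lives in ECC memory, is managed entirely by the hardware, and is not software-visible, and the exact field layout here is my own.

```c
#include <stdint.h>

/* Conceptual model (not a software-visible structure) of the integrity
 * metadata the enhanced MKTME engine keeps per cache line. */
struct cacheline_meta {
	uint32_t mac      : 28; /* MAC computed over the ciphertext */
	uint32_t td_owner : 1;  /* set if last written with a private KeyID */
};
```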
The TD owner bit indicates whether the data belongs to a TD or not. It is set to one if the write operation is using a private KeyID, otherwise to zero. Later, when the cache line is loaded from memory, the MKTME engine will check the TD owner bit first and then verify the MAC value. The cache line will be marked as poisoned if the integrity check fails. On subsequent consumption of the poisoned data, there are two possible scenarios. If the core determines that execution can continue, it will report the poison with a machine check exception. But if the core determines execution cannot continue, it will do an unbreakable shutdown. For example, the CPU may not recover from a machine check exception if the consumption of poisoned data happens in some complex microcoded instruction which involves multiple memory accesses. So in the worst case, incorrect or malicious writes to TD private memory may lead to a system crash.

And our solution is to map a private page to one specific TD only. That means we cannot map private pages into multiple guests. In TDX, this is guaranteed by the SEAM module, also known as the Intel TDX module. Also, that means we shall not map any private pages into host user space. After all, if the host is not allowed to access a private page, why should we keep this mapping set up?

As mentioned earlier, the need to map guest memory into the host virtual address space is because binding an HVA with a GPA offers convenience for KVM to get the HPA. But this is only a design choice of KVM; it is not a prerequisite for the GPA to HPA translation. Another reason is that the host has no idea which part of guest memory is shared. As long as KVM can perform the GPA translation, and as long as QEMU knows the addresses of the guest's shared buffers, it does not have to map all the guest memory into its user space.

So what memory shall be considered as shared, and what memory shall be private? Well, in TDX, since instruction decoding is done inside the TD, there is no need for KVM to walk the guest page table to fetch any instruction. So those pages, along with the guest's normal data pages, shall no longer be accessible to the host. Actually, the shared pages shall be very limited. The ones I can imagine are the PV clock pages and some of the guest DMA buffers, such as the SWIOTLB buffer and the ones allocated by the DMA direct allocator. So the remaining question is how to unmap guest memory and meanwhile maintain the GPA to HPA translation.

Okay, let's take a look at what we want from guest unmapping. First, since only the guest knows which part of its memory is shared, it is the job of the guest to notify the host about shared pages, whether explicitly or implicitly. And by guest, I mean both the guest kernel and the virtual firmware, which in TDX is also called the TDVF. Second, we do not want KVM to manage the host page table directly; that task should belong to the Linux memory management subsystem. If necessary, the memory management subsystem can also choose to unmap guest private memory from its direct mapping. For KVM, it needs to take care of the transition between shared and private for a guest page. Another responsibility of KVM is to maintain the GPA to HPA translation. And if we want our solution to be applicable to non-TDX environments as well, KVM or the host kernel should also take responsibility to guarantee the one-to-one association between a guest private page and the host physical one. That means an HPA shall be assigned to only one specific GPA of one guest at any moment. And of course, support for normal VMs shall not be impacted.
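Before we look at the design options, it helps to recall the translation path that any option must preserve. Below is a simplified sketch, not actual KVM source: the field names follow KVM's struct kvm_memory_slot, but everything else is a minimal standalone model.

```c
#include <stdint.h>

#define PAGE_SIZE 4096UL /* assumed host page size */

/* Simplified model of a memslot, which couples a GPA range (base_gfn,
 * npages) with an HVA range (userspace_addr). */
struct kvm_memory_slot {
	uint64_t base_gfn;            /* first guest frame number in the slot */
	uint64_t npages;              /* slot size in pages */
	unsigned long userspace_addr; /* HVA where the slot is mapped */
};

/* GFN -> HVA is pure arithmetic once the right slot is found. */
static unsigned long gfn_to_hva(const struct kvm_memory_slot *slot,
                                uint64_t gfn)
{
	return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
}

/* HVA -> HPA: the KVM MMU page fault handler then calls
 * get_user_pages() on this HVA, i.e. it depends on the host page
 * table. Removing the HVA mapping is exactly what both design
 * options below must compensate for. */
```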
About the design options: for now, there are two. The first one was posted by Kirill, and we call it the struct-page-based guest unmapping. In Kirill's patch, guest memory is still mapped when the VM is created. Later, the protection can be provided as long as the host cannot use the HVA to access the physical page. So eventually, the PFN information will still be kept in the host PTE, just with, for example, the present bit cleared. When host user space tries to access a private page through its HVA, a page fault will be generated and a SIGBUS will be sent to the user space. To achieve the GPA to HPA translation, new get_user_pages() flags are introduced so that a PFN can still be returned to the KVM MMU page fault handler, and these flags shall be used by KVM only. But for the one-to-one association between the GPA and the HPA, we can only rely on the TDX module; I mean, the cost of adding extra information in struct page or introducing a whole new data structure is actually not affordable. Another disadvantage of this design is the lack of support for memory that isn't backed by struct page.

Well, to unmap guest memory, the struct-page-based solution has two different sub-versions. The first version leverages an existing flag in struct page, the hardware-poisoned flag. In this version, the HVA is unmapped in the host page table by KVM with a fake swap entry. In order not to overwrite the existing hardware-poisoned flag, the second version introduces a new struct page flag, PG_guest. This version also adds some new mprotect() flags and VMA flags so that a VMA range can be marked as protected. Another improvement in this version is that the host page table is updated by the Linux memory management subsystem when handling the mprotect() system call.

Okay, here is the flow of the struct-page-based solution; a sketch of the user space side appears after this section. Once the memory protection feature is detected by the guest, for example with CPUID, the vCPU can trigger a hypercall into KVM to mark a GFN range as shared or private. A new exit reason will be used by KVM to forward the hypercall to QEMU, which invokes the mprotect() system call into the Linux kernel. The Linux memory management subsystem will update the host PTEs in the system call handler. Later, if an EPT violation happens on a private page, the KVM MMU page fault handler will use a new flag, for example FOLL_GUEST, in get_user_pages(), so that a PFN can still be generated and be put into the SPTE.

The second design proposal is from Sean Christopherson. In this proposal, the guest private memory is backed by an enlightened file descriptor. This file descriptor shall be a dedicated one, meaning it shall not be shareable between multiple processes. And it shall also be private, meaning it cannot be mapped into any user space. Also, some extra flags may be necessary to convert the entire file to private memory and to restrict truncating the file size. Well, this is a brand new design, which decouples the TDP translation from the host page table. Actually, there was a similar idea proposed by Isaku at KVM Forum two years ago, so maybe we can take a reference from that topic. About the GPA to HPA translation, Sean proposed to introduce new memslots for the guest private memory, possibly with a whole new address space. And in such a memslot, private memory operations are needed: for example, operations from the backing store are expected to translate a GFN to a PFN based on the file offset, and to get the TDP mapping level. Meanwhile, KVM may need to offer operations to the Linux memory management subsystem to support the invalidation and swapping or migration of a GFN range.
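Going back to the first option for a moment, here is the user space side of the struct-page-based flow described above, as a minimal sketch. The exit reason, the exit payload layout, and the handler are hypothetical placeholders of my own; a real version would also pass the proposal's new protection flag rather than plain PROT_NONE. None of this is mainline KVM or Linux API.

```c
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical payload KVM hands to QEMU with the new exit reason
 * when the guest hypercalls to convert a GFN range. */
struct memory_protect_exit {
	uint64_t gpa;        /* start of the range, as a GPA */
	uint64_t size;       /* length in bytes */
	uint32_t to_private; /* 1: shared -> private, 0: private -> shared */
};

/* QEMU-style handler: forward the conversion to the Linux kernel by
 * calling mprotect() on the HVA range that backs the GPA range, so
 * the memory management subsystem updates the host PTEs. */
static int handle_memory_protect(char *slot_hva, uint64_t slot_base_gpa,
                                 const struct memory_protect_exit *e)
{
	char *hva = slot_hva + (e->gpa - slot_base_gpa);
	int prot = e->to_private ? PROT_NONE : (PROT_READ | PROT_WRITE);

	return mprotect(hva, e->size, prot);
}
```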
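And for the second, fd-based option, here is a rough sketch of what a private memslot and its backing store operations might look like. The struct layout and the callback names are my own guesses to illustrate the idea; the proposal under discussion may end up looking quite different.

```c
#include <linux/types.h>

/* Hypothetical private memslot: a dedicated backing fd replaces the
 * HVA, so the slot never appears in any host page table. */
struct kvm_private_memslot {
	__u64 base_gfn;    /* first GFN covered by this slot */
	__u64 npages;      /* slot size in pages */
	int   fd;          /* dedicated, non-mappable backing file */
	__u64 file_offset; /* offset of base_gfn within the file */
};

/* Hypothetical operations the backing store would expose to KVM. */
struct private_memslot_ops {
	/* GFN -> PFN via the file offset; no HVA or host PTE involved. */
	long (*gfn_to_pfn)(struct kvm_private_memslot *slot, __u64 gfn);
	/* Largest TDP mapping level (4K/2M/1G) usable for this GFN. */
	int (*mapping_level)(struct kvm_private_memslot *slot, __u64 gfn);
};
```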
Compared with the struct-page-based solution, this solution has several advantages. First, with the fd dedicated to one specific guest, and with no HVA mapping, the one-to-one association between a host physical page and a guest private page can be maintained with no dependency on the TDX module. And since the host page table is not populated for the guest private memory, the host will have a smaller footprint. Also, with no fake swap entry in the host PTE, the chance of revealing a private PFN is lowered. Another advantage is that we do not need to overload any existing struct page flags or introduce a new one, because this design does not rely on struct page at all. So in the future, this solution may be easily extended to support memory that isn't backed by struct page.

And of course, the benefit is not free. The design requires significant changes in KVM. For example, new ioctl interfaces may be needed to handle the transition between private and shared for a GFN range, and the shared GFN ranges could be small and scattered. Also, the new GFN to PFN translation approach relies on enabling backing store support in Linux. Another impact might be on VFIO, which also leverages get_user_pages() to pin guest memory. If an assigned device is to be supported, we can imagine some new VFIO interfaces will be necessary to perform the DMA mapping without any HVA information. For now, the discussion of this fd-based solution is still ongoing in the community.

Okay, a short summary of this session. Unmapping guest memory is not only possible, but also desirable. Enhanced security can be guaranteed with guest cooperation. And the one-to-one association of a guest page and the host one is doable, even inside KVM, but with cost. So if you are interested in this topic, please feel free to join the discussion. Let's make the cost affordable.

In the end, I'd like to say thank you to all the people who have been working on this and who joined the discussions. Since the discussion is still in progress, some names may be missed here, but anyway, I would like to say thank you all. Also, thank you everyone for joining this session. Thank you.