Hello everyone, my name is Zhao Yan from Intel Corporation. Today I'd like to share with you about sharing IOMMU page tables with TDP in KVM. This topic was prepared by Lu Baolu, Kevin Tian, and me together.

This is the agenda for the topic. First, we'll state the goal of sharing, its advantages, prerequisites, and the interfaces required. Then we'll introduce an important concept, page and page table pinning, which is required for DMA on the IOMMU side when I/O page fault (IOPF) is not supported. As TDP root updates are frequent in KVM, we'll show how a root update is handled with sharing enabled. Finally, we'll show some rough boot-up performance data and our to-dos for the future.

This is the goal of sharing. Currently, the IOMMU side uses the I/O page table (IOPT) to translate from GPA to HPA when no vIOMMU is enabled, while on the CPU side the TDP is used to translate from GPA to HPA. The two sets of page tables are essentially duplicates. So our goal is to share the TDP on the CPU side so that it also serves as the IOPT on the IOMMU side, in order to gain the following advantages: a reduced memory footprint, unified page table management for dirty-page tracking, page fault handling and so on, and potentially higher performance by reducing unnecessary overhead.

Sharing has three prerequisites: the same address space, a compatible page table format, and non-conflicting page table content.

For the same address space: currently the only supported address space for sharing is L1 GPA to HPA. QEMU needs to check with KVM that TDP is enabled and ensure that the vCPU model does not include the EPT or NPT feature. At the same time, on the IOMMU side there should be no vIOMMU enabled; if there is a vIOMMU, it must not be in shadow mode, but nested mode is fine as long as the second-level page table holds the GPA-to-HPA mapping. For nested VMs, sharing is currently not supported, because we are not sure the two sides always use the same address space. SMM mode on the x86 platform uses a different address space that cannot be shared with the IOMMU, so when a vCPU enters SMM mode, its previous non-SMM-mode EPT must not be destroyed and must be kept for sharing to the IOMMU side.

For compatible page table formats: in order to check whether the two sides use compatible formats, we first need a unified definition of the page table format. A format can be named EPT level-4, meaning EPT is used for sharing with four-level page structures, or NPT level-5, meaning NPT is used for sharing with five-level page structures.

This is the handshake sequence for format negotiation. First, QEMU asks KVM for its currently shareable page table format; for example, KVM may return EPT level-4. QEMU then checks whether EPT level-4 is compatible with the page table format used on the IOMMU side. Take the Intel IOMMU for example: if it is currently using first-level page tables only, which are not compatible with the EPT format, it should return failure; but if it is configured to use second-level page tables and is also using four-level page structures, it can return success. QEMU can then allocate an IOMMU domain with format EPT level-4, and when devices are attached to this domain, the IOMMU can request KVM to share its TDP. KVM then shares the TDP used by vCPU 0 as the IOPT for the IOMMU. A minimal sketch of this negotiation is shown below.
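To make the handshake concrete, here is a minimal C sketch of the negotiation from QEMU's point of view, under stated assumptions: kvm_get_shared_tdp_format(), iommu_format_compatible() and iommu_alloc_domain_with_format() are hypothetical wrapper names used only for illustration, not real KVM or IOMMU APIs.

/*
 * Hypothetical page table formats and wrappers, for illustration only.
 */
enum shared_pt_format {
    PT_FORMAT_EPT_L4,   /* EPT with 4-level page structures */
    PT_FORMAT_EPT_L5,   /* EPT with 5-level page structures */
    PT_FORMAT_NPT_L4,   /* NPT with 4-level page structures */
    PT_FORMAT_NPT_L5,   /* NPT with 5-level page structures */
};

extern enum shared_pt_format kvm_get_shared_tdp_format(int kvm_vm_fd);
extern bool iommu_format_compatible(int iommu_fd, enum shared_pt_format fmt);
extern int iommu_alloc_domain_with_format(int iommu_fd, enum shared_pt_format fmt);

static int setup_shared_iopt(int kvm_vm_fd, int iommu_fd)
{
    /* Step 1: ask KVM which format its shareable TDP uses. */
    enum shared_pt_format fmt = kvm_get_shared_tdp_format(kvm_vm_fd);

    /* Step 2: check the format against what the IOMMU can walk, e.g.
     * VT-d second-level tables with 4-level paging for EPT level-4. */
    if (!iommu_format_compatible(iommu_fd, fmt))
        return -1;  /* fall back to a separate, non-shared IOPT */

    /* Step 3: allocate an IOMMU domain of that format; when a device
     * is attached, the domain asks KVM to share its TDP as the IOPT. */
    return iommu_alloc_domain_with_format(iommu_fd, fmt);
}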
For non-conflicting page table content, first let's look at page table entry presence. If the GFN of a page table entry belongs to a KVM user memory slot, the PTE must be present and pinned if its pages are used for DMA and IOPF is not supported; but if the GFN is not used for DMA, or if IOPF is supported, the PTE may or may not be present. If the GFN of a PTE belongs to a KVM private memory slot, the PTE is not present in the IOPT before sharing, but it is safe for it to be present in the IOPT after sharing. Currently there are only three private memory slots. The first is the local APIC private memory slot; DMA writes to that range do not go through DMA remapping. The TSS and identity-map page table private memory slots are only enabled when unrestricted guest is not available, and because those ranges are reserved in the guest E820, DMA reads and writes to them are as safe as vCPU accesses.

For the read, write and execute bits, the two sides should follow the same policy: read-only for GFN ranges in read-only memory slots, and read-write for GFNs in other memory slots. The vCPU side may also set the execute bit, but this bit is currently ignored by the IOMMU as no device uses it. As for write protection for live migration, write protection is allowed if IOPF is supported; if IOPF is not supported, write protection must be disabled, so for the dirty bit either all pinned ranges are reported dirty, or a traversal of the page tables is required to collect the dirty bits.

These are the interfaces required for sharing. After QEMU attaches a device to the IOMMU domain that is shared from the TDP, the IOMMU side first needs to call the request-sharing interface into KVM, specifying whether pinning is required and registering a notification callback. There also need to be pin and unpin interfaces for DMA without IOPF support; we'll discuss those interfaces later. The third interface is the page fault interface for when IOPF is supported: the IOMMU side needs to forward I/O page faults to KVM. And when KVM updates its TDP root or TDP content, it also needs to notify the IOMMU side so that the IOMMU can update its IOPT root and flush the IOTLB.

This is the concept of page and page table pinning. When there is no IOPF support on the IOMMU side, in order to avoid DMA faults, all pages used for DMA need to be pinned using the pin_user_pages() API with the FOLL_LONGTERM flag, and the TDP pages and TDP entries also need to be pinned. This means we need to pre-populate the TDP for the pinned DMA ranges, there must be no zap or PFN update for the pinned ranges, and page table pages that are linked from a parent entry cannot be reclaimed. So when there is a request to update a PTE's permissions, or the page size changes, the TDP entry must be updated in place, and the update must be atomic. Here is how the atomic update is done: when splitting a huge page or otherwise updating PTEs, we prepare the substituting PTEs first, and then the old TDP entry is updated atomically from one non-zero value to another non-zero value (see the sketch below). This is not how KVM works today, so a lot of code changes to KVM are needed. For the case where IOPF is not supported, page and page table pinning interfaces are therefore required.
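As a rough illustration of the atomic in-place update just described, here is a minimal kernel-style C sketch for splitting a pinned huge-page TDP entry. The helpers tdp_alloc_child_table(), tdp_free_child_table() and make_child_spte() are hypothetical names used only for illustration; cmpxchg64() is the real kernel primitive, and the usual kernel headers are assumed.

/* Hypothetical helpers, for illustration only. */
extern u64 *tdp_alloc_child_table(void);
extern void tdp_free_child_table(u64 *table);
extern u64 make_child_spte(u64 huge_spte, int index, int child_level);

#define PT64_ENTRIES	512	/* entries per TDP page table page */

static int split_pinned_huge_spte(u64 *sptep, u64 old_spte, int level)
{
	u64 *child;
	u64 new_spte;
	int i;

	/* Step 1: prepare the substituting PTEs first, so the new child
	 * table already maps the whole huge range before it is visible. */
	child = tdp_alloc_child_table();
	if (!child)
		return -ENOMEM;

	for (i = 0; i < PT64_ENTRIES; i++)
		child[i] = make_child_spte(old_spte, i, level - 1);

	/* Non-leaf EPT entry pointing to the child table, R/W/X set
	 * (simplified; real code would derive the bits from the old entry). */
	new_spte = __pa(child) | 0x7;

	/* Step 2: atomically switch the old entry from one non-zero
	 * (present, huge) value to another non-zero (present, non-leaf)
	 * value, so the entry is never cleared and DMA never faults. */
	if (cmpxchg64(sptep, old_spte, new_spte) != old_spte) {
		tdp_free_child_table(child);
		return -EAGAIN;		/* lost a race; caller retries */
	}

	return 0;
}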
If we want to pin all ranges in user memory slots, we can do that at memory slot add. But if we want to pin only a specific range for DMA, we need to introduce a separate pin and unpin interface. We can do it in two ways. The first way is to provide a UAPI in KVM, and QEMU calls pin and unpin for the DMA ranges. This way has the pro that QEMU does not need to call map and unmap on the IOMMU domain, which makes sense because the domain is backed by a page table owned by a third party, the KVM TDP. The second way is to provide a kernel-mode pin and unpin interface that the IOMMU calls into KVM. In this way QEMU still needs to call DMA map and unmap into the IOMMU, which is less clean given that the domain is backed by the third-party TDP; but it has the pro that it is straightforward to pass more IOMMU-side information through the pin and unpin API, for example whether to set the snoop bit in the TDP entries.

Next, we'll show how a TDP root update in KVM is handled when sharing is enabled. As the left-side diagram shows, if two vCPUs use the same role, they reference the same MMU root page; if a vCPU uses a different role, it references a different MMU root page. Because we share the TDP of vCPU 0 with the IOMMU side, we increase the root count of the MMU root page used by vCPU 0. The right-side diagram shows what happens when vCPU 0 wants to switch its MMU root. First, vCPU 0 calls kvm_mmu_unload() to decrease the root count of the old MMU root page. Then it calls kvm_mmu_load() to increase the root count of the new root for the new role and to update the EPTP in the VMCS. After that, it checks whether the new role is for SMM. If it is not an SMM-mode TDP, it is safe for sharing, so we increase its root count by one, pre-populate TDP entries for the pinned ranges, and call the root-update notification into the IOMMU side; the IOMMU then updates its IOPT root. After that, the root count of the old MMU root page is decreased, and the old MMU root page and its child pages can be safely destroyed. A rough sketch of this flow is included at the end.

Here is the boot-up performance. It is just rough data, without any optimization yet. In this implementation, all VM pages are pinned and unpinned at user memory slot creation and deletion, and the TDP is pre-populated when vCPU 0 switches to a new root, when memory slots are added, and when huge pages are split; we also do an IOTLB flush on the IOMMU side, and this step takes around one second. From the table, the baseline boot-up time without sharing is 29 seconds. With sharing, if huge pages are enabled during prepopulation, boot-up takes 32 seconds; the extra time is taken by the 132 rounds of prepopulation. The time is even longer if prepopulation is done with huge pages disabled. In concept, boot-up performance with sharing can match the performance before sharing by reducing the TDP root update count.

So here are our to-dos for the future. We want to figure out how the snoop bit can be set on the TDP side, and what a unified way to track dirty pages would be. We also want to support nested VMs and do some performance optimization, for example reducing the page table root update count and supporting huge pages during prepopulation, among other things. Thank you.
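As a rough illustration of the root-switch flow described above, here is a minimal C sketch. kvm_mmu_unload(), kvm_mmu_load() and is_smm() are real KVM functions; kvm_share_tdp_root(), kvm_drop_shared_root(), kvm_prepopulate_pinned_ranges() and iommu_notify_root_update() are hypothetical names used only for illustration.

/* Hypothetical helpers, for illustration only. */
extern void kvm_share_tdp_root(struct kvm *kvm, hpa_t root);
extern void kvm_drop_shared_root(struct kvm *kvm, hpa_t root);
extern void kvm_prepopulate_pinned_ranges(struct kvm *kvm);
extern void iommu_notify_root_update(struct kvm *kvm, hpa_t new_root);

static int switch_shared_tdp_root(struct kvm_vcpu *vcpu)
{
	hpa_t old_root = vcpu->arch.mmu->root.hpa;
	int r;

	kvm_mmu_unload(vcpu);	/* drop the vCPU's ref on the old root */
	r = kvm_mmu_load(vcpu);	/* ref the new root, write the new EPTP to VMCS */
	if (r)
		return r;

	/* SMM mode uses a different address space and is never shared. */
	if (!is_smm(vcpu)) {
		hpa_t new_root = vcpu->arch.mmu->root.hpa;

		/* Hold an extra ref for the IOMMU before it switches over. */
		kvm_share_tdp_root(vcpu->kvm, new_root);

		/* No IOPF support: pre-map the pinned DMA ranges. */
		kvm_prepopulate_pinned_ranges(vcpu->kvm);

		/* Let the IOMMU update its IOPT root and flush the IOTLB. */
		iommu_notify_root_update(vcpu->kvm, new_root);

		/* Drop the IOMMU's ref on the old root; it may now be freed. */
		kvm_drop_shared_root(vcpu->kvm, old_root);
	}

	return 0;
}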