Hello everyone, this is Yu Zhong from the Intel Virtualization Team, and my topic today is a virtual IOMMU with cooperative DMA buffer tracking. Okay, here is today's agenda. First, I'd like to talk about the background of static pinning in direct I/O, the problems of static pinning and of the vIOMMU, and why we need DMA tracking. Next, I will introduce the concept of a virtual IOMMU with cooperative DMA buffer tracking for direct I/O. And in the end, I'd like to discuss the upstream considerations of coIOMMU.

Okay, as you may know, direct I/O is the best-performing I/O virtualization method, widely deployed in clouds and data centers. By assigning a hardware device to the virtual machine, the guest can perform DMA operations directly without the need for host intervention. At the host side, the hypervisor programs the hardware IOMMU page table to provide inter-guest protection. However, direct I/O faces the problem of static pinning, for two reasons. First, most devices do not support DMA page faults. That means a DMA buffer needs to be pinned in the hardware IOMMU before the DMA operation; by pinned, we mean the DMA buffer needs to be pre-allocated and mapped in the IOMMU page table. Second, since the hypervisor has no visibility into the guest's DMA activities, it has to assume that every guest page could be used as a DMA buffer. Therefore, the hypervisor has to pre-allocate and map the entire guest memory in the hardware IOMMU when the VM is created, and all the guest memory stays pinned for the whole lifecycle of the VM.

Well, the problem with static pinning is quite obvious: we have to tolerate much increased VM creation time and also greatly reduced memory utilization. For example, advanced features like page migration, memory overcommitment, on-demand allocation and swapping are not possible with static pinning.

Well, one possible solution to static pinning is to expose a virtual IOMMU to the guest. The primary purpose of the virtual IOMMU is to provide intra-guest protection with virtual DMA remapping. As a side effect, fine-grained pinning becomes possible with a vIOMMU: when DMA remapping is enabled in the guest, the guest uses the vIOMMU to map and unmap DMA buffers, and all these requests are forwarded to the hypervisor to dynamically pin and unpin guest DMA buffers. In such cases, with insight into the guest's DMA activities, static pinning is no longer necessary. Also, like virtual devices, a vIOMMU could be an emulated one or a para-virtualized one.

However, the vIOMMU has its problems. The emulation cost of current vIOMMUs can be significant; for example, we have observed a performance degradation of more than 96% in the guest when DMA remapping is enabled through the vIOMMU. As a result, virtual DMA remapping is not used for most guest devices. For example, a VM can be created with no virtual IOMMU at all, or, even with a vIOMMU exposed, the guest can choose to run it in pass-through mode, which disables DMA remapping as well. And guest security requirements vary: for example, DMA remapping is needed when an untrusted device is plugged in, or when a device is assigned to a user-space driver. So although the vIOMMU provides an architectural way of tracking guest DMA buffers, it is not a reliable solution for achieving fine-grained pinning. Meanwhile, we argue that serving both protection and pinning through the same costly DMA remapping interface is needlessly constraining, because intra-guest protection is an optional guest-side requirement,
whereas fine-grained pinning is a general host-side requirement for efficient memory management. The host needs the capability to efficiently track guest DMA buffers. So how about we decouple the DMA tracking and DMA remapping interfaces in the vIOMMU? That means we want a separate DMA buffer tracking mechanism in the vIOMMU that does not rely on any semantics of DMA remapping. And if this mechanism is efficient enough, we may expect most guests to always enable fine-grained pinning.

For this DMA buffer tracking mechanism, we expect it to be orthogonal to DMA remapping: enabling DMA buffer tracking shall not affect the desired protection of DMA remapping. It should incur only negligible cost: the performance expectations under different protection policies shall be sustained. It should be non-intrusive: we try to minimize the changes to the guest software stack. It should be widely applicable: it shall work with all kinds of I/O devices and shall be easily ported to different vIOMMU implementations. And it should be extensible, so it can help address other limitations in memory management; for example, it could help track DMA pages during VM migration.

Here, we propose cooperative DMA buffer tracking as a para-virtualized interface. By cooperative, we mean bi-directionally sharing DMA buffer information between the guest and the host. The fundamental information shared is the pinned status and the mapped status of each guest page. The host tells the guest whether a page is already pinned; by pinned, I mean the page is allocated by the host and mapped in the hardware IOMMU. And the guest tells the host whether a page is mapped by its DMA API. With this information, we can minimize the number of VM exits when the guest maps pages through its DMA API: page pinning requests are only needed for guest pages that are not pinned yet. We can also eliminate VM exits entirely when the guest unmaps DMA pages. Also, our solution enables flexible host memory management policies; for example, the host can unpin any guest pages which are no longer DMA-mapped. We name the vIOMMU with such a design coIOMMU: a virtual IOMMU with cooperative DMA buffer tracking for direct I/O.

Okay, this slide is about the architecture of coIOMMU. First, we introduce the DMA tracking table, the DTT, to hold the shared DMA buffer information, for example the pinned status and the mapped status of each guest page. In the guest, the coIOMMU driver is hooked into the guest DMA API layer; actually, it is just a virtual IOMMU driver with our PV extensions. This driver intercepts the DMA API operations in the guest and updates the DTT accordingly. For pages not pinned yet, the driver sends page pinning requests to the backend. At the host side, the coIOMMU backend handles the page pinning requests and updates the DTT, setting the pinned status for the GFN. Also, the host coIOMMU backend asynchronously unpins pages based on the mapped status in the DTT.
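To make the shared state and the guest-side fast path concrete, here is a minimal C sketch. All names, field widths, and the flat array standing in for the hierarchical, GFN-indexed DTT are my own illustrative assumptions, not the actual coIOMMU code; in reality the pin request is a trap to the host backend rather than a local function call.

```c
/* Illustrative sketch only: assumed names and layout, not the real coIOMMU code. */
#include <stdint.h>
#include <stdio.h>

struct dtt_unit {
    uint8_t mapped : 1;   /* set/cleared by the guest DMA API hooks   */
    uint8_t pinned : 1;   /* set/cleared by the host coIOMMU backend  */
    uint8_t reserved : 6; /* room for additional flags                */
};

#define NR_GFNS 1024
static struct dtt_unit dtt[NR_GFNS]; /* flat stand-in for the shared DTT */

/* Stand-in for the trap to the host backend (a VM exit in practice). */
static void pin_request(uint64_t gfn)
{
    printf("pin request for GFN %llu\n", (unsigned long long)gfn);
    dtt[gfn].pinned = 1;  /* host marks the page pinned once it is mapped */
}

/* Guest-side hook called for each GFN when a DMA buffer is mapped. */
static void coiommu_track_map(uint64_t gfn)
{
    dtt[gfn].mapped = 1;      /* tell the host the page is in use          */
    if (!dtt[gfn].pinned)     /* exit to the host only if not yet pinned   */
        pin_request(gfn);
}

int main(void)
{
    coiommu_track_map(42);    /* first map: triggers a pin request         */
    coiommu_track_map(42);    /* re-map: stays on the in-guest fast path   */
    return 0;
}
```

Because the pinned bit survives across reuses of the same buffer, repeated maps of a hot DMA buffer stay entirely on the in-guest fast path.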
Okay, let's take a look at the organization of the DTT, the DMA tracking table. As you can see, it is a hierarchical paging structure shared between the host and the guest, indexed by GFN. For each GFN there is a tracking unit, and each tracking unit holds the DMA buffer information. The mapped status indicates if a page is currently mapped by the guest DMA API; it is set and cleared by the guest DMA operations. The pinned flag indicates if a page is already pinned by the host; it is set by the host after a page is pinned and cleared when the page is unpinned later. The accessed flag indicates if a page has been used for DMA recently; it is set by the guest DMA mapping operations and cleared by the host periodically. Also, there are five reserved bits for extensions; for example, we could add a dirty flag to assist dirty page tracking in live migration.

Okay, let's take a look at the guest DMA mapping operations. For a DMA map, the guest coIOMMU driver sets the mapped and accessed flags in the DTT entry for each target GFN. Meanwhile, it checks the pinned status of the DMA page, and a pinning request is necessary only when the pinned flag is zero. The good news is that we found more than 99% of pinning requests can be avoided thanks to DMA buffer locality: it is very likely that a recently used DMA buffer will be reused in the following DMA operations. And when the guest unmaps a DMA page, the coIOMMU driver just clears the mapped flag in the DTT entry for the target GFN, so there is no VM exit at all. At the host side, the coIOMMU backend performs lazy unpinning in a separate thread. It periodically checks the mapped status of each pinned page. For each pinned page whose mapped flag is zero, it then checks the accessed flag, which indicates whether the page has been used for guest DMA recently. If the accessed flag is also zero, we can go ahead and unpin it, and any unpinned page can be considered reclaimable by the host. At the end of each pass, the accessed flag is cleared regardless of its previous value.

Okay, DMA tracking versus DMA remapping. We know that in the majority of cases, DMA remapping is not used by the guest, and DMA tracking with coIOMMU can be an efficient solution to achieve fine-grained pinning. However, sometimes intra-guest protection may be needed. For example, DMA remapping can be conditionally enabled for some untrusted devices; in the current implementation, the hypervisor must fall back to static pinning as long as there are other assigned devices which are not using DMA remapping. Another example is that DMA remapping can be enabled when a device is assigned to a guest user-space driver, and later disabled when the device is returned to the kernel driver. In the current implementation, that means switching between static pinning and fine-grained pinning, which leads to increased overhead because the entire guest memory has to be unpinned and re-pinned during such a switch. In such cases, the DMA tracking offered by coIOMMU can provide a reliable solution for fine-grained pinning. So what if DMA remapping is always enabled in the guest, I mean, for all devices at all times? Well, DMA tracking in that scenario is only optional, but our evaluations show that even with DMA tracking enabled, the performance overhead is negligible.

Here is the implementation. Our previous POC was done by extending the existing virtual VT-d. As you can see, there are no ad hoc changes in any guest device driver, and we believe this concept can be applied to both emulated vIOMMUs and para-virtualized ones. As to the upstream plan, we'd like to implement it in virtio-iommu; for virtio-iommu, we may need a new group of interfaces. As to the other logic, such as guest DMA buffer tracking in the DTT and the host lazy unpinning, we believe the same code can be reused across different vIOMMUs.
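As a concrete picture of that reusable host-side logic, here is a rough sketch of one lazy unpinning pass as just described: skip pages that are not pinned, unpin pages that are neither mapped nor recently accessed, and clear the accessed flag at the end of each pass. The names and the flat table are illustrative assumptions; the real backend walks the hierarchical DTT from a dedicated thread and must synchronize with concurrent guest updates.

```c
/* Illustrative sketch of one lazy-unpinning pass; assumed names and layout. */
#include <stdint.h>

struct dtt_unit {
    uint8_t mapped : 1;
    uint8_t pinned : 1;
    uint8_t accessed : 1;
    uint8_t reserved : 5;
};

#define NR_GFNS 1024
static struct dtt_unit dtt[NR_GFNS];

/* Stand-in for the real unpin path; the page becomes reclaimable by the host. */
static void host_unpin(uint64_t gfn)
{
    dtt[gfn].pinned = 0;
}

/* Called periodically from the backend's unpinning thread. */
static void lazy_unpin_pass(void)
{
    for (uint64_t gfn = 0; gfn < NR_GFNS; gfn++) {
        struct dtt_unit *u = &dtt[gfn];

        if (!u->pinned)
            continue;
        /* Unpin only pages that are neither currently mapped nor were
         * touched by a DMA map since the previous pass. */
        if (!u->mapped && !u->accessed)
            host_unpin(gfn);

        u->accessed = 0;  /* age the entry for the next pass */
    }
}

int main(void)
{
    dtt[7].pinned = 1;
    dtt[7].accessed = 1;  /* pretend the guest used GFN 7 for DMA recently */
    lazy_unpin_pass();    /* pass 1: recently accessed, so it stays pinned */
    lazy_unpin_pass();    /* pass 2: idle and unmapped, so it is unpinned  */
    return 0;
}
```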
Okay, I'd like to talk about our upstream proposals and opens. First, we would like to propose a new group of interfaces in virtio-iommu. For example, feature negotiation for the capability of fine-grained pinning, I mean, the DMA buffer tracking. Also, the base address of a device bitmap; this bitmap is indexed by the BDF of guest devices and indicates whether a device is an assigned one, because our DMA tracking shall only be applied to assigned devices; emulated devices are not our concern. Also, the base address of the DTT, the DMA tracking table. Meanwhile, we definitely need a page pinning request in virtio-iommu.

About the host lazy unpinning: we created a separate QEMU thread, which wakes up periodically to perform the lazy unpinning, and the unpinning interval can be manually configured on the QEMU command line. I suppose an adaptive interval may be more desirable. Moreover, you may notice that the current unpinning policy is LRU-based, so maybe we can examine more policies in the future.

Guest cooperation limitations: well, this is a headache. When the host creates a VM, it has no idea whether coIOMMU will be enabled by the guest. The same issue exists with current vIOMMUs; the current solution is to pre-pin the entire guest memory first, and perform the unpinning during the address space switching, when the vIOMMU is enabled later in the guest. Also, the guest BIOS may use direct I/O; even if the guest BIOS had a virtual IOMMU driver, and even if this driver could be para-virtualized, we would still face the same problem mentioned above. So I guess in the short term we will have to follow the same solution as the vIOMMU, I mean, pre-pinning the guest memory and unpinning it when the guest enables coIOMMU. But in some scenarios where the guest kernel is controlled by the cloud provider and there is no virtual BIOS in the VM at all, things would be much simpler. Moreover, it is possible that a selfish guest may deliberately report fake DMA pages, so in the future we may choose to build a quota mechanism.

About huge page mapping: in our implementation, the DTT tracks guest pages only at 4KB granularity. Of course, in QEMU the backend can choose to conduct huge-page pinning by merging contiguous guest pages; however, we realized that such an optimization would complicate the lazy unpinning logic. Fortunately, our evaluations show that most guest DMA workloads issue frequent mapping operations only on many scattered 4KB pages. The one exception is the GPU workload: without huge-page pinning, we observed about a 4% performance drop in OpenArena due to IOTLB miss penalties.

Another issue is sub-page mapping. We realized that multiple DMA buffers may be co-located in the same 4KB guest page, for example small network packets. That implies that one guest page can be mapped and unmapped multiple times concurrently. So, in our implementation, the DTT entry tracks the guest DMA mapping count for each guest page, and only when the mapping count of a guest page reaches zero will the mapped flag be cleared for that page.
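A minimal sketch of that sub-page reference counting, under assumed names and field sizes (the real DTT entry layout may differ): the map path bumps a per-page count and sets the flags, and the unmap path clears the mapped flag only when the last co-located mapping in the page goes away.

```c
/* Illustrative sketch of per-page mapping counts for sub-page co-location. */
#include <assert.h>
#include <stdint.h>

struct dtt_unit {
    uint16_t map_count;   /* live DMA mappings co-located in this 4KB page */
    uint8_t  mapped : 1;
    uint8_t  pinned : 1;
    uint8_t  accessed : 1;
};

#define NR_GFNS 1024
static struct dtt_unit dtt[NR_GFNS];

/* Guest DMA map of a (possibly sub-page) buffer inside this GFN. */
static void coiommu_map_one(uint64_t gfn)
{
    dtt[gfn].map_count++;
    dtt[gfn].mapped = 1;
    dtt[gfn].accessed = 1;
}

/* Guest DMA unmap: clear the mapped flag only for the last mapping. */
static void coiommu_unmap_one(uint64_t gfn)
{
    assert(dtt[gfn].map_count > 0);
    if (--dtt[gfn].map_count == 0)
        dtt[gfn].mapped = 0;
}

int main(void)
{
    coiommu_map_one(3);
    coiommu_map_one(3);    /* two small buffers share GFN 3              */
    coiommu_unmap_one(3);  /* one buffer still live: mapped stays set    */
    coiommu_unmap_one(3);  /* last mapping gone: mapped flag is cleared  */
    return 0;
}
```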
Well, the last open is: what if the assigned device is SVA-capable? I mean, for SVA workloads, on-demand pinning is already possible; it can be done through IOMMU page faults. So what is the value of DMA buffer tracking for such devices? Well, the reason is that a typical SVA-capable device has to support mixed workloads, both SVA and non-SVA, and it may also access global configuration data structures. All of these cannot tolerate page faults, so a cooperative DMA buffer tracking could still be desirable.

So, here is the performance evaluation based on our previous POC. We chose a wide range of benchmarks to evaluate the performance and memory footprint on several devices, such as a 40Gb NIC, an NVMe SSD, and an Intel GPU. All the benchmarks show close to 100% of the performance of direct I/O without vIOMMU DMA remapping. Meanwhile, we found that far fewer guest pages are pinned: for example, the maximum number of pinned pages we observed is only about 1% of the entire guest memory, which is 32 gigabytes in our configuration. Also, if we do not pre-pin the guest memory, we benefit from a much reduced VM creation time. For the detailed environment configurations and performance data, please see our USENIX paper.

So, in summary, current vIOMMUs cannot reliably eliminate static pinning in direct I/O. coIOMMU offers a reliable approach to achieve fine-grained pinning with a cooperative DMA buffer tracking mechanism. It dramatically improves the efficiency of memory management with negligible cost; meanwhile, it sustains the desired security properties under different protection usages, and it can easily be applied to various vIOMMU implementations. Of course, as previously mentioned, coIOMMU is not perfect, so any comments or suggestions are welcome.

Okay, that's all for this presentation. Please feel free to raise your comments, and you can always mail us with any questions about coIOMMU. That's all. Thank you, everyone.