Thank you for joining today's presentation. My name is Wei Huang. Today, Suravee and I are going to talk about a new hardware-assisted vIOMMU technology offered by AMD.

Here is the agenda for today's presentation. At the beginning, I will give a brief overview of the AMD IOMMU and how DMA remapping works in this hardware. For vIOMMU support inside a guest VM, there are two types of vIOMMU. The first one is the QEMU-emulated vIOMMU, also known as the software vIOMMU. The second one, which is the focus of today's presentation, is the hardware-assisted AMD vIOMMU. In the rest of the presentation, Suravee will talk in detail about the hardware changes for the AMD vIOMMU and also the software changes we are proposing in the different system software components. At the end, we will wrap up with a summary and open it up for discussion.

Okay, so this slide shows the AMD IOMMU design. As you can see on the right-hand side of the diagram, the hardware uses a number of different data structures for IOMMU management. The main registers live inside two 4KB MMIO regions. The first 4KB contains, for example, the base and config registers; those registers point to the starting addresses of the data structures. The second 4KB MMIO region contains head and tail registers that point inside those data structures, for example the head and tail registers for the command buffer and the PPR log.

The main functionality of the IOMMU is DMA remapping. The AMD IOMMU supports two types of IO page table. The first one is the host IO page table, which is currently used by the Linux DMA API and by VFIO. We call this the v1 page table, and it translates from GPA to SPA. The second page table is a little bit different: it is called the v2 table and it supports the x86 CPU page table format. This page table is currently used by the Linux KFD driver.

As we know, the IOMMU offers a lot of benefits: you can do device pass-through, and you can enhance IO security through DMA isolation. So over time, users and customers have always wanted to enable this feature for their guest VMs. To support it, over the past several years the community has come up with different solutions based on different device models. For example, we have Intel IOMMU support in QEMU, and similarly for the Arm SMMU we have the same capability. For the AMD IOMMU, we do support emulated devices behind it, but support for VFIO PCI pass-through is still a work in progress; we recently submitted a series of patches to enable this feature. Using it is very simple: on the QEMU command line, you just specify the AMD IOMMU device, add the VFIO PCI host device, and attach it to that AMD IOMMU, and that should be it (a sketch of such a command line follows below).

Pass-through devices inside a guest VM can actually be classified into two categories, and in the next slide I will talk about why we differentiate them from each other. The first category is emulated PCI devices, for example an Intel e1000 NIC inside your guest VM. The second is PCI pass-through devices, like VFIO pass-through devices; for example, you want to pass a 40-gig NIC through into your guest VM and have that device managed by the AMD IOMMU driver inside the guest. Both kinds of device are supported for DMA remapping and interrupt remapping. Next one. But the implementation of DMA remapping is actually a little bit different between them. As you can see on the right-hand side, for an emulated PCI device, all of its DMA will be managed by the emulated IOMMU.
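As a concrete illustration of the command line just mentioned, here is a minimal sketch, assuming the recently posted VFIO pass-through patches for QEMU's amd-iommu device are applied. The PCI address 0000:41:00.0 and the disk image are placeholders, and option names may differ slightly in the final series:

```
qemu-system-x86_64 -machine q35,accel=kvm -m 4G \
    -device amd-iommu \
    -device vfio-pci,host=0000:41:00.0 \
    -drive file=guest.img,if=virtio
```

The guest then sees an AMD IOMMU with the passed-through NIC behind it, and the guest's own AMD IOMMU driver manages that device.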
Coming back to the two DMA-remapping paths: the emulated IOMMU maintains an emulated host table that translates from guest IOVA to GPA. But a PCI pass-through device is actually managed by the host IOMMU, the real IOMMU hardware, so the software needs to create a shadow host table that translates from guest IOVA to SPA. Obviously, maintaining this shadow host table creates a performance hit: any time the guest IOMMU driver updates its page table, we need to reflect that change in the shadow host table used by the real hardware. There are additional performance hits as well. For example, we need to emulate the guest IOMMU driver's accesses to its MMIO registers, and on top of that, IOMMU command processing and event log and PPR log accesses also need to be emulated. Putting those things together, you can imagine that the performance for PCI pass-through is not great. Next one.

So to solve this problem, AMD came up with a new hardware-based approach called hardware vIOMMU. This new hardware feature tries to solve the problems we just mentioned on the last slide. For example, instead of the shadow host table, we want to use a nested IO page table. For MMIO register accesses, we want to allow the guest to access those registers directly. In the meantime, the new hardware feature creates a private mapping that allows the hardware to access guest IOMMU commands directly. These new hardware features will solve the problems we just mentioned and will benefit end users who want to pass native devices through into a guest VM. There is ongoing development effort in different areas of the system software components, which Suravee will cover in detail later on.

I want to go back a little bit to nested IO page table support. This nested IO page table support is different from the shadow host page table support I just mentioned on the last two slides. In this new design, the guest IOMMU driver will use the v2 table, the guest table, which translates from guest IOVA to GPA, and the host IOMMU will use a host table that translates from GPA to SPA. You can imagine that this nested IO page table is very similar to nested paging on the CPU side (a small conceptual sketch of this two-stage walk follows below), so we expect it will be able to remove some of the performance hits we mentioned on the previous slides.

Now, putting these two designs together, we create a hybrid system model, depending on what kind of pass-through device you are going to use. If you want to use an emulated PCI device, we go back to the current software-based approach, the emulated IOMMU, where DMA goes through the emulated host table, as shown in the top part on the right-hand side. If an end user wants to pass through a PCI device, then they will use the nested IO page table, and that requires adding a new AMD vIOMMU device model.

So far, I have just given you a brief introduction to the AMD IOMMU and the hardware vIOMMU. For the rest of the presentation, Suravee will talk about the detailed hardware design and the changes proposed in the different system software components. Suravee?

Hi. In the next session, I will start discussing the details of the changes for the hardware-assisted vIOMMU feature. Starting with the hardware changes, we are introducing the IOMMU private address space. This address space is used by the IOMMU hardware to access per-guest IOMMU data structures. On the left-hand side there is a diagram showing the layout of the private address space.
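Before walking through that layout, here is the conceptual sketch of the nested IO page table walk referenced a moment ago. This is not real driver or hardware code; the helper functions are made up purely to illustrate the two stages Wei described (guest v2 table: IOVA to GPA, host v1 table: GPA to SPA).

```c
#include <stdint.h>

struct amd_io_pgtable;  /* opaque here; the real layout is defined by the hardware */

/* Hypothetical walk helpers, standing in for the IOMMU's page-table walker. */
uint64_t guest_v2_translate(struct amd_io_pgtable *guest_tbl, uint64_t iova); /* IOVA -> GPA */
uint64_t host_v1_translate(struct amd_io_pgtable *host_tbl, uint64_t gpa);    /* GPA  -> SPA */

/*
 * With nested IO page tables, a DMA address from a pass-through device is
 * translated in two stages by the IOMMU itself, instead of going through a
 * software-maintained shadow IOVA -> SPA table.
 */
static inline uint64_t nested_translate(struct amd_io_pgtable *guest_tbl,
                                        struct amd_io_pgtable *host_tbl,
                                        uint64_t iova)
{
        uint64_t gpa = guest_v2_translate(guest_tbl, iova); /* stage 1: guest v2 table */
        return host_v1_translate(host_tbl, gpa);            /* stage 2: host v1 table  */
}
```

Because the hardware combines the two stages itself, the hypervisor no longer has to intercept every guest page-table update just to keep a shadow IOVA-to-SPA table in sync.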
Coming back to the private address space layout: starting from the top, we have the event and PPR logs, the command buffer, and the guest MMIO registers at the bottom. The private address space is also used to access the vIOMMU-specific data structures: the domain ID mapping table, which maps host domain IDs to guest domain IDs, and the device ID mapping table, which maps host device IDs to guest device IDs. This is for pass-through devices, the PCI pass-through devices that use VFIO. Another data structure is the command buffer dirty status table. All of the mapping here is done using an IOMMU v1 page table that maps private addresses to system physical addresses, and it can support up to 64K VMs.

The next hardware change is the introduction of the VF and VF-control MMIO BARs. These BARs are used mainly by the hypervisor to access the per-guest MMIO registers. As Wei mentioned earlier, we have two sets of MMIO registers: one is used for control, and the other one holds the head and tail pointers of the data structures inside the IOMMU. The control set is accessed via the VF-control MMIO BAR, and the pointer set is accessed via the VF MMIO BAR. The diagram on the right-hand side shows the breakdown of the regions of the BAR: you can see that it is split into 4K regions indexed by the guest ID, and the same applies to the VF-control MMIO BAR (a small sketch of this per-guest indexing follows at the end of this part).

The next hardware change is the introduction of new IOMMU commands and events. For events: typically, when the IOMMU encounters an error, it generates event log entries in the event buffer. When we have a guest IOMMU, errors that happen inside the guest will actually be logged by the host IOMMU. So, in order to identify that an event belongs to a guest, we are introducing new bit fields for the existing IOMMU events, which are listed here. The host IOMMU driver then processes the event log, and if it wants to inject an event into the guest, it can do so using the "insert guest event" command, which places the event into the guest's event log. Another scenario is when we run into errors related specifically to the vIOMMU: the hardware will log a vIOMMU hardware error event, and usually this causes the guest to fail. When we try to re-initialize and recover from that failure, there is a command called "reset vMMIO" that resets the guest MMIO registers.

Next are the changes to the host IOMMU driver, starting with boot-time initialization. First, we add logic to detect and enable the vIOMMU feature in the hardware. Then we set up the IOMMU v1 page table for the private address mapping. This first part also requires allocating and mapping the tables listed here: the domain ID table, the device ID table, and the command buffer dirty bit status table; those are shown on the right-hand side under boot time. Then we have the per-VM initialization code that runs when we launch a VM. Basically, it needs to create the private-address-to-SPA mappings, which are the black arrows shown here, for the event and PPR logs, the command buffer, and the guest MMIO registers. We also need to set up the host-to-guest mappings for the device IDs and domain IDs, and we need to set up trapping for when the guest tries to access the first 4K of its MMIO region, which is the control region of the MMIO. Finally, we need to add code to support the new IOMMU commands and events.
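Here is the small sketch of the per-guest indexing mentioned above. The 4KB-per-guest stride comes from the slide; the BAR base addresses and the function names are illustrative only, not taken from the hardware specification.

```c
#include <stdint.h>

/* One 4KB window per guest in the VF and VF-control MMIO BARs, per the talk. */
#define VIOMMU_GUEST_WINDOW_SIZE 0x1000ULL

/*
 * Illustrative helpers (names made up): locate guest N's window inside the
 * VF MMIO BAR (head/tail pointer registers) and the VF-control MMIO BAR
 * (control registers). The hypervisor uses these windows to reach the
 * per-guest MMIO registers.
 */
static inline uint64_t viommu_vf_window(uint64_t vf_bar_base, uint32_t guest_id)
{
        return vf_bar_base + (uint64_t)guest_id * VIOMMU_GUEST_WINDOW_SIZE;
}

static inline uint64_t viommu_vfctrl_window(uint64_t vfctrl_bar_base, uint32_t guest_id)
{
        return vfctrl_bar_base + (uint64_t)guest_id * VIOMMU_GUEST_WINDOW_SIZE;
}
```

As described earlier in the talk, only the guest's accesses to its first 4KB control page need to be trapped, while the head and tail pointer registers can be accessed directly by the guest.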
The next part is the changes to QEMU. Basically, the change is introducing a new hardware vIOMMU device model, which is different from the current software-emulated vIOMMU model. The right-hand side shows an example of a guest VM where we pass through three NIC devices: NIC 1 and NIC 0 are behind IOMMU 0, and NIC 4 is actually behind IOMMU 1. So when we pass these three NICs through into the same guest, we actually need to create two vIOMMU instances and associate the devices accordingly, because the hardware has to be able to virtualize the command buffers and all the data structures belonging to the IOMMU that each NIC is associated with.

This requires changes to several things in QEMU. First, we need to be able to specify more than one IOMMU in a guest; currently, with the emulated IOMMU, we can only support one. We also need to be able to specify the PCI topology for the VM and the association of each device with its vIOMMU. For example, at the bottom here is the QEMU command to launch the guest with two vIOMMUs: the red one is for IOMMU 0 and the green one is for IOMMU 1. To be able to support that, we also introduce a new device file and ioctl interface. The new device file is called AMD vIOMMU and is implemented by the host IOMMU driver. The new vIOMMU-specific ioctls listed here support all the operations required for setting up VMs and device domains, and all of the MMIO accesses that need to be trapped.

The last part is the changes to the guest IOMMU driver. As Wei mentioned, we are now making use of the nested IO page table, which requires the guest to use the v2 table for the guest IOVA-to-GPA translation, because the v1 table is already being used by VFIO for the GPA-to-SPA translation. The problem is that the IOMMU driver currently only supports the v1 table for the DMA API, so we need to make some changes here for the guest driver. The changes are split into two parts. The first is to refactor the current code to use the generic IO page table framework that has been introduced and is currently used by Arm; that code has already been submitted upstream for review. The next part is adding support for the v2 page table for the DMA API, also using the same generic IO page table framework; this is work in progress.

To summarize: so far we have been comparing the software versus the hardware vIOMMU and trying to show how the hardware vIOMMU should help improve guest IOMMU performance for pass-through devices. Instead of using a shadow host page table, we are now going to use the nested IO page table. Guests can now access the MMIO registers directly instead of having them emulated by the hypervisor. And the last part is that all of the emulation previously needed for the command buffer, event buffer, and PPR buffer is now accelerated by the hardware.

To finish, we would like to start some discussion with the audience around a few topics. First, we would like to get some feedback on the new hardware vIOMMU device model we are proposing. The second topic is the hybrid vIOMMU system model, where we have both software and hardware vIOMMUs in the same guest: how will that scale, and how should that be included in QEMU? There are pros and cons, and we would like to get feedback on that. The next topic is the ioctl interface design, because we are actually using a brand-new ioctl interface, and there has been some discussion about whether we should extend the existing VFIO ioctl interface instead.
And for the last part, we would like to get some feedback on additional usage models based on the proposed design that we have presented. Thank you.