Thank you for joining our presentation. This is Wei Huang. Today, Suravee and I are going to give you a talk on AMD hardware-assisted vIOMMU technology. This presentation is a continuation of last KVM Forum's talk. We're going to give you a quick recap of the hardware-based solution. After that, Suravee will talk about the software prototype details we put into the AMD IOMMU driver and the QEMU device model. This presentation will also include performance results we collected with high-performance IO devices, and we are going to compare those results with software-based solutions.

So before we jump into the details, this is a quick recap of AMD IOMMU technology. We all know that the IOMMU does DMA remapping for endpoint devices, and there are a lot of data structures that need to be programmed into the IOMMU hardware by the driver. One good example is the device table. In the device table, each entry describes the IOMMU properties for an endpoint device. For example, a device table entry contains a pointer to the IOMMU page table, and when a DMA request comes in, the IOMMU fetches the device table entry and walks that IOMMU page table in order to send the DMA to the right destination. All of these data structures are programmed through configuration registers in the IOMMU MMIO region. Other good examples of data structures include the command buffer and the event log, which are used for communication between software and hardware. Managing those data structures requires head and tail registers, which are also defined as MMIO registers.

I also want to point out that the AMD IOMMU supports two types of IOMMU page tables. The first one is called the host IOMMU page table, also called the v1 table. Its page table format is IOMMU-specific. It does the GPA-to-SPA IOMMU translation and is used for VFIO. The second one is called the v2 table, also called the guest IOMMU page table. Its format is x86-compatible, and it can be used for the DMA API. End users are allowed to use these two tables in different combinations. For example, you can use v1 for IOMMU device passthrough and v2 for the DMA API. You can also enable the DMA API when you pass a device through into a guest VM; that is the third case, the nested IOMMU page table case.

Over the years, we have seen a lot of requests from end users to enable an IOMMU inside a guest VM, either for guest IO protection, shared virtual memory, or other purposes. A lot of users use software-based virtual IOMMU solutions. We know that software-based solutions need to intercept and emulate IOMMU registers, which introduces extra overhead. On top of that, IO page table management is another layer of overhead that we need to take care of. In the example below, we have a 10 GbE NIC passed through to a VM, and there is a virtual IOMMU in the VM that manages this device. From the guest's point of view, the passthrough device's guest IOVA is translated to a guest PA, which is fine. But when you run this model on real hardware, the 10 GbE NIC is actually managed by the host IOMMU. So in this case, the host IOMMU IO page table needs to translate from guest IOVA to SPA. Obviously, this is very similar to shadow IO paging, and the problem with this approach is that any time the guest IO page table is updated, for example for a mapping, unmapping, or invalidation, you need to reconstruct the shadow table, and that can cause performance overhead.
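To make the shadow IO paging overhead concrete, here is a minimal conceptual sketch in C of what the hypervisor side has to do when it shadows a guest IO page table. The function and type names are illustrative only, not actual VFIO or AMD IOMMU driver code; the point is simply that every guest map or unmap forces a trap, a host-side page table update, and an IOTLB invalidation.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t iova_t, gpa_t, spa_t;

/* Illustrative hooks; the real driver interfaces look different. */
extern spa_t translate_gpa_to_spa(gpa_t gpa);                 /* stage-2 lookup */
extern void  host_iommu_map(iova_t iova, spa_t spa, size_t len);
extern void  host_iommu_unmap(iova_t iova, size_t len);
extern void  host_iommu_invalidate_iotlb(iova_t iova, size_t len);

/*
 * Called every time the emulated vIOMMU sees the guest map IOVA -> GPA.
 * The shadow (host v1) table must map IOVA -> SPA, so the hypervisor
 * resolves GPA -> SPA, reprograms the host IOMMU, and invalidates any
 * stale IOTLB entries.
 */
static void shadow_map(iova_t iova, gpa_t gpa, size_t len)
{
    spa_t spa = translate_gpa_to_spa(gpa);

    host_iommu_map(iova, spa, len);
    host_iommu_invalidate_iotlb(iova, len);
}

static void shadow_unmap(iova_t iova, size_t len)
{
    host_iommu_unmap(iova, len);
    host_iommu_invalidate_iotlb(iova, len);
}
```

Every guest-side update has to go through an exit into the hypervisor and a host IOMMU reprogramming step like this, which is where the overhead described above comes from.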
This slide shows a detailed example of a software-based vIOMMU solution called virtio-iommu. It is a software-based solution that is vendor-independent, so you can enable virtio-iommu for x86 or Arm guest VMs. The design of this software-based approach is actually very straightforward: it just leverages the virtio framework, using virtqueues for request and event communication. The device provides basic commands for IOMMU management, for example attaching or detaching a device to a domain, and mapping or unmapping a guest IOVA to a guest PA. We found this approach simple, straightforward, and actually fairly high-performance. In the later slides, we will show you some of the performance results we collected using virtio-iommu compared with our hardware-based solution.

In the last few slides, I talked about the problems and overhead of the software-based approach. To solve these problems, AMD came up with a hardware-assisted vIOMMU. This solution targets some of the performance issues of the software-based approach, and it is also required for DMA into secure guest VMs, such as SEV-SNP guests. The hardware solution includes several techniques to address the software-based performance issues. For example, guest IOMMU commands are interpreted by the hardware directly, and the head and tail registers of the command buffer and event log are handled by the hardware as well. I also want to point out that the hardware-assisted vIOMMU requires us to use nested IO page tables. On the right-hand side of the diagram, you can see that inside the guest VM, the guest IOMMU driver programs a v2 table that translates from guest IOVA to guest PA, and on the hardware IOMMU at the bottom, the host IOMMU driver programs a v1 table that translates from GPA to SPA. You can view it as a nested IO page table approach, and with that, we expect it to perform better than the shadow IO page table approach. So this is a quick recap of the hardware vIOMMU, and Suravee is going to talk about the software prototype details.
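To illustrate the nested translation this model relies on, here is a small conceptual sketch in C. It is not real driver or hardware code, and the walk functions are hypothetical stand-ins for the v2 (guest) and v1 (host) page table walkers; it only shows that every guest IOVA is resolved in two stages, guest IOVA to GPA through the guest-programmed v2 table, then GPA to SPA through the host-programmed v1 table.

```c
#include <stdint.h>

typedef uint64_t iova_t, gpa_t, spa_t;

/* Hypothetical walkers for the two table formats. */
extern gpa_t v2_table_walk(iova_t guest_iova);  /* guest IOVA -> GPA (guest-managed) */
extern spa_t v1_table_walk(gpa_t gpa);          /* GPA -> SPA       (host-managed)  */

/*
 * Nested translation as the IOMMU hardware effectively performs it:
 * both stages are walked in hardware, so the hypervisor no longer
 * needs to shadow the guest IO page table.  Note that in a real walk
 * every v2 table pointer is itself a GPA and must also be translated
 * through the v1 table, which is why a nested walk costs more than a
 * single-level one.
 */
static spa_t nested_translate(iova_t guest_iova)
{
    gpa_t gpa = v2_table_walk(guest_iova);

    return v1_table_walk(gpa);
}
```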
Hi, my name is Suravee, and in the next few sections we are going to discuss the software prototype of the AMD hardware-assisted vIOMMU, and also look at the performance numbers we got from this prototype. The prototype mainly has three parts: the QEMU changes, the host IOMMU driver changes, and the guest IOMMU driver changes.

Starting with the QEMU changes, we introduced a new device model specifically for the AMD vIOMMU; this is different from the software-emulated AMD vIOMMU that exists today. The reason is that we need to be able to specify the PCI topology between the vIOMMU and the passthrough devices. For example, on the right-hand side, we have a guest VM into which we pass through three different NICs: NIC-0, NIC-1, and NIC-4. As you can see, NIC-0 and NIC-1 are associated with IOMMU-0, and NIC-4 is associated with IOMMU-1. To support device passthrough with a vIOMMU in this case, we need to be able to correlate each NIC with its parent IOMMU. So we introduce two vIOMMUs in the guest, vIOMMU-0 and vIOMMU-1, and we specify the relationship using the QEMU command line, as shown below, so that NIC-0 and NIC-1 are tied to vIOMMU-0 and NIC-4 is tied to vIOMMU-1. QEMU uses this information to generate the IVRS table that is consumed by the guest kernel to set up the vIOMMU devices in the guest.

The next part is the host IOMMU driver changes. Last year we presented the full architecture and implementation of the hardware-assisted vIOMMU, so this year we will just quickly recap the changes we made to the driver. At boot time, the driver first detects and enables the vIOMMU feature, and then it allocates the commonly used data structures. These are the domain ID table, which is used for mapping between host domain IDs and guest domain IDs; the device ID table, which is used to map host device IDs to guest device IDs; and the command buffer dirty status table, which holds the command buffer dirty status for each VM. Then, when we start a VM, the host IOMMU driver allocates per-VM data structures. As Wei mentioned, these are the structures used for communication between software and hardware: the guest MMIO registers, the guest command buffer, the guest event log, and the guest PPR log, as shown on the right-hand side. The driver also creates the host-to-guest device ID and domain ID mappings (illustrated in the sketch below), registers traps for the MMIO regions that are needed for programming the vIOMMU hardware, and supports the new IOMMU commands and events.
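As a rough illustration of what the domain ID and device ID mapping tables described above are for, here is a small C sketch with hypothetical names and layouts rather than the real driver structures: when a command from the guest's command buffer is processed, the guest-visible IDs it carries have to be rewritten into the host IDs that the physical IOMMU actually understands.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_GUEST_DOM_IDS  (1 << 16)
#define MAX_GUEST_DEV_IDS  (1 << 16)

/* Hypothetical per-VM mapping tables, filled in when a device is
 * attached to the guest; index = guest ID, value = host ID. */
struct viommu_vm_ctx {
    uint16_t dom_id_map[MAX_GUEST_DOM_IDS];
    uint16_t dev_id_map[MAX_GUEST_DEV_IDS];
};

/*
 * A guest INVALIDATE_IOMMU_PAGES-style command carries a guest domain
 * ID; before the command can take effect on the physical IOMMU, that
 * ID must be translated to the corresponding host domain ID.
 */
static bool translate_guest_domid(const struct viommu_vm_ctx *vm,
                                  uint16_t guest_domid,
                                  uint16_t *host_domid)
{
    uint16_t h = vm->dom_id_map[guest_domid];

    if (h == 0)          /* assume 0 means "not mapped" in this sketch */
        return false;
    *host_domid = h;
    return true;
}
```

This kind of ID translation is what allows guest commands to be handled without full software emulation of the command buffer.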
Together with both the QEMU and host IOMMU driver changes, we introduce a new API along with a new AMD vIOMMU-specific device FS interface. The commands we have here cover the mapping of devices and domains, and there is also another command to let the host know when the guest command buffer has been set up, so that it can be programmed onto the hardware. The last part is the guest CR3 (GCR3) table update; the guest CR3 table is specific to the AMD IOMMU and is used for setting up the v2 page table.

As for the guest IOMMU driver changes, the main change has to do with supporting the nested IO page table. In this case, the guest driver uses the guest table, or v2 table, which is programmed with guest IOVA to guest PA translations, while the VFIO driver uses the v1 page table, or host table, to do GPA to SPA. This allows the IOMMU hardware to walk the nested page tables to get from guest IOVA to SPA in the end. So the change in the guest driver is specific to supporting the IOMMU v2 page table. We have submitted an RFC upstream, and some more work still needs to be done, but this is the code that we actually used for our prototype today.

So in the next section, let's look at some performance numbers based on the prototype we have just discussed. The goal of our experiments is to compare PCI device passthrough performance of the AMD hardware-assisted vIOMMU with other vIOMMU solutions, and also to try to understand where we can make improvements in our design. In our experiments, we have four different configurations. The first one is bare metal with no IOMMU; this helps establish the baseline for our performance comparison. Then, for the virtualization configurations, we have three setups: a guest VM with a passthrough device and no vIOMMU; a guest with a passthrough device and the hardware-assisted vIOMMU, using the changes we discussed in the prototype section; and a guest with a passthrough device and virtio-iommu, using the latest guest kernel with x86 virtio-iommu support.

The first benchmark we look at is FIO doing random reads and writes. We have two different devices here: a Samsung M.2 NVMe SSD, and an Intel U.2 NVMe SSD, which has slightly better performance. Both cases use the same NVMe driver, and we are using the default configuration; we have not done any performance fine-tuning in the driver or in the benchmark. Looking at the numbers in the table, except for the Intel write case, we are seeing slightly more performance degradation in the virtualization case with no vIOMMU. When you compare the virtualization configurations, no vIOMMU, hardware-assisted vIOMMU, and virtio-iommu, we see that with the hardware-assisted vIOMMU we actually get around 65 to 73% of baseline performance, where we get about 34 to 42% for virtio-iommu, except for the write case, where we only get about 220 and 127 MB/s for the vIOMMU cases. Looking across the board, it seems that we are actually hitting a performance bottleneck at around 230 MB/s for the hardware-assisted vIOMMU. We have not done further investigation on this, but I think it is a good data point to have for this benchmark.

For the next benchmark, we are using netperf running TCP_STREAM. The first device we are using is an Intel 10 Gigabit Ethernet NIC in a dual-port configuration with physical loopback, where one of the ports is used by the host running netperf and the other port is used by the guest running netserver.
In both the guest and the host, we are using the Linux ixgbe driver, again in the default configuration, with no fine-tuning and no optimization. Using the virtualization configuration with a passthrough NIC and no vIOMMU as the baseline, you can see the line rate, which is 9.4 gigabits per second. On the receive side, we are getting almost line rate for all three virtualization configurations: no vIOMMU, hardware vIOMMU, and virtio-iommu. On the send side, with the hardware vIOMMU we get about 83%, which is quite good, whereas virtio-iommu only gets around 24%. So I think this is a good data point showing that the hardware-assisted vIOMMU can actually improve on the software vIOMMU solution for network benchmarks as well.

We also tried a slightly higher-performance device: a 100 Gigabit Ethernet Mellanox NIC in the same configuration, dual-port with physical loopback. Here we are using a different driver, the mlx5_core driver, but the same setup with no fine-tuning. Looking at the baseline, virtualization with device passthrough and no vIOMMU, we are getting around 20 gigabits per second for send and receive, but with the hardware vIOMMU we are still only getting around 30 to 50% of that performance, and with virtio-iommu we are getting about 13% and 42%. Here we would expect higher throughput given the faster hardware, but when we compare the Intel and Mellanox results, we see that we are actually hitting roughly the same bottleneck in both cases. So we would like to understand this better, whether it is software-related or specific to the hardware-assisted vIOMMU implementation we designed.

One other investigation we did was to run the perf utility with kernel tracepoints in the guest, and also perf with the IOMMU performance counters on the host. What we found is that the guest IOMMU driver is actually doing a lot of IOMMU page invalidations, due to the mapping and unmapping of pages in the guest. What happens is that when you invalidate a page, the translation is dropped from the IOTLB, causing IOTLB misses for subsequent DMA accesses, and the IOMMU has to walk the page table again. In the case of the hardware vIOMMU, this table walk is nested, so it has slightly higher latency than a single-level page walk. Also, this observation about page invalidations is not actually specific to the hardware vIOMMU solution: even in the bare-metal case with the IOMMU enabled, this would be a bottleneck as well. The key takeaway is that if we can improve IOMMU page invalidation, it will help both the bare-metal case and the hardware vIOMMU case.

So in summary, the AMD hardware-assisted vIOMMU actually shows improved PCI passthrough IO performance for the guest IO protection case when compared to other software vIOMMU solutions. One of the improvements we would like to focus on is a better IOTLB invalidation scheme, to reduce the number of invalidations and the number of page table walks the hardware has to do. There is room for improvement here because, in the current AMD IOMMU driver implementation, there are two types of IOTLB invalidations: a single-page invalidation invalidates just that one page, but if we need to invalidate more than one page, everything is invalidated, so there are more pages that need a table walk afterward to re-establish their translations, as sketched below.
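Here is a minimal sketch in C of that invalidation policy, written from the description above rather than from the actual amd_iommu driver code, so the names and structure are illustrative: a single page gets a targeted invalidation, while any larger range falls back to flushing the whole domain.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

struct iommu_domain_ctx;   /* opaque handle for the protection domain */

/* Hypothetical helpers issuing the underlying invalidation commands. */
extern void invalidate_page(struct iommu_domain_ctx *dom, uint64_t iova);
extern void invalidate_all(struct iommu_domain_ctx *dom);

/*
 * Simplified version of the behavior described in the talk: only an
 * exactly-one-page unmap gets a precise invalidation; everything else
 * flushes the whole domain, discarding translations that were still
 * valid and forcing extra (nested) page-table walks to refill the IOTLB.
 */
static void flush_iotlb_range(struct iommu_domain_ctx *dom,
                              uint64_t iova, size_t size)
{
    if (size == PAGE_SIZE)
        invalidate_page(dom, iova);
    else
        invalidate_all(dom);
}
```

A range-based invalidation that only drops the pages actually unmapped would avoid most of those refill walks, which is consistent with the improvement direction mentioned next.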
That is one area of improvement that might help with this situation. As for next steps, we are working on upstreaming the prototype to the Linux community and QEMU, and we will also be looking at experimenting with other vIOMMU use cases. Another aspect would be to find a solution for hybrid vIOMMU models, where we can have both a software vIOMMU and the hardware vIOMMU in the same guest: the hardware vIOMMU would be used for the passthrough devices, while the software vIOMMU would be used for the emulated devices. With the hybrid model, we can have a fully protected guest. Another aspect of the hardware vIOMMU, support for interrupt remapping, is currently under development. Thank you, and next we'll open it up for questions and discussion.