Good evening, everybody, and thank you for attending this last talk of the day. My name is Eric Auger; I work at Red Hat in the virtualization team. This talk is about virtual IOMMU implementation in the context of a particular use case, device passthrough, and I am going to focus on one implementation based on hardware nested paging.

First I will introduce the IOMMU hardware and some terminology that we are going to use throughout the presentation. Then we will focus on the device passthrough use case, looking in particular at the Linux and QEMU integration, and we will see how the physical IOMMU is used. We will add a little more complexity by bringing the virtual IOMMU into the picture, and we will see that, depending on the hardware, there are different implementation choices for this virtual IOMMU. I will detail the integration at Linux and QEMU level for the ARM IOMMU, which is called the System MMU, and I will conclude with the status of the work and the remaining challenges.

The IOMMU is a hardware block that sits between a DMA-capable device and the system interconnect. Its job is to translate DMA requests expressed in a given address space into an output address space, and on top of that it performs permission checks on the transactions. The principle is that the device tags its DMA transactions with a unique ID; on ARM this is called the stream ID, on Intel the source ID. This unique ID allows the IOMMU hardware to locate configuration structures, usually placed in external memory, which describe how the translation must be performed and what its characteristics are, and which point to a set of page tables that are also in external memory. Once the IOMMU has done this configuration lookup, it can actually perform the translation: it first looks into caches within the hardware, called the IOTLBs, and if there is a miss it has to do the page table walk in external memory.

Now let's have a look at the device assignment process. What does this use case consist of? It consists of exposing a physical device to a guest operating system. The intent is that the guest OS gets the ability to interact with and program the device as if it were physically attached to it. With respect to DMA programming, this means the guest is going to program the device's DMA target address registers using its own address space, the guest physical address (GPA) space, and we want the physical device to be able to reach the whole guest RAM. So somebody in that process needs to map each guest physical address onto the host physical address that backs the guest RAM. Besides this translation, we also want to guarantee that the assigned device cannot reach any other host physical address: typically, we do not want it to reach host physical addresses used by the host system itself, nor host physical addresses that back part of another VM. So this assignment process brings DMA isolation.
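To make this concrete, the mapping is typically established through the VFIO API: for each region of guest RAM, the virtual machine monitor asks the host kernel to map the guest physical addresses (the addresses the device will emit) onto the host memory backing them. Here is a minimal sketch in C; it assumes a VFIO container file descriptor has already been set up, and the function and variable names are just illustrative:

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch: map one region of guest RAM so the assigned device can DMA to it.
 * 'container_fd' is an already-configured VFIO container; 'guest_ram' is the
 * VMM's own mapping that backs the guest RAM region. */
static int map_guest_ram(int container_fd, void *guest_ram,
                         uint64_t guest_phys_addr, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uint64_t)(uintptr_t)guest_ram, /* host virtual address */
        .iova  = guest_phys_addr,                /* address the device will use */
        .size  = size,
    };

    /* The host IOMMU driver pins the pages and installs the GPA -> HPA mapping. */
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
```

This is essentially the path QEMU's VFIO code takes when guest memory regions are registered, and whatever the guest later programs into the device stays confined to what has been mapped this way.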
Now, if we stick to the scheme I have just described, the trouble is that the assigned device can reach the whole guest RAM, and in that sense it can jeopardize the integrity of the guest OS, because it can reach any GPA. What we want on top of that is to restrict the DMA capability of the assigned device to a particular region within the GPA space. This is achieved by adding a virtual IOMMU, that is, an IOMMU exposed to the guest this time. With that in place, the guest drivers no longer program the assigned device with guest physical addresses but with a dedicated address space called the IO virtual address (IOVA) space, and that is how we restrict the guest physical addresses reachable by the assigned device. On top of that it brings some convenience, such as scatter-gather: you are not forced to use contiguous guest physical addresses, since you can allocate from the IOVA space. In the Linux context, exposing a virtual IOMMU also means you can use the VFIO PCI driver in the guest. This is especially useful for two important use cases: the DPDK use case, where you want to run a network stack with userspace drivers that rely on the VFIO API, and nested assignment, where you want to assign the device again from within the guest; in that case you also use VFIO to achieve that goal.

Now, if you look at the picture on the right, you can see that you have two translation stages. I use the terminology of "stages"; Intel talks about "levels", but it is exactly the same thing. Here we are talking about logical stages. The first one is mastered by the guest operating system: when it programs the virtual IOMMU, it actually programs stage 1, which translates guest IOVAs into guest physical addresses. Then you have the translation stage we already had, programmed by the virtual machine monitor as part of plain device assignment, which translates guest physical addresses into host physical addresses. Those are logical stages; I am not making any assumption yet on how they are implemented at the hardware level. That is what we are going to see next.
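Just to fix ideas before looking at the hardware, here is a deliberately simplified model of those two logical stages; it uses flat, single-level lookup tables where real IOMMUs use multi-level page tables located through per-device configuration structures, so treat it as an illustration only:

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((UINT64_C(1) << PAGE_SHIFT) - 1)

/* Toy model: stage1[] maps a guest IOVA page to a guest physical page and is
 * owned by the guest; stage2[] maps a guest physical page to a host physical
 * page and is owned by the virtual machine monitor. */
static uint64_t translate(const uint64_t *stage1, const uint64_t *stage2,
                          uint64_t iova)
{
    uint64_t gpa = (stage1[iova >> PAGE_SHIFT] << PAGE_SHIFT) | (iova & PAGE_MASK);
    uint64_t hpa = (stage2[gpa >> PAGE_SHIFT] << PAGE_SHIFT) | (gpa & PAGE_MASK);
    return hpa;
}
```

Everything that follows is essentially about who maintains each of those two tables, and whether the hardware is able to walk both of them by itself.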
Now let's have a look at the hardware capabilities, because those two logical stages have to be implemented somewhere. Remember that the physical device you are going to assign to your guest is protected by the physical IOMMU, so at some point you need to program the physical IOMMU, and this programming must be consistent with the two logical stages we have just seen.

At the physical level there are different kinds of hardware implementations. The IOMMU can implement a single stage (a single level, in Intel terms) of the stage 1 kind; this is typically used for the scatter-gather use case, where an operating system translates an IO virtual address into a physical address. Or the hardware can implement a single stage that is not meant for scatter-gather but for the virtualization use case, that is, translating a guest physical address into a host physical address. Those are two different stages: the translation process is different for stage 1 and stage 2, and each has its own configuration structures, so it is a hardware design choice to implement either a stage 1 or a stage 2. Finally, some hardware implements what we call hardware nested paging, which consists in chaining the two translation processes: stage 1 is executed first and its output is fed as input into stage 2. This kind of architecture was really devised for the use case we just saw: one stage is in principle controlled by the guest operating system and the other stage is controlled by the hypervisor.

So we have two logical stages to implement and different kinds of hardware; let's see what implementation choices we have. The first case is hardware that only supports a single stage of translation, so you have two logical stages to map onto one physical stage. What is done is that the virtual machine monitor traps every configuration and translation structure update performed by the guest and does the stage 1 translation, IOVA to GPA, in software; this happens in QEMU, in the virtual IOMMU device. The virtual machine monitor can then associate the output of this translation with the host physical address that backs the memory, combine the two translations into one, and program the physical IOMMU; that combined mapping is what gets programmed at the physical level, at the bottom of the picture. That is what happens on map. When the guest destroys a mapping, it sends invalidation commands to clear the IOTLB entries corresponding to that mapping; this is trapped by the virtual machine monitor, which invalidates the mapping at the physical level as well. This is how it works when your hardware supports only one physical stage, and historically this is the implementation used for the x86 DMA remapping integration, that is, for the virtual IOMMU together with VFIO.

There are several issues with it. The first is that it induces a huge penalty for use cases that do a lot of map and unmap, because each time you map something you go through this whole process: the software translation, the GPA to HPA association, the hardware programming; it is very costly. Whether it matters depends on the use case: DPDK does not have such dynamic mappings, but a native Ethernet driver in the guest allocates, maps and unmaps DMA buffers very frequently, so for it this is very costly. Besides that, this is an existing implementation, and it works perfectly for Intel together with QEMU and Linux VFIO. The trouble is that when we tried to reuse this framework for the integration of the virtual ARM IOMMU, the System MMU, it turned out not to be possible, because the System MMU has one weakness here: the specification does not force invalidation on map. This means that when the guest creates new entries in its page tables, there is no way to trap that at the virtual machine monitor level. The way it works on Intel is that they devised a special mode called caching mode: when this bit is exposed by the virtual IOMMU device, the guest driver is forced to invalidate on map as well. This trick is not specified in the ARM System MMU specification, so we cannot reuse it. We were obliged to look at something else, and the alternative is based on hardware nested paging, because the System MMU specification does provision for this hardware implementation. It remains an implementation choice to support both stages at the physical level: some platforms won't implement it, but some do.
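Going back to the single-stage, shadowing case for a moment, the map path can be sketched as follows. When the virtual IOMMU device in QEMU sees the guest establish a stage 1 mapping, it resolves the guest physical address to the host memory backing it and programs the folded mapping through VFIO; the helpers gpa_to_hva() and notify_guest_map() are made-up names for this sketch, not the actual QEMU functions, and the flat guest RAM layout is an assumption:

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Assumption for this sketch: guest RAM is one contiguous block mapped in the
 * VMM's address space, starting at guest physical address 0. */
static uint8_t *guest_ram_base;

static void *gpa_to_hva(uint64_t gpa)
{
    return guest_ram_base + gpa;
}

/* Called when the emulated IOMMU traps the guest creating the stage 1 mapping
 * guest_iova -> gpa: the two logical translations are folded into the single
 * physical mapping guest_iova -> host memory. */
static int notify_guest_map(int container_fd, uint64_t guest_iova,
                            uint64_t gpa, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uint64_t)(uintptr_t)gpa_to_hva(gpa),
        .iova  = guest_iova,  /* the address the guest programmed into the device */
        .size  = size,
    };

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
```

Note that compared with plain assignment, the iova field now carries the guest IOVA rather than the guest physical address, and the symmetric VFIO_IOMMU_UNMAP_DMA call can only be issued when the guest's invalidation is trapped, which is exactly why the invalidate-on-map behaviour of caching mode is needed for this scheme to see the mappings at all.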
In that case, the mapping is straightforward, because you have two logical stages to map onto the two physical stages. One stage is mostly owned by the guest operating system and the other stage is owned by the virtual machine monitor. The principle is that each time the guest operating system updates its configuration structures, we trap the event at the virtual machine monitor level and we plug the stage 1 configuration built by the guest into the main configuration structure managed by the virtual machine monitor. The nice thing is that the structures are well isolated on ARM: on one side you have the stage 1 configuration, on the other side the main configuration structure that contains the stage 2 configuration, and they are well separated. The main configuration structure, which is called the stream table entry on ARM, is mastered by the virtual machine monitor for security reasons: you could not let the guest operating system update this structure directly, because it could change the mapping assigned to the stream.

Now let's see the changes needed to integrate this nested mode. It is a bit surprising, because nested mode has been used for CPU MMUs for a while, but for the IOMMU it had not been enabled yet, so the framework, both at QEMU and at kernel level, needed to be updated to support it. The first thing is that each time QEMU detects that a guest RAM region is added, we need to inform the host kernel so that it programs stage 2, the guest physical to host physical mapping; previously stage 2 was not used at all for this, it was done with stage 1. This goes through the VFIO ioctls, the DMA map path. Then, as I explained, each time the guest changes its configuration structures, we need to pass this information to the host kernel so that it can update the main configuration structure and link in the stage 1 configuration programmed by the guest operating system; we introduced a new ioctl at VFIO level to do that. We also have to propagate cache invalidation commands from the guest to the host. There is special handling for MSIs, which I will comment on in the next slide. And since all the translation, stage 1 and stage 2, is now performed in hardware, whenever a stage 1 translation fault is detected at the hardware level, we need to propagate this fault up to the guest, which requires some mechanics both in the IOMMU subsystem and at VFIO level to export that information. So you can see that at kernel level there is an impact at the VFIO level and at the IOMMU subsystem level, and there are state machine changes in the SMMU driver as well. At QEMU level there are quite a lot of modifications too, because we modify the existing framework: we add new notifiers so that each time the QEMU System MMU device detects one of those changes, the information is propagated to the VFIO PCI device, which then uses the new ioctls.
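To give a feel for what the new configuration ioctl looks like from user space, here is a sketch. The ioctl name, number and structure layout below are invented for illustration, they are not the ones proposed in the actual RFC series, but the flow is the one just described: QEMU traps the guest's configuration update and forwards the stage 1 configuration pointer to the host, which plugs it into the stream table entry it controls:

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Illustration only: a placeholder ioctl carrying the guest physical address
 * of the stage 1 configuration the guest has built. */
struct stage1_config_info {
    uint32_t argsz;
    uint32_t flags;        /* e.g. stage 1 enabled / bypassed / aborted */
    uint64_t config_gpa;   /* guest PA of the guest's stage 1 configuration */
};

#define VFIO_IOMMU_SET_STAGE1_CONFIG _IO(VFIO_TYPE, VFIO_BASE + 42) /* made-up number */

static int pass_stage1_config(int container_fd, uint64_t config_gpa, uint32_t flags)
{
    struct stage1_config_info info = {
        .argsz      = sizeof(info),
        .flags      = flags,
        .config_gpa = config_gpa,
    };

    /* The host SMMU driver links this stage 1 configuration into the stream
     * table entry it owns, next to the stage 2 configuration it already set up. */
    return ioctl(container_fd, VFIO_IOMMU_SET_STAGE1_CONFIG, &info);
}
```

The invalidation propagation follows the same pattern: the guest's invalidation commands are trapped by the virtual System MMU device in QEMU and forwarded to the host through another ioctl, so that the physical IOTLB entries covering the guest's stage 1 mappings are dropped.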
Now let's talk about MSIs on ARM, because it is a bit of a nightmare. The problem is that on most platforms, MSIs are translated by the System MMU. It is not like x86, where MSIs use special addresses belonging to a dedicated window, the APIC window; here they go through the System MMU like any other DMA.

The trouble is that historically, VFIO only mapped guest physical addresses onto host physical addresses for RAM; it did not specifically handle MSI doorbells. What we added in the past, to support MSIs along with PCI passthrough, is that the host kernel is responsible for allocating IOVAs, so that a mapping exists to reach the host doorbell. Then it follows the MSI routing hierarchy and injects the virtual MSI into the guest. So the essential point is that the device is programmed with a host IOVA in order to reach the physical doorbell. Now the problem is that we add another stage, and no stage 1 mapping exists that allows reaching the physical doorbell. On the guest side, the guest is also exposed to a virtual IOMMU, so it has exactly the same problem: it needs a guest IOVA that translates into the guest doorbell. But the two mappings do not match, so if you do nothing special, each time the assigned device issues an MSI it will fault at stage 1, because there is no mapping. So what we do in practice is that each time the guest operating system creates a mapping for its virtual doorbell, using a guest IOVA, we trap this event and pass it to the host kernel, so that it can reuse this guest IOVA and create a nested mapping that reaches the physical doorbell. It is a bit tricky, but eventually it works, and the assigned device then uses this guest IOVA, which goes through the two stages.

So the question is whether we actually still do the shadowing; I would prefer to keep that for the end, because it will take some time. Okay, so I am done with the MSI handling, and this works.

What is the current status? It has been tested on two new-generation ARM server platforms, from Qualcomm and Cavium. It is still at the RFC stage at both kernel and QEMU levels. The problem is that, as I explained, we have changes at the VFIO and IOMMU levels on the kernel side, and the APIs we are defining are supposed to be reusable for ARM, for VT-d, for the shared virtual memory use case on Intel and on ARM as well, and for virtio-iommu. So we need to define APIs that are generic enough to be reused for all those use cases; it takes time and we have not reached stability yet.

With respect to performance, nobody is very clear yet on what we will be able to achieve with this nested-stage integration. The problem is that at stage 1 all the addresses in use are guest physical addresses, whether they are the addresses of the configuration structures, of the page tables, or obviously the output of stage 1. All of that induces page table walks at stage 2, which means more traffic and more TLB entries being used.

I would like to complete this presentation with two pointers to presentations made last year at KVM Forum. The first one, made by Peter, was about exactly the same use case but on Intel, using the so-called caching mode, so you can dig into more details about that use case there. There is also a presentation about shared virtual addressing in KVM, which is a related use case, more Intel-oriented. And you have the pointers to the kernel and QEMU series if you want to test the feature.
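To give a rough sense of that performance concern, here is a back-of-the-envelope count (the level counts are assumptions, not measurements from the platforms mentioned). With an n-level stage 1 table and an m-level stage 2 table, every guest physical address touched during a stage 1 walk, that is, each of the n descriptors plus the final output address, must itself be resolved by a full m-level stage 2 walk, so a single translation miss can cost up to

    n(m + 1) + m = (n + 1)(m + 1) - 1

memory accesses. With four levels at each stage that is up to 24 accesses, versus 4 for a single-stage walk, which is why IOTLB and walk-cache hit rates matter so much for the nested setup.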
So, thank you for your attention.

Okay, maybe we can come back to Paul's question. I am not sure I fully caught your question, but maybe I can come back to this drawing. When the IRQ routing setup is done, when the KVM IRQ routing is set up, we actually trap the information about this binding. We pass it to the host, and we reuse this guest IOVA instead of the host IOVA that was formerly allocated by the host system. I am not sure about your question, so I would need to understand the question first. Yeah, I would prefer that, because it is quite... Okay, any other questions? Okay, thank you.