Hi everyone, I'm Eric Auger from Red Hat and together with Yi Liu from Intel, we are going to talk about the new IOMMUFD user API and the adaptation of the QEMU VFIO device on top of it. We will start by looking at the subset of the VFIO API which depends on and relates to the IOMMU subsystem. Then we will talk about attempts we made to extend this API to support new features such as nesting and virtual SVA. With this background, you should better understand how we came to the design of a new user API. We will present IOMMUFD, introduce the kernel skeleton RFC that implements it and the few VFIO kernel patches added on top of it. Then we will address the QEMU integration and the challenges it brings to adapt the QEMU VFIO device on top of it. Then I will hand over to Yi, who will talk about nested and virtual SVA support on top of IOMMUFD, and eventually we will give some prospects and draw some conclusions.

The VFIO API allows userspace to have direct access to DMA-capable devices. VFIO guarantees that the assigned devices are put in a protected context so that user-initiated DMAs are prevented from doing any harm to the rest of the system. This relies on the strong assumption that a physical IOMMU protects the actual device. The API is group-centric. The group is an IOMMU concept: it is the minimal isolation granularity for the IOMMU. Devices that cannot be isolated from each other belong to the same group, while others belong to different groups. Associations between devices and groups can be discovered through sysfs. Once a device is bound to the VFIO driver, a so-called VFIO group is created; this matches the IOMMU topology. If all the devices belonging to that group are bound to the VFIO driver or to a stub driver, the group is considered viable and can be attached to a so-called VFIO container. This container embodies the IOMMU address space shared by all the groups attached to it. Only once the group is attached to the container can userspace get an FD to the VFIO device and program it.

Now, since 2017, we have been trying to extend this API to support new features such as hardware nested paging and PASID. A set of new IOCTLs was proposed to handle PASID allocation, guest page table binding, cache invalidation and fault handling. First, we worked on the definition of an IOMMU user API and we ended up tunneling this user API through the VFIO user API. This was contributed by Intel, Linaro and Red Hat. The most difficult part was PASID support, actually. However, by this time, VDPA also needed to consume PASIDs. So different pass-through frameworks started to have the same needs and to duplicate the same logic. First, a centralized driver was created to manage PASIDs, which was called /dev/ioasid, and then eventually discussions headed towards putting the IO page table management in that same driver too, instead of in a separate place. On top of that, the VFIO IOMMU type1 implementation has become bigger and bigger with the addition of new features and this was not really scalable anymore. So devising a new driver would also give us the opportunity to alleviate the maintenance burden, but also to overcome some known shortcomings such as duplicated locked-page accounting across containers and bad virtual IOMMU performance with nesting. So eventually, despite the efforts spent on the VFIO integration and the fact that some parts were already upstream at kernel level (I'm thinking about the IOMMU user API, for instance), this direction was abandoned.
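(For reference, here is a minimal sketch of the legacy group/container flow described above; the group number, the device BDF and the buffer size are placeholders, and error handling is omitted.)

```c
/* Minimal sketch of the legacy VFIO group/container flow.
 * Group "26" and BDF "0000:02:00.0" are placeholders discovered via sysfs. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int legacy_vfio_setup(void)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group = open("/dev/vfio/26", O_RDWR);

    struct vfio_group_status status = { .argsz = sizeof(status) };
    ioctl(group, VFIO_GROUP_GET_STATUS, &status);          /* group must be viable */

    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);    /* attach group to container */
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);  /* select the type1 IOMMU backend */

    /* map 1 MiB of anonymous user memory at IOVA 0 */
    void *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)buf, .iova = 0, .size = 1 << 20,
    };
    ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

    /* only once the group is in the container can we get the device FD */
    return ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:02:00.0");
}
```

The key point is the ordering: the group has to be attached to the container before an IOMMU backend can be selected and before any device FD can be obtained.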
At the beginning of 2021, we stopped the developments around this VFIO integration and started discussions around a new user API for managing IO address spaces pointing to user memory. This new API is named IOMMUFD. It is separate from VFIO so that other frameworks like VDPA can use it. After long discussion threads and two spec proposals, RFCs were sent to the mailing list: the first one from Intel, and then new ones from Jason Gunthorpe, in March this year and a few days before this presentation, actually. In the rest of the presentation, we are going to focus on Jason's RFC and its integration in QEMU. It is also worth mentioning that plenty of work has been done at kernel level to prepare the transition towards IOMMUFD; I'm typically thinking about the DMA ownership rework at the IOMMU subsystem level.

Jason's RFC proposes a new /dev/iommu char device. The associated file descriptor is called the IOMMUFD. It is a container holding multiple IO address spaces, called IOASes, which are managed through FD operations. An IOAS represents a mapping between a set of IO virtual addresses and a set of physical pages backing user memory. It can be shared between multiple subsystems like VFIO and VDPA. Devices get attached to IO address spaces, and this attachment is backed by a so-called HWPT, a hardware page table, which acts as a wrapper around the IOMMU domain. The current skeleton introduces the infrastructure to manage the life cycle of those objects (IOAS, device, HWPT), the infrastructure to store IOVAs and physical pages, their pinning and the mappings between them, and it also exposes a few IOCTLs: typically to allocate and free IOASes (which returns IOAS IDs), to map and unmap IOVAs, and to copy mappings from one IOAS to another. On top of that, a kernel API is also provided for external drivers like vfio-pci to bind and unbind devices to an IOMMUFD. This returns a device ID and claims DMA ownership, meaning that a successful binding puts the device's group into a security context which isolates its DMA from the rest of the system. So in this skeleton, group isolation is enforced implicitly. An API is also provided to attach a device to an IOAS through a HWPT. Currently, a HWPT is automatically created with an empty IO page table to match the VFIO blocked-DMA semantics, and a HWPT ID is returned. The current RFC does not expose any IOCTLs able to manipulate it; however, the second part of this presentation will explain what it is used for.

On top of Jason's IOMMUFD skeleton, some additional patches were needed to allow the VFIO device to cooperate with the IOMMUFD subsystem. Yi contributed those. First, once bound to the VFIO driver, a device node now appears in devfs and you can open it without going through the legacy container/group API described earlier. Then two new IOCTLs were introduced to allow the binding of the device to a given IOMMUFD and the attachment of the device to an IOAS within this IOMMUFD. Those patches weren't officially submitted to the mailing list, but they have been available since the first RFC submission, on a dedicated branch on top of the branch provided by Jason.

Now let's cover the adaptation of the existing QEMU VFIO device on top of this new IOMMUFD user API. As IOMMUFD may not be a superset of VFIO IOMMU type1, both the legacy and the new implementation may need to coexist. The assumption is that new functionalities will be implemented only on top of IOMMUFD.
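(Before diving into the QEMU adaptation, here is a rough illustration of the new userspace flow just described. The IOAS ioctls come from Jason's RFC; the two VFIO device ioctls and the per-device node path are taken from the RFC branch and may still change, so treat this as a sketch rather than a stable API.)

```c
/* Sketch of the IOMMUFD-based flow from the RFC branch.  The VFIO device
 * ioctls and the /dev/vfio/devices path follow the RFC branch and may still
 * change.  Error handling is omitted. */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>
#include <linux/vfio.h>

int iommufd_vfio_setup(void *buf, size_t size)
{
    int iommufd = open("/dev/iommu", O_RDWR);
    int devfd = open("/dev/vfio/devices/vfio0", O_RDWR); /* per-device node, placeholder name */

    /* allocate an IO address space; the kernel returns an IOAS id */
    struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
    ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc);

    /* bind the device to the iommufd: claims DMA ownership for its group */
    struct vfio_device_bind_iommufd bind = {
        .argsz = sizeof(bind), .iommufd = iommufd,
    };
    ioctl(devfd, VFIO_DEVICE_BIND_IOMMUFD, &bind);

    /* attach the device to the IOAS; a HWPT is created automatically */
    struct vfio_device_attach_ioas attach = {
        .argsz = sizeof(attach), .iommufd = iommufd, .ioas_id = alloc.out_ioas_id,
    };
    ioctl(devfd, VFIO_DEVICE_ATTACH_IOAS, &attach);

    /* map user memory into the IOAS; the kernel picks the IOVA and
     * returns it in map.iova */
    struct iommu_ioas_map map = {
        .size = sizeof(map), .ioas_id = alloc.out_ioas_id,
        .flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
        .user_va = (uintptr_t)buf, .length = size,
    };
    ioctl(iommufd, IOMMU_IOAS_MAP, &map);

    return devfd;
}
```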
The goal was to port most of the existing code on top of the new API. One of the first things we did was to split the code into parts that are IOMMU agnostic and parts that are IOMMU specific. This mostly relates to the common.c file. In the IOMMU agnostic part, you will find all the code related to region and interrupt management; this is code that applies to the actual device. In the IOMMU related code, we tried to isolate code that looks generic (related to the VFIO address spaces and memory listeners) from the lower level code that manipulates the user API. One challenge is that VFIO is a group-centric API whereas IOMMUFD is device-centric. Conceptually, the VFIO container maps to the (IOMMUFD, IOAS) tuple: devices sharing the same IO page tables are attached to the same IOAS within the same IOMMUFD. We turned the VFIO container into a base class derived into two implementations, so we needed to devise a class interface. The implementations of that interface are supposed to fully hide the API-specific objects such as the containers, the groups, the IOASes and so on. At the moment we have defined those two interfaces; you can see them on the left. This is likely to be extended and refined because we are currently missing some kernel pieces to implement some functionalities like reset or migration.

For the user to be able to select a given backend, the legacy one or the new IOMMUFD one, a new iommufd object was introduced. To use the new backend, you instantiate such an object, referred to by a unique ID, and when you instantiate your VFIO PCI device, you reference this iommufd ID (an example command line is shown a bit further below). If you want to use the legacy backend, you still use the old command line style. Note that management tools also have the possibility to open the IOMMUFD beforehand and pass the FD to the iommufd object. While we have tested the legacy backend against regressions, the new IOMMUFD backend is not yet fully on par with the legacy one. As I explained earlier, we are missing some kernel features to support peer-to-peer, reset, migration and so on. It is still unclear anyway whether we will have full compatibility between the VFIO IOMMU type1 and the IOMMUFD implementations; this is still to be discussed. The QEMU integration at this stage is in the RFC state and we have not received many comments, besides the suggestion, I would say, to use a separate iommufd object to select the backend. So we are definitely waiting for your feedback. And now I am handing over to Yi for the second part of the presentation, dedicated to new use case support.

Yes, thanks Eric. I'm going to talk about the new use cases. Actually, this is teamwork by Nicolin Chen from NVIDIA, Eric from Red Hat, and Lu Baolu and me from Intel. First, I'd like to do a recap on nested translation. It is a feature implemented by the hardware IOMMU and mainly used by hardware-assisted virtual IOMMUs. It enables GIOVA and vSVA. Most platform vendors like Intel, ARM and AMD already support it. Nested translation has two translation stages: the result of the stage-1 translation is subject to the stage-2 translation to get access to the final physical page. Platform vendors have slight differences in their nesting support. For example, the translation structure hierarchy is different between Intel VT-d and ARM SMMU: Intel VT-d hardware uses the guest stage-1 page table directly in nested translation, while ARM SMMU finds the guest stage-1 page table through the guest context descriptor table, which is also known as the PASID table.
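(Here is the backend-selection example referred to above, with the syntax used by the QEMU RFC series; the property names may still change, and the host BDF is a placeholder.)

```
# new IOMMUFD backend (RFC syntax, subject to change)
qemu-system-x86_64 ... \
    -object iommufd,id=iommufd0 \
    -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0

# legacy container backend: unchanged command line
qemu-system-x86_64 ... \
    -device vfio-pci,host=0000:02:00.0
```

Management tools that want to open the IOMMUFD themselves would pass the already opened FD to the same iommufd object instead of letting QEMU open /dev/iommu.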
So, under nested translation, QEMU needs to associate the stage-1 and the stage-2 page tables to be nested. In IOMMUFD, each page table is represented by a hardware page table (HWPT) object. Together with the IOAS mentioned by Eric, QEMU needs to manipulate HWPTs and IOASes. Here, I'd like to explain how QEMU manipulates them with an example. In the example, the GPA IOAS stores guest physical address to host virtual address mappings. Device A is attached to it and an auto HWPT is created; it holds the mappings from the GPA IOAS. Device B is attached to an S2 HWPT, which is allocated by QEMU; it holds the mappings from the GPA IOAS as well. Device C is attached to an S1 HWPT, which is nested on the S2 HWPT. This enables nested translation in the IOMMU hardware for device C. In the long term, we may get rid of attaching to an IOAS and only use attaching to a HWPT; then we would not have the auto HWPT. But for now, to stay aligned with the existing VFIO container semantics, the auto HWPT is still in use.

After understanding how QEMU manipulates the page tables for nested translation, let's have a look at the software architecture for nested translation. First, the IOMMU driver needs to support the nested type of domain, and for ARM SMMU it even needs to handle the MSI doorbell properly under nested translation, as the MSIs are translated by both the virtual SMMU and the host SMMU. IOMMUFD and the device drivers like VFIO need to provide the UAPIs for userspace to set up and destroy nested translation. With such kernel support, QEMU enables nested translation in steps (a rough sketch of this flow is shown a bit further down): first acquiring the IOMMU capability, then allocating the stage-1 and stage-2 HWPTs according to the guest virtual IOMMU's configuration for the device, and finally attaching the device to the allocated HWPT through the device UAPI. As in the previous slides, Intel VT-d and ARM SMMU have different nesting support: the stage-1 HWPT on Intel VT-d represents the guest IO page tables, while for ARM SMMU it points to the guest PASID table, namely the context descriptor table. For nested translation, Nicolin Chen from NVIDIA, Eric and I have done a PoC; it is functional but not fully cleaned up yet. In the PoC, the stage-1 related IOMMU operations are handled by the virtual IOMMU, like allocating the stage-1 HWPT. In the framework part, an IOMMUFD device object is introduced to represent a device on the virtual IOMMU side. It provides callbacks, like attach HWPT, to support attaching the device to a specific hardware page table. Different device frameworks may have their own implementations of the callbacks, and in the long term we wish to move all the IOMMU related code out of the device specific folders, as most of the logic may be shared, such as the memory listener, address space and dirty page tracking code in the current device passthrough framework.

Now we have enabled nested translation, but only for the GIOVA usage. So let's talk about enabling vSVA in the latest IOMMUFD framework. vSVA is not a new thing; there have been several related topics at recent KVM Forums. Besides nested translation, vSVA also requires IOMMU fault reporting and PASID support. In vSVA, the stage-1 table is the guest CPU page table, which translates guest virtual addresses to guest physical addresses, and this is also why IOMMU fault reporting is required for vSVA. IOMMU faults include recoverable faults and unrecoverable faults; the recoverable fault is also known as a PCI page request, which is part of the PCI Address Translation Services page request interface. IOMMU fault reporting is generic across platform vendors.
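(As a rough sketch of the nesting setup steps just listed: the ioctls and structures below, IOMMU_HWPT_ALLOC, VFIO_DEVICE_ATTACH_IOMMUFD_PT and friends, follow the uAPI that later landed upstream in linux/iommufd.h and linux/vfio.h and differ in detail from the PoC discussed in this talk, so take it as an illustration of the idea only.)

```c
/* Illustration of the nesting setup sequence; names follow the later upstream
 * uAPI, not the PoC.  Error handling is omitted. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>
#include <linux/vfio.h>

int setup_nesting(int iommufd, int devfd, uint32_t dev_id, uint32_t gpa_ioas_id,
                  struct iommu_hwpt_vtd_s1 *guest_s1_cfg)
{
    /* stage-2 HWPT: holds the GPA mappings from the GPA IOAS and is marked
     * as a nesting parent */
    struct iommu_hwpt_alloc s2 = {
        .size   = sizeof(s2),
        .flags  = IOMMU_HWPT_ALLOC_NEST_PARENT,
        .dev_id = dev_id,
        .pt_id  = gpa_ioas_id,
    };
    ioctl(iommufd, IOMMU_HWPT_ALLOC, &s2);

    /* stage-1 HWPT nested on the stage-2 one; on Intel VT-d the vendor data
     * points to the guest IO page table configured through the vIOMMU, on
     * ARM SMMU it would point to the guest CD (PASID) table instead */
    struct iommu_hwpt_alloc s1 = {
        .size      = sizeof(s1),
        .dev_id    = dev_id,
        .pt_id     = s2.out_hwpt_id,
        .data_type = IOMMU_HWPT_DATA_VTD_S1,
        .data_len  = sizeof(*guest_s1_cfg),
        .data_uptr = (uintptr_t)guest_s1_cfg,
    };
    ioctl(iommufd, IOMMU_HWPT_ALLOC, &s1);

    /* attach the device to the stage-1 HWPT: nested translation is now
     * active for this device */
    struct vfio_device_attach_iommufd_pt attach = {
        .argsz = sizeof(attach),
        .pt_id = s1.out_hwpt_id,
    };
    return ioctl(devfd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach);
}
```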
However, the PASID support differs due to hardware differences. For example, on Intel VT-d the guest PASID table needs to be shadowed into the host PASID table under nested translation, which means guest PASID support requires hypervisor interception. But for ARM SMMU, the guest PASID table is used by the hardware directly, so guest PASID support does not need hypervisor interception.

IOMMU fault reporting is based on the kernel IOMMU fault reporting framework. Lu Baolu from Intel is moving the fault reporting to be per IOMMU domain. Accordingly, the IOMMUFD core registers a fault handler for the hardware page tables, as the hardware page table is a wrapper around the IOMMU domain. The fault handler is responsible for reporting the fault to userspace and for storing the fault data in a ring buffer. The report is signalled through an eventfd: QEMU passes the eventfd to the kernel in the stage-1 hardware page table allocation. Along with the stage-1 hardware page table allocation, a fault FD is also returned to QEMU, so that QEMU can get the fault data from this fault FD (a rough sketch of the userspace loop follows at the end of this part). After getting the fault data, the QEMU vIOMMU emulation code injects the fault into the virtual machine per the vendor spec. If the fault is a recoverable fault, QEMU sends a response back to the hardware after the guest has served the fault.

We still have opens on the PASID support, especially for PASID virtualization, but this slide points out the necessary changes for it. IOMMUFD will need to provide a UAPI for host PASID allocation and for the guest PASID to host PASID mapping. The IOMMUFD core needs to provide a kernel API for PASID attachment and also an API to query the guest PASID and host PASID. On the device driver side, VFIO for example needs to extend the attach HWPT UAPI to support PASID. Last but not least, KVM needs to provide a UAPI to update the VMCS PASID translation table; this is required for Intel ENQCMD support.
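(A hypothetical sketch of the userspace side of this fault path is shown below; the fault record layout, the read/response protocol and the helper functions are all placeholders, since the fault reporting uAPI was still being defined at the time.)

```c
/* Hypothetical sketch of the userspace fault loop.  "struct fault_record" and
 * the helpers below are placeholders, not a real kernel ABI.  The fault FD is
 * assumed to be non-blocking. */
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

struct fault_record {      /* placeholder layout */
    uint32_t flags;        /* e.g. whether this is a recoverable page request */
    uint32_t pasid;
    uint64_t addr;
    uint32_t grpid;        /* used to match the page response */
};

/* placeholders for the vIOMMU emulation side */
void inject_fault_to_guest(const struct fault_record *rec);
int  fault_is_recoverable(const struct fault_record *rec);
void queue_page_response(int faultfd, const struct fault_record *rec);

void fault_loop(int eventfd, int faultfd)
{
    struct pollfd pfd = { .fd = eventfd, .events = POLLIN };
    uint64_t cnt;
    struct fault_record rec;

    while (poll(&pfd, 1, -1) > 0) {
        read(eventfd, &cnt, sizeof(cnt));           /* kernel signalled new faults */
        while (read(faultfd, &rec, sizeof(rec)) == sizeof(rec)) {
            /* inject into the VM per vendor spec (VT-d or SMMU emulation) */
            inject_fault_to_guest(&rec);
            if (fault_is_recoverable(&rec))
                queue_page_response(faultfd, &rec); /* after the guest serves it */
        }
    }
}
```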
On the QEMU side, both the Intel VT-d and the ARM SMMU emulation code are going to be updated to support the fault handling and the PASID capability for vSVA, and the Intel virtual IOMMU further requires PASID communication with the device model.

Now some conclusions for this presentation. First, IOMMUFD is a major redesign; there is significant rework at both kernel and userspace level. The IOMMUFD spec is still unstable, especially for new features like PASID. The feasibility of deprecating VFIO IOMMU type1 is not guaranteed at this point. And there are lots of kernel dependencies for the QEMU integration, like clean-ups and VFIO/IOMMU code reshuffles; most of them are not merged yet. Nicolin Chen, Lu Baolu, Eric and I are working on prototyping nesting and vSVA on Intel and ARM. Discussions need to happen to integrate other vendors like AMD. Other VFIO IOMMU backends, such as VFIO IOMMU SPAPR TCE, need to be addressed at some point in the future. And other new features, like page request for VFIO IOMMU type1, are blocked waiting for IOMMUFD.

There are some references for people who want to know more about this project: the mailing list series that led to IOMMUFD, two informative threads on the IOMMUFD UAPI discussion, and some patch series posted after the IOMMUFD discussion, like the first IOMMUFD RFC from me, the IOMMUFD generic interface RFC from Jason Gunthorpe, and the IOMMUFD adoption in QEMU from Eric and me. I also want to clarify that the GitHub branches pasted here are based on the IOMMUFD generic interface RFC v1; we are in progress rebasing everything on top of v2. There are also some related talks at various conferences in recent years which may be helpful as well.

Lastly, thank you for your attention to this presentation. If you have any questions, please feel free to let Eric and me know. Thanks.