Hello, everyone. Thanks for joining this session. My name is Yi Liu from Intel Corporation. Today, I'm going to share the PASID management in KVM together with my colleague, Jacob Pan. This is the agenda for the session. I will do a PASID recap first, and then review the PASID usage in Shared Virtual Addressing and Intel Scalable IOV. After that, Jacob will introduce more details on the PASID management from the software side.

PASID stands for Process Address Space ID. With its introduction, DMA remapping happens at request-ID plus PASID granularity. To achieve such isolation granularity, platform vendors also support an IOMMU PASID table. It is a per-device table by hardware design, and where it is stored in a virtualization environment differs across vendors. For Intel Virtualization Technology for Directed I/O (VT-d), it is maintained by the host in IOMMU nested translation. But for ARM System Memory Management Unit version 3 (SMMUv3) and the AMD IOMMU, it is maintained by the guest in nested translation. This difference results in different ways to set up IOMMU nested translation for the guest: one is to bind guest PASIDs to host PASIDs one by one; the other is to bind the whole guest PASID table to the host.

Okay, let's see the PASID usage in Shared Virtual Addressing. The diagram on the left shows the steps to set up SVA usage natively. The application issues a bind-process request, which goes into the device driver and is then passed on to the IOMMU driver. The IOMMU driver allocates a PASID and binds it to the CPU page table of the current process by creating a PASID entry in the PASID table. Then the PASID is programmed into the hardware device. After that, the device is able to access the process virtual address space with the PASID. When it comes to virtualization, the guest follows the same steps to set up SVA. However, the hypervisor needs to trap guest-specific operations in order to set up nested translation for SVA.
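The native bind flow just described can be sketched as a tiny userspace model: the "IOMMU driver" hands out a free PASID and records the process page table in a PASID table entry. This is a toy illustration only; the names (`struct mm`, `sva_bind_mm`, `MAX_PASID`) are hypothetical and not the real kernel API.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the native SVA bind flow: allocate a PASID and record
 * the process page table in the PASID table entry. All names here are
 * illustrative, not the Linux kernel's actual interfaces.
 */
#define MAX_PASID 16

struct mm { void *pgd; };                 /* stand-in for a process mm    */
static struct mm *pasid_table[MAX_PASID]; /* entry 0 reserved for no-PASID */

/* Allocate a free PASID and bind it to the given mm; -1 on failure. */
static int sva_bind_mm(struct mm *mm)
{
    for (int pasid = 1; pasid < MAX_PASID; pasid++) {
        if (!pasid_table[pasid]) {
            pasid_table[pasid] = mm; /* create the PASID table entry   */
            return pasid;            /* caller programs it to the device */
        }
    }
    return -1;
}

/* Unbind: clear the entry so the PASID can be reallocated. */
static void sva_unbind(int pasid)
{
    if (pasid > 0 && pasid < MAX_PASID)
        pasid_table[pasid] = NULL;
}
```

After `sva_bind_mm()` returns, the device can issue DMA tagged with that PASID into the process address space, which mirrors the last step of the diagram.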
For example, Intel VT-d requires trapping guest PASID allocation and PASID cache flushes to set up IOMMU nested translation for each PASID on the host side. For vendors that let the guest maintain the guest PASID table under nested translation, the hypervisor needs to trap the guest PASID table initialization to bind the whole guest PASID table to the host. After guest SVA is set up, the device is able to access the virtual address space of the guest application with the PASID.

PASID is also the foundation of Intel Scalable IOV. Each assignable device interface (ADI) is associated with a host PASID. This association happens when its parent device is attached to a domain in an auxiliary manner. The IOMMU driver allocates a default PASID for such auxiliary domains. With the default PASID, the assignable device interface can access the virtual machine's guest physical address space. So by now, a PASID no longer just tags a process address space; it can also tag a guest physical address space. It is actually an I/O address space ID instead of just a process address space ID. This is also the basis of the PASID management in software.

Since SVA and Scalable IOV are both based on PASID, can they coexist? The answer is yes, since they are orthogonal I/O technologies. For example, we can set up guest SVA on top of assigned ADIs; the guest just needs to follow the normal steps for SVA setup. However, there is still a difference in PASID programming between physical functions and ADIs. For physical functions, PASIDs from the guest are programmed to the hardware directly, while for ADIs, PASID programming is mediated by the host, which means a PASID from the guest must first be converted to a host PASID and then programmed to the hardware. My colleague Hao just talked about the ENQCMD instruction for this guest-PASID-to-host-PASID translation.
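The two programming paths just described can be condensed into a few lines: a PF takes the guest PASID unchanged, while an ADI takes the host PASID that backs it. The fixed-size lookup table and the function name `program_pasid` are illustrative assumptions, not real driver code.

```c
#include <assert.h>

/*
 * Sketch of the two PASID programming paths: direct for a physical
 * function, host-mediated (translated) for an ADI. The g2h table and
 * all names are hypothetical.
 */
#define NR_GPASID 8

static int g2h[NR_GPASID]; /* guest PASID -> host PASID, 0 = unmapped */

/* Return the PASID value that actually reaches the hardware. */
static int program_pasid(int guest_pasid, int is_adi)
{
    if (!is_adi)
        return guest_pasid;   /* PF: guest programs hardware directly */
    return g2h[guest_pasid];  /* ADI: host substitutes its own PASID  */
}
```

So for the same guest PASID value, the hardware may observe two different tags depending on whether the device is a PF or an ADI, which is exactly the conflict potential discussed next.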
Coming to the example in the diagram: when the VM programs its guest PASID to the devices, device A, a physical function, gets the guest PASID, while device B, an ADI, gets a host PASID. Considering the host support for nested translation and the I/O page requests from devices, the host will see both guest PASIDs and host PASIDs simultaneously. This may cause potential conflicts, so we need proper PASID management on the host side. This will be introduced by Jacob. Hi Jacob, I think you can take over now. Thanks.

Hi, thanks Yi for the introduction. Again, my name is Jacob Pan. I will continue to talk about IOASID and PASID management. This is a generic library we introduced in kernel version 5.5, and ever since, we have been continuously improving it to meet guest SVA and Scalable IOV use cases. In the next few slides, we'll touch on four aspects of the PASID management. The first one is guest-host PASID mapping. Then, because IOASID is a system-wide resource, we'll talk about partitioning and namespace support. Also, an IOASID is not just a single simple number: it has multiple users with hardware contexts associated with it, so we'll talk about how to synchronize the IOASID states when things change. In the end, we'll walk through a typical life cycle of an IOASID in the normal flow; we probably don't have time to talk about exception cases.

Now we move on to guest and host PASID mapping. So far, there are two main approaches to support guest PASIDs. The first is to shadow the guest PASID table; this is used by VT-d scalable mode. The second is to bind the guest PASID table; this is used by ARM SMMUv3. Now, looking from the IOASID point of view, what are the requirements, given that IOASID is a generic infrastructure? If you look at the shadowing approach, it requires a guest-host PASID translation, because the guest PASID may not be equal to the host PASID. It also requires every single guest PASID to have a host PASID backing.
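The shadowing requirements just listed can be modeled in a few lines: binding a guest PASID always allocates a backing host PASID, and the mapping must be remembered because the two values generally differ. The starting host value 101 and the name `shadow_bind` are illustrative assumptions, not the kernel's implementation.

```c
#include <assert.h>

/*
 * Toy model of the shadowing approach: every guest PASID gets a host
 * PASID backing, tracked in a guest->host map. Names and values are
 * hypothetical.
 */
#define NR_GPASID 8

static int shadow_map[NR_GPASID]; /* guest PASID -> host PASID, 0 = none */
static int next_host_pasid = 101; /* illustrative host allocation cursor */

/* Ensure the guest PASID has a host backing; return the host PASID. */
static int shadow_bind(int guest_pasid)
{
    if (!shadow_map[guest_pasid])
        shadow_map[guest_pasid] = next_host_pasid++;
    return shadow_map[guest_pasid];
}
```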
Because in scalable mode we support the so-called shared work queue, meaning a single work queue can be assigned to multiple VMs, the PASID value must be unique in order to distinguish the DMA streams by PASID. That's why the PASID has to be system-wide. There's also a little caveat when it comes down to PF assignment. When you assign a PF to the guest, the guest can directly program a PASID onto the physical device, which is not mediated. This is fine when you don't have any other devices that allocate PASIDs from the host. But when it is mixed with assignable device interfaces, that is, mediated devices, which do allocate PASIDs from the host, they may create a conflict. Therefore, some sort of enforcement must be done to prevent this conflict. If you look at the second approach, binding the PASID table, it's much simpler, because the guest simply owns the PASID table and the host simply doesn't care. So in terms of requirements for IOASID, approach number one is a superset.

IOASID is a limited system-wide resource: in the PCIe spec, PASID is 20 bits. So we must partition IOASIDs into groups in order to support multiple users. In this example, we have three IOASID groups; we call them IOASID sets. Set zero is used for host usage, such as native SVM or native IOVA. IOASID sets one and two are given to the two VMs we have here. In terms of namespaces, we do not support multiple namespaces: each native environment has only one single namespace for IOASIDs. So in this example, VM1 could have guest PASID number one, and VM2 can also have guest PASID number one, but the backing host IOASIDs are different, 101 and 102 respectively. They must be unique to identify the DMA streams that match different page tables. In terms of sizes, we illustrate here that VM2 has a smaller quota for its IOASID set, because the system administrator may give less resource to VM2 than to VM1.
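The set partitioning and quota scheme above can be sketched as a single system-wide allocator with per-set accounting: host IOASIDs stay unique across all sets even when two VMs reuse the same guest PASID number, and each set can run out independently. The struct layout, function name, and base value 100 are illustrative assumptions.

```c
#include <assert.h>

/*
 * Toy model of system-wide IOASID allocation with per-set quotas.
 * One shared counter keeps values globally unique; each set tracks
 * its own quota. Hypothetical names, not the kernel's ioasid API.
 */
#define MAX_IOASID 1024

struct ioasid_set_model { int quota; int used; };

static int next_ioasid = 100; /* shared, system-wide allocation cursor */

/* Return a globally unique IOASID, or -1 if the set's quota is spent. */
static int ioasid_alloc_from(struct ioasid_set_model *set)
{
    if (set->used >= set->quota || next_ioasid >= MAX_IOASID)
        return -1;
    set->used++;
    return next_ioasid++;
}
```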
That could depend on how many assigned devices they have, or other use cases.

So an IOASID is not a simple number. On a real system, it has many users, and each user may have hardware context associated with it. In this particular example, we use Intel's VT-d Scalable IOV platform. On this platform, we could have five potential users of an IOASID. The CPU has a PASID MSR; this is used for the ENQCMD instruction, and it must be set up before ENQCMD can be used. VFIO is a pure software construct; however, the hypervisor must use VFIO to communicate with the kernel for allocation, free, bind, and unbind of the PASID. The IOMMU, of course, stores the PASID table and context; it performs the actual bind and unbind of the guest page table to the PASID. KVM maintains a PASID translation table that performs the guest-host PASID translation. Device drivers, such as mediated device drivers, program the actual PASID onto the device in order to generate DMA requests in that PASID stream.

So, in order to synchronize all these users when the PASID state changes, such as on unbind, we must have some sort of notification. Here we proposed an IOASID notifier chain for each VM, or each IOASID set. For example, when the PASID is unbound, KVM will receive an unbind notification event and then tear down its entry in the PASID translation table. There are also notifier priorities and other things we will talk about in the upcoming slides.

Now let's talk about the PASID life cycle; hopefully it's a little more interesting than a block diagram. On a typical path, it consists of five steps. The first is initialization; this is done on a per-VM basis. On a per-PASID basis, it really has just four steps: allocation of the PASID, bind with a guest page table, then unbind, and free. DMA can start after step three. So now let's walk through this example, using VT-d scalable mode. It starts with a VM issuing an allocation of the PASID.
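The notifier chain just proposed can be modeled minimally: users such as KVM register a callback, and an unbind event lets each one tear down its own per-PASID state. The event names, the fixed-size chain, and `kvm_cb` are all hypothetical; the real kernel mechanism also carries priorities, which this sketch omits.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Minimal model of a per-set IOASID notifier chain: registered users
 * are called back on PASID state changes. All names are illustrative.
 */
enum ioasid_event { IOASID_BIND, IOASID_UNBIND };

typedef void (*ioasid_notify_fn)(enum ioasid_event ev, int pasid,
                                 void *data);

#define MAX_NOTIFIERS 4
static struct { ioasid_notify_fn fn; void *data; } chain[MAX_NOTIFIERS];
static int nr_notifiers;

static void ioasid_register_notifier(ioasid_notify_fn fn, void *data)
{
    if (nr_notifiers < MAX_NOTIFIERS) {
        chain[nr_notifiers].fn = fn;
        chain[nr_notifiers].data = data;
        nr_notifiers++;
    }
}

static void ioasid_notify(enum ioasid_event ev, int pasid)
{
    for (int i = 0; i < nr_notifiers; i++)
        chain[i].fn(ev, pasid, chain[i].data);
}

/* Example user: "KVM" clears its guest->host translation on unbind. */
static void kvm_cb(enum ioasid_event ev, int pasid, void *data)
{
    int *xlate = data;
    if (ev == IOASID_UNBIND)
        xlate[pasid] = 0;
}
```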
Once the PASID is allocated, it can be passed down through the VFIO bind interface; this gets propagated to the IOMMU driver to perform the bind. It basically sets up the PASID entry in the PASID table with nested translation enabled. Once the PASID is bound to a guest page table, the IOMMU driver can notify the rest of the users of the status change. Then the PASID is ready to go for all the users: for KVM, so the CPU can issue ENQCMD, and for the device driver, so it can start doing DMA on that PASID. So the notification reaches the other users such as KVM and device drivers. Of course, during initialization time, KVM and device drivers must register for notifications. Those are on a per-VM, or per-process, basis; we have a mechanism to identify the common token between users of the same process.

The tear-down is just the reverse process. It is initiated from the VM doing an unbind, passed on to the IOMMU driver to perform the unbind, which then notifies KVM and the device driver. For example, when KVM gets notified, it will clear its PASID translation table entry for that particular PASID. We also implemented a reference counting mechanism to make sure the PASID life cycles are cleanly aligned, meaning the PASID will not be returned to the pool for reallocation until the last user drops its reference to the PASID.

So now the status update: our team has been working on Scalable IOV enabling and shared virtual memory for the past two to three years, and we have made a lot of progress in the IOMMU driver, the VT-d driver, and the user APIs. Here we list a couple of related features: PASID/IOASID core enhancements and VFIO interfaces. We also have some opens we want to discuss in this forum. The first one is: should IOASID allocation be done exclusively through VFIO, as we have today, or can we have a standalone user API? It has many implications for complexity and life cycle management. Also, how can a user manage the IOASID quota from a system-admin perspective?
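The ref-counted life cycle described above can be captured in a small state machine: a free requested while users still hold references is deferred, and the PASID only returns to the pool when the last reference is dropped. This is a single-PASID sketch with hypothetical names, not the kernel's implementation.

```c
#include <assert.h>

/*
 * Toy model of the ref-counted PASID life cycle: free is deferred
 * until the last user drops its reference, so the PASID cannot be
 * reallocated while anyone still uses it. Names are illustrative.
 */
struct pasid_state {
    int refs;         /* outstanding user references            */
    int free_pending; /* free requested while still referenced  */
    int in_pool;      /* 1 = returned to pool, reallocatable    */
};

static void pasid_get(struct pasid_state *p)
{
    p->refs++;
}

static void pasid_put(struct pasid_state *p)
{
    if (--p->refs == 0 && p->free_pending)
        p->in_pool = 1; /* last user gone: complete the free */
}

static void pasid_free(struct pasid_state *p)
{
    p->free_pending = 1;
    if (p->refs == 0)
        p->in_pool = 1; /* no users: free immediately */
}
```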
We have done some research on using rlimit or cgroups, but those seem too heavy-handed for simple PASID allocation quota management.

So finally, let's summarize what we covered today. We talked about DMA requests with PASID: with DMA remapping, isolation is done at request-ID and PASID granularity. PASID in Linux is managed by the IOASID core. We don't support multiple namespaces for PASIDs on the host, but each guest has its own PASID namespace. A PASID can have multiple users, each with hardware context associated with it, so we must synchronize them during setup and tear-down; we use notifications and ref counting to manage the life cycles. I want to thank you for your attention, and we are ready for questions and more discussions on the two opens we have. Thank you so much.