Hi everyone, welcome to this presentation on shared virtual addressing for high-performance ARM infrastructure platforms. My name is Vivek Gautam and I am part of the open source group at ARM.

Here's a brief outline of my talk. I'll first introduce shared virtual addressing. Then we'll talk about the hardware and software requirements for implementing SVA. We'll then go into the virtualization use case, since virtualization is one of the key enabling technologies on infrastructure platforms. After that we'll talk about the current design as proposed by the set of patches that have been posted to the mailing lists, and finish with a brief look at the upstream status.

Before introducing what shared virtual addressing is, let's first try to understand why we need it. Today's infrastructure platforms typically deploy a number of high-performance accelerators such as general-purpose GPUs, SmartNICs, and so on. These accelerator devices usually sit on the PCI bus and can have their own private cache and memory; the PCI bus provides the backbone bus architecture for most infrastructure platforms. These accelerator devices can also be virtualized by various IO virtualization techniques such as PCI pass-through or emulation.

In this diagram we see a very simply laid out ARM infrastructure platform. On the left we have a host system with a number of ARM CPUs. A CMN interconnect provides the backbone through which the various masters access host memory. The ARM SMMU provides address translation for the IO devices, and the PCI root complex has a number of lanes to which PCI devices can be connected.

When the accelerator device wants to initiate a DMA, it requests the host to provide a DMA buffer. The host software prepares a DMA buffer and programs its address into a register of the accelerator device. One thing to note here is that the accelerator device, which will now be using this DMA buffer, and the host each have their own view of memory; they cannot work together on the same address space. In this traditional memory model the programming complexity increases, since a separate DMA buffer has to be prepared for every workload the host software programs into the accelerator. A memory copy is often involved, which also introduces performance degradation. Additionally, today's infrastructure platforms deploy coherency protocols such as CCIX or CXL, and in the absence of SVA, programmers are not able to take full advantage of that coherency.

So what is shared virtual addressing? SVA is a technique that allows sharing the same virtual address space between the CPU and an IO device. Both the CPU and the IO device work on the same virtual address pointer and can access the same physical memory. SVA gives the accelerator device the ability to perform DMA on the process address space rather than using a separate DMA buffer.

In the diagram above, the CPU and the PCI accelerator device have distinct virtual address spaces. The CPU uses the MMU to do its VA-to-PA translation, and the MMU walks its own set of page tables, the CPU page tables. The PCI accelerator device, on the other hand, has a different IO virtual address space, and the SMMU that provides its translation walks separate device page tables. SVA tries to bridge this gap.
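To make that contrast concrete, here is a minimal kernel-driver sketch of the traditional DMA-buffer model just described. It is illustrative only: the register offset ACCEL_REG_DMA_ADDR and the accel_start_dma() function are hypothetical names, not from any real driver.

```c
#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/string.h>

#define ACCEL_REG_DMA_ADDR	0x40	/* hypothetical register offset */

static int accel_start_dma(struct device *dev, void __iomem *regs,
			   const void *src, size_t len)
{
	dma_addr_t dma_handle;
	void *buf;

	/* Allocate a dedicated DMA buffer in the host's view of memory... */
	buf = dma_alloc_coherent(dev, len, &dma_handle, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	/* ...copy the workload into it (the extra copy that SVA removes)... */
	memcpy(buf, src, len);

	/* ...and program the buffer's bus address into a device register. */
	writeq(dma_handle, regs + ACCEL_REG_DMA_ADDR);

	return 0;
}
```

With SVA, none of this is needed: the driver can hand the device a user space virtual address directly.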
It basically gives the PCI accelerator device the ability to use the same virtual address pointer as the CPU. Both the MMU and the SMMU are able to walk the same set of page tables, prepared by the CPU. Therefore any workload that a program running on the CPU wants to offload to the accelerator can be handed over by simply passing the virtual address pointer to the device and starting the DMA. Existing programming models such as OpenCL 2.0 shared virtual memory or CUDA unified virtual memory can make use of this shared virtual addressing.

Some of the advantages of SVA that we see: there is no need for a separate set of page tables for the device, since the SMMU can walk the same CPU page tables to do its address translation. There is definitely reduced programming complexity; there is no need for a separate specialized driver running in user space or the kernel to provide a DMA buffer to be programmed into the device. Sharing data between the processors and the device also becomes easier. There are fewer cache and TLB maintenance operations, since the CPU and the device use the same virtual address pointer; the CPU does not have to do a cache flush for the device to see the latest copy of the data. In a system that implements cache coherency protocols, SVA brings additional advantages: it enables zero cache maintenance operations, and it allows the CPU and the device to work on the same buffer at the same time using atomic operations.

Now let's look at some of the hardware and software requirements for realizing SVA. In the first diagram we saw the basic pieces present in an infrastructure platform: a PCI device behind a PCI root complex, a translating agent such as the ARM SMMU, and the various CPUs. On the PCI side, the PCI specification defines a set of protocols, namely Address Translation Services (ATS), the Page Request Interface (PRI), and the Process Address Space Identifier (PASID). ATS is a memory-access-type TLP that can be initiated by a PCI device; it allows the device to request an address translation prior to initiating a DMA. The Page Request Interface, or PRI, allows the device to request IO page fault handling: whenever the address translation fails as part of an ATS request, the device can send a PRI TLP and the translating agent will try to serve this page request. The PASID is an additional ID that is emitted by the PCI device along with the requester ID; it helps the SMMU identify the right translation tables for the various devices and address spaces, including devices assigned to virtual machines.

On the ARM IP side, SMMU version 3 supports all of these PCI protocols. As part of its ATS support, SMMUv3 provides the address translation for incoming ATS requests. As part of its PRI support, SMMUv3 implements a PRI queue (PRIQ) that is filled with messages coming in over the page request interface. The SMMUv3 SubstreamID is analogous to the PASID and helps in identifying the various translation regimes. SMMUv3 also supports nested translation with the help of its data structures, the stream table entries and the context descriptors. We need support for nested translation because, in the virtualization case, the device usually needs both stage 1 and stage 2 translation so that the device is isolated at the VM level as well.
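As a rough illustration of what enabling these PCIe features involves, here is a sketch using the Linux kernel's PCI helpers (pci_enable_pasid(), pci_enable_pri(), pci_enable_ats()). In practice, on SMMUv3 systems the IOMMU driver typically enables these on the device's behalf; the wrapper name and the page-request credit value here are illustrative.

```c
#include <linux/pci.h>

/* accel_enable_sva_features() is a hypothetical wrapper name. */
static int accel_enable_sva_features(struct pci_dev *pdev)
{
	int ret;

	/* PASID: tag each transaction with a process address space ID. */
	ret = pci_enable_pasid(pdev, 0);
	if (ret)
		return ret;

	/* PRI: let the device ask the host to service IO page faults;
	 * 32 outstanding page requests is an arbitrary example value. */
	ret = pci_enable_pri(pdev, 32);
	if (ret)
		goto err_pasid;

	/* ATS: let the device request VA->PA translations before DMA. */
	ret = pci_enable_ats(pdev, PAGE_SHIFT);
	if (ret)
		goto err_pri;

	return 0;

err_pri:
	pci_disable_pri(pdev);
err_pasid:
	pci_disable_pasid(pdev);
	return ret;
}
```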
The stream table entry in SMMUv3 stores the configuration for stage 2 translation, whereas the context descriptor provides the translation configuration for stage 1.

On the software side, looking at the virtualization scenario, the pieces required to implement SVA are: support in a virtual machine manager, such as the KVM tool, for hosting a number of virtual machines; and, in addition to KVM support, additional support in VFIO as well as in virtio-iommu. The para-virtualized IOMMU, virtio-iommu, present in the guest provides the DMA remapping capability for devices used by drivers in the guest kernel. The VFIO (Virtual Function IO) framework provides the IO virtualization technique called PCI pass-through, or device assignment, which allows a PCI device to be assigned directly to the guest. These software pieces will be explained further in the upcoming slides.

Now, in order to understand the SVA flow, let us first look at it at the host operating system level. This will help in identifying the various pieces and how to put them together to realize a complete SVA system. Any SVA system, for example one running in a host kernel environment, involves three basic things: SVA binding, page table preparation, and IO page fault handling. SVA binding refers to the process of binding the device's PASID to the process address space; this allows any incoming request from the device to be matched to the right virtual memory address space. As part of page table preparation, the SMMUv3 context descriptors are programmed with the CPU page table information; the CPU page tables now serve as the stage 1 translation tables that the SMMUv3 walks. The IO page fault handler running in the kernel should call the kernel page fault handler to update the CPU page tables, so that any page faults can be handled. Once a page fault is handled, the IO page fault handler also sends a PRI response command, an SMMUv3 command that signals successful completion of the PRI request back to the device.

On the right is a diagram showing the flow of SVA in a host environment. A PCI class driver controls the PCI device, which can talk directly to the SMMUv3 hardware; the SMMUv3 hardware is controlled by the SMMUv3 driver running in the host kernel. To start using SVA, the PCI class driver first enables the hardware features required to realize it, namely the ATS and PRI functionality on the device; the corresponding ATS and PRI functionality in the SMMUv3 is enabled as part of this. Once these are enabled, the PCI class driver requests the SVA bind. As part of this, the SMMUv3 driver programs the PASID tables, that is the context descriptors, with the CPU page table information. At this stage the device PASID is bound to the process address space, and any incoming transaction originating from the PCI device that carries the PASID can be linked to that particular process address space. Once the SVA binding finishes, the class driver can initiate a DMA.
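Here is a minimal sketch of that host-level bind sequence, using the kernel's IOMMU SVA API of that era (the three-argument form of iommu_sva_bind_device(); later kernels dropped the drvdata argument). The accel_bind_process() wrapper is a hypothetical name.

```c
#include <linux/iommu.h>
#include <linux/sched.h>

static int accel_bind_process(struct device *dev)
{
	struct iommu_sva *handle;
	u32 pasid;
	int ret;

	/* Advertise to the IOMMU layer that this device will use SVA. */
	ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_SVA);
	if (ret)
		return ret;

	/* Bind the current process's mm to the device: the SMMUv3 driver
	 * writes the CPU page-table base into a context descriptor. */
	handle = iommu_sva_bind_device(dev, current->mm, NULL);
	if (IS_ERR(handle)) {
		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_SVA);
		return PTR_ERR(handle);
	}

	/* The PASID the device must emit alongside its requester ID. */
	pasid = iommu_sva_get_pasid(handle);
	dev_info(dev, "bound to process, PASID %u\n", pasid);

	return 0;
}
```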
It will first program a DMA buffer, which can be a user space pointer. Once the device starts the DMA, it first sends an ATS request. In the case where the user space pointer is not resident in memory, the ATS request will fail and the ATS completion will be sent back to the device indicating a translation failure. If PRI is enabled on the PCI device, the device will then send a page request. This request goes to the translating agent, the ARM SMMUv3. The SMMUv3 implements a PRIQ, and this PRIQ is filled with the information from the PRI TLP. Once the PRIQ is filled, an interrupt is raised, which is serviced by the SMMUv3 driver running in the host kernel. As part of this interrupt handler, the SMMUv3 driver invokes the kernel page fault handler so that the page fault can be handled and the CPU page tables populated. Once the CPU page tables are populated and there is a correct VA-to-PA mapping, a successful PRI response has to be sent to the PCI device: the SMMUv3 driver issues a PRI response command, and the SMMUv3 hardware sends a successful page response to the device. Once the PRI has succeeded, the device sends the ATS request once again; this time the ATS succeeds with the VA-to-PA translation, and the DMA workload completes.

Now let us talk about virtualization. Virtualization is one of the key enabling technologies in infrastructure platforms. It allows sharing of resources, be it IO resources or CPU resources, between various virtual machines. IO virtualization techniques such as PCI pass-through or emulation allow IO devices to be shared among virtual machines. PCI pass-through is one of the most common IO virtualization techniques: it allows a PCI device to be assigned to a virtual machine, so that a class driver running in the guest controls the PCI device directly and user space applications in the virtual machine can program workloads onto it.

On the right is a software layout depicting the software stack running at the various exception levels on an ARM64 system. The hardware layer consists of the PCI accelerator device and the ARM SMMUv3. These two are controlled by their respective drivers running in the host kernel at EL2. The host kernel also runs the KVM stub, which provides support for hosting virtual machines using a virtual machine manager such as the KVM tool or QEMU. The VFIO framework in the host kernel provides support for direct device assignment, or PCI pass-through, whereby the PCI accelerator device can be assigned to the guest kernel. The KVM tool virtual machine manager, running in host user space, needs to implement support for VFIO, and it also needs to implement the back-end IOMMU driver. A front-end IOMMU driver in the guest kernel, running at EL1, provides DMA remapping for the device drivers in the guest. Any guest application running at EL0 can then program a workload via the PCI class driver into the PCI accelerator device assigned to this virtual machine.

So the key components of this entire software stack are, first, the IO virtualization technique, PCI pass-through, which VFIO PCI enables; a minimal sketch of that assignment flow follows below.
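As a concrete reference for the pass-through piece, this is roughly the classic VFIO device-assignment sequence a VMM performs, following the kernel's VFIO documentation; the group number and the PCI address are placeholders, and all error handling is omitted.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int assign_device(void)
{
	/* The container represents one IO address space... */
	int container = open("/dev/vfio/vfio", O_RDWR);
	/* ...and the group is the unit of isolation ("26" is a placeholder). */
	int group = open("/dev/vfio/26", O_RDWR);

	/* Put the group into the container and select the IOMMU backend. */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);

	/* Get a file descriptor for the PCI device itself; the VMM can now
	 * access its config space, BARs and interrupts, and map guest RAM
	 * for DMA with VFIO_IOMMU_MAP_DMA. */
	return ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
}
```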
The second key component is the DMA remapping capability in the guest via nested translation, supported with the help of the virtio-iommu driver: the front-end virtio-iommu driver runs in the guest kernel while the back-end virtio-iommu implementation runs in the virtual machine manager. And third, as part of this software stack any DMA faults also have to be handled, and these DMA faults are handled through the VFIO interface. This software stack provides the basis for implementing SVA in a virtual machine environment.

The current proposed design of SVA is based on the VFIO and IOMMU API changes that were posted by Eric, and a big thanks to him for this work. These changes implement the VFIO and IOMMU user APIs that enable shared virtual addressing from a virtual machine manager. There is also an ongoing IOMMU user API proposal by Kevin, which covers the user API implementations needed to request the various pieces of shared virtual addressing, such as programming the stage 1 page tables into the IOMMU hardware, requesting TLB invalidations, and so on. This user API proposal also brings together the pieces for IOASID allocation and a management interface. The changes done in the virtio-iommu driver should mostly be independent of these IOMMU and VFIO user API changes. The additions proposed to the virtio-iommu specification as well as the driver include a separate set of requests for programming the stage 1 page tables, for requesting TLB invalidations, and for sending the page response back to the host; a couple of feature bits have also been added as part of this support.

Now let us walk through the flow of SVA in a virtual machine as per this current design, considering the DMA map use case where a device driver running in the guest kernel initiates DMA on a PCI device assigned to the virtual machine. The device driver first initiates a DMA, and as part of that the virtio-iommu driver programs the PASID tables, the stage 1 page tables, with the CPU page table information. It then sends an attach table request to the underlying back-end virtio-iommu driver. The back-end virtio-iommu driver running in the KVM tool processes this attach table request and prepares a VFIO structure containing this information. The VFIO layer then calls the relevant ioctls, such as the one for setting the PASID table, in order to program the stage 1 page tables; a hypothetical sketch of that call follows below. This call flow trickles down to the ARM SMMUv3 driver, which takes the stage 1 page table information, such as the TTBR and the translation control register settings, and programs this stage 1 configuration into the ARM SMMUv3 hardware.

Now the stage 1 page tables are programmed and the DMA can be initiated. When the DMA is initiated and the PCI device sends the ATS request, the ATS request fails if the page is not resident, and the ARM SMMUv3 hardware raises a fault. This fault information is parsed by the ARM SMMUv3 driver, which reports it to the fault handlers registered for the device. The VFIO interface parses this fault information, prepares a fault buffer, and programs it into a VFIO region, a new region added as part of the SVA implementation by Eric. Once this VFIO region is programmed with the DMA fault information, an eventfd is signalled.
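To make the set-PASID-table step above concrete, here is a purely hypothetical sketch of how the VMM might hand the guest's stage 1 (CD) table to the host through VFIO. It is modelled loosely on the posted, unmerged series: the struct layout, the field names, and the ioctl number below are all stand-ins, not the real uapi.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Hypothetical uapi, standing in for the one proposed in the series. */
struct pasid_table_config {
	uint32_t argsz;
	uint32_t format;	/* stage 1 table format, e.g. SMMUv3 CD table */
	uint64_t base_ptr;	/* guest PA of the CD (PASID) table */
};

/* Illustrative request number only; the real series defines its own. */
#define VFIO_IOMMU_SET_PASID_TABLE	_IO(VFIO_TYPE, VFIO_BASE + 18)

static int set_guest_pasid_table(int container_fd, uint64_t cd_table_gpa)
{
	struct pasid_table_config cfg = {
		.argsz    = sizeof(cfg),
		.format   = 1,			/* placeholder format value */
		.base_ptr = cd_table_gpa,
	};

	/* Hand the guest's stage 1 table to the host SMMUv3 driver, which
	 * programs it as the nested stage 1 configuration. */
	return ioctl(container_fd, VFIO_IOMMU_SET_PASID_TABLE, &cfg);
}
```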
Coming back to the fault path: this eventfd reaches the user space VMM, and the KVM tool's VFIO driver parses the DMA fault information and passes it into the virtio-iommu back-end. The virtio-iommu back-end in the KVM tool fills the virtio ring with this fault information and signals the virtqueue so that the front-end virtio-iommu driver can handle the fault. The guest IOMMU driver parses this DMA fault information and invokes the kernel page fault handler so that the CPU page tables can be updated for this particular virtual machine. Once the CPU page tables are updated, a successful response has to be sent back to the PCI device. The virtio-iommu driver in the guest prepares a page response and sends the page response request to the underlying back-end virtio-iommu driver; this is translated into a VFIO structure, and VFIO writes the page response into the DMA fault response VFIO region. This write is handled by the VFIO interface running in the host kernel, and the fault response is sent back to the IOMMU driver. If it is a successful page response, the IOMMU driver then sends the PRI response command for the PCI device. Unmap requests, or requests for invalidations, also go via the same flow: the virtio-iommu driver prepares the invalidation requests, these requests reach the host ARM SMMUv3 driver via the VFIO layer, and the TLB entries tagged with the PASID are flushed.

On the right we now see a consolidated flow diagram for this DMA mapping in a vSVA implementation. Before the passed-through PCI device initiates a DMA, the virtio-iommu driver programs the stage 1 page tables: it sends the attach table request, the set-PASID-table ioctl of VFIO is called, and the stage 1 page tables are programmed via the ARM SMMUv3 driver. When the DMA is initiated with a guest virtual address, that address may not be resident in memory, so when the ATS request reaches the ARM SMMUv3 it walks the stage 1 page tables but, in the absence of the right mapping, raises a fault. This fault information is sent back to the virtio-iommu driver running in the guest, which populates the CPU page tables with the right mapping as part of the page fault handling. Once the page fault is handled, the page response is sent back to the SMMUv3 with the help of the new virtio-iommu requests. This page response is parsed by the ARM SMMUv3 driver, and the PRI response command is written into the SMMUv3 command queue. As part of that PRI response command, a successful page response is sent back to the PCI device and the page fault handling is complete.

Now, looking at the current upstreaming status: the major work towards enabling SVA support in the guest kernel has been done in the virtio-iommu driver. Most of these changes are independent of the user API changes in VFIO or the IOMMU layer. They mostly include support for nested page tables, for handling the DMA fault from the host kernel, and for sending the page fault response from the guest kernel back to the host kernel. A couple of patch series posted some time back are listed here, and we are planning to publish the next version soon. We are also planning to incorporate into the virtio-iommu driver any changes arising from the dependencies on the IOMMU user API proposal.
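For reference, this is the shape of the virtio-iommu UNMAP request as found in the upstream uapi header (include/uapi/linux/virtio_iommu.h, abbreviated here); the invalidation and page-response requests proposed for vSVA follow this same head/payload/tail pattern on the virtqueue.

```c
#include <linux/types.h>

struct virtio_iommu_req_head {
	__u8				type;		/* e.g. VIRTIO_IOMMU_T_UNMAP */
	__u8				reserved[3];
};

struct virtio_iommu_req_tail {
	__u8				status;		/* written back by the device */
	__u8				reserved[3];
};

struct virtio_iommu_req_unmap {
	struct virtio_iommu_req_head	head;
	__le32				domain;		/* endpoint's domain ID */
	__le64				virt_start;	/* start of IOVA range to unmap */
	__le64				virt_end;	/* inclusive end of the range */
	struct virtio_iommu_req_tail	tail;
};
```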
The KVM tool changes are based on the virtio-iommu driver changes done by Jean-Philippe, present in this branch. A big thanks to Jean-Philippe for his overall guidance in implementing this SVA work. That is all I had in this presentation. Thank you so much.