Hello everyone, my name is Hao Wu. I'm from the Intel virtualization enabling team. My topic today is scalable work submission in device virtualization.

Okay, let's start. This is the agenda of the talk. First of all, I would like to discuss scalability in virtualization for different types of devices, including dedicated work queue and shared work queue based devices, and the related challenges. Then I will introduce the new ENQCMD instruction on Intel platforms and its virtualization support for scalable work submission. Finally, I will use shared virtual addressing (SVA) workload submission as one example to demonstrate how the whole picture looks with ENQCMD.

Let's start with the first part: scalability in device virtualization. There are two devices on this slide. On the left side is a PCIe SR-IOV based device, which supports multiple virtual functions. Each virtual function is an independent interface which can be assigned to a different virtual machine. On the right side is an Intel Scalable IOV device. It supports multiple ADIs, Assignable Device Interfaces, and each ADI can be assigned to a different virtual machine. A dedicated work queue is implemented in the VFs and ADIs of these two devices, so they provide scalability by hard partitioning the hardware resources. It can be quite difficult to increase the number of virtual functions or ADIs due to limited hardware resources on some devices. This is a scalability limitation for dedicated work queue based devices. So we implement a shared work queue instead, in cases where we cannot hard partition the device resources. And if we do that, how do we ensure scalability in device virtualization?

Okay, let's see what a shared work queue based device looks like. The shared work queue is used by multiple users on the host. The typical usage of a shared work queue is to support shared virtual addressing, SVA. SVA allows a device to use CPU virtual addresses for DMA
operations. The device can also use the PASID, the process address space ID, to distinguish the contexts of different workloads, and DMA address translation is performed by the IOMMU at requester ID plus PASID granularity.

Now, shared work queues in device virtualization. For sure, we can put a shared work queue into a virtual function and then assign it to a virtual machine, allowing different users inside the virtual machine to share this interface, just like the device on the left side. But we can also share the same device interface with users in both the host and virtual machines, like the device on the right side. Actually, there is no hard limit on the number of users of a shared work queue, so it can provide better scalability by adding more and more virtual devices on the same shared work queue. Of course, one device can implement a dedicated work queue and a shared work queue together.

Challenges in device virtualization. When shared virtual addressing, SVA, is used, we face a challenge here. Users in virtual machines are not aware of host PASIDs, so workloads are submitted together with guest PASIDs. But the device and the IOMMU use host PASIDs for DMA operations. So how do we convert a guest PASID to a host PASID in device virtualization? The new instruction ENQCMD is introduced to address this gap.

Okay, let's move to the ENQCMD instruction introduction. ENQCMD is a new instruction on Intel platforms that atomically submits a workload to a device. It obtains the PASID from the IA32_PASID MSR and then enqueue-stores 64 bytes of command data to an enqueue register implemented in device MMIO. This is the format of the command data from the spec. It includes a device-specific command area, which can just be a work descriptor, together with the PASID, which indicates the context of this work. The IA32_PASID MSR is managed by the XSAVES feature set as the PASID supervisor state component and is updated per context switch. After SVA process binding, the IA32_PASID MSR will reflect the PASID value associated
with the process. So when the ENQCMD instruction is used by a user space application, the application doesn't need to worry about the PASID: when you execute the instruction, it obtains it from the MSR automatically. This mechanism also prevents a malicious PASID from being written to the device by a user space application.

ENQCMD is a non-posted instruction which carries back a status indicating whether the command was accepted by the device or not. For example, a submission will fail to be accepted if the shared work queue is already full. This allows the submitter to know the submission status and retry if needed. ENQCMDS is similar to the ENQCMD instruction but only works in kernel space, and it obtains the PASID value from the command data directly.

Device requirements for ENQCMD: the Deferrable Memory Write is a non-posted request defined by the PCIe spec. In order to support the ENQCMD instruction, a device needs to support the Deferrable Memory Write completer capability, and all switch ports and root ports in the path are required to have Deferrable Memory Write routing enabled. The Intel Data Streaming Accelerator is the first device which supports ENQCMD. The latest driver patch set is under review; here is the link to the submission.

Let's move to ENQCMD virtualization. First of all, I want to introduce the non-root mode operation of this instruction. ENQCMD actually operates differently in non-root mode. ENQCMD and ENQCMDS obtain the guest PASID first, then perform the guest PASID to host PASID translation automatically, and then enqueue-store the command data with the host PASID to the device. Since non-root mode ENQCMD performs the guest PASID to host PASID translation automatically, it can close the gap we discussed on the previous slide for SVA workload submission.

Let's move to the next slide for more details about PASID translation. PASID translation is a new feature introduced in VMX for ENQCMD virtualization. It is enabled by
setting a secondary processor-based VM-execution control, and the PASID translation table is required to be linked by pointers in the VMCS. The PASID translation table is a two-level data structure, as you see on the right side: the PASID low and high directories and the PASID tables. Different fields of the guest PASID are used to select the PASID table entry, which contains the associated host PASID. When ENQCMD runs in non-root mode, the hardware uses this table for translation. If it fails to translate the guest PASID, a VM exit is triggered.

KVM is required to manage the PASID translation table for ENQCMD virtualization. KVM needs to update the translation on IOASID events. IOASID manages host PASIDs and their association with guest PASIDs, and it notifies users of PASID status changes. So KVM monitors the IOASID bind and unbind events for translation updates. You can find more details about this in another KVM Forum talk, titled "PASID Management in KVM". Here is a link to that session.

The PASID translation table is shared by all VMCSs. Any modification to this table requires a rendezvous operation: KVM needs to kick all the CPUs into root mode and block VM entry until the modification is done. This is a requirement from the SDM when modifying a data structure which is referenced by pointers in the VMCS and controls non-root mode operation.

PASID translation failure handling: a translation failure only happens when the guest is using an invalid guest PASID. A guest PASID can only be used for DMA operations after a host PASID has been associated with it. The reason is that the device and the IOMMU always use host PASIDs for DMA operations; they never know about guest PASIDs. So in this VM exit handling, we just set the ZF flag to 1 to indicate the failure to the guest and skip the instruction.

IA32_PASID MSR virtualization: as mentioned above, in non-root mode operation ENQCMD obtains the guest PASID from the IA32_PASID MSR, so we can simply
pass through this MSR to the guest directly. Since the IA32_PASID MSR is managed by XSAVES, we also need virtualization support for the XSAVES PASID supervisor state component, so that the guest can use this extension to update the MSR per context switch.

Okay, those are the major changes required to support ENQCMD virtualization. Next, I will use SVA workload submission as one quick example to demonstrate the flow. This is an example where a user uses ENQCMD to submit an SVA workload in the guest; hopefully this gives a basic idea of the workflow. Say the guest wants the device to write some data to a virtual buffer. It needs to prepare a work descriptor with the target buffer information, including the address, so a guest virtual address is filled into the work descriptor. Then the guest application can run ENQCMD to submit this workload to the device directly. The guest PASID is filled in from the IA32_PASID MSR and translated automatically to the host PASID, and the work descriptor is stored to the device together with the host PASID. The application needs to check the instruction status to make sure the submission was accepted by the device. If the device accepts this workload, it performs a DMA operation using the GVA and payload information indicated by the work descriptor, along with the host PASID. The IOMMU does the DMA address translation at requester ID plus PASID granularity, so the GVA is translated to an HPA for the DMA operation. This is just one example.

Okay, references. There is some reference documentation, including the kernel docs on shared virtual addressing with ENQCMD, the ENQCMD spec, the Intel Scalable IOV documentation, and the Intel Data Streaming Accelerator spec.

Development status: we already have ENQCMD native support merged into the latest kernel. The IOASID extension for notification is under review; v3 was submitted, and this is the link to the v3 patch set. The ENQCMD virtualization support will be submitted soon, and the code is under internal review now. Live migration
support is not covered yet, so it is on the to-do list.

This is a summary of the talk. Dedicated work queue based devices rely on hard partitioning of the resources and have a scalability limitation in virtualization. A shared work queue with ENQCMD support allows more scalable usage in device virtualization, and the same device interface can be shared by multiple users in the host and virtual machines. Additional hardware support is required for ENQCMD virtualization, for example PASID translation and the XSAVES extension for the PASID state, and corresponding changes in the VMM are also required.

Okay, that's all for my sharing in this talk. Thanks for watching.
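To make the command-data discussion above concrete, here is a minimal C sketch of building the 64-byte ENQCMD payload: a device-specific descriptor plus the PASID field. In real ENQCMD the CPU fills the PASID in from the IA32_PASID MSR automatically; here a `build_command` helper (a name invented for this sketch) does it by hand, assuming the PASID occupies bits 19:0 of the first dword, with the exact bit layout treated as illustrative rather than authoritative.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* 64-byte command data as submitted by ENQCMD/ENQCMDS. */
typedef struct {
    uint8_t bytes[64];
} enq_cmd;

/* Illustrative construction of the command data: place the PASID
 * (assumed to live in bits 19:0 of the first 32-bit word) followed
 * by the device-specific work descriptor area. */
enq_cmd build_command(uint32_t pasid, const uint8_t *desc, size_t len)
{
    enq_cmd c;
    memset(&c, 0, sizeof(c));

    uint32_t w0 = pasid & 0xFFFFFu;      /* PASID is at most 20 bits */
    memcpy(c.bytes, &w0, sizeof(w0));

    if (len > sizeof(c.bytes) - 4)       /* descriptor fills the rest */
        len = sizeof(c.bytes) - 4;
    memcpy(c.bytes + 4, desc, len);
    return c;
}
```

The point of the sketch is only the split the talk describes: one field identifying the address space (PASID) and one device-specific area carrying the actual work descriptor.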
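The talk also describes ENQCMD's non-posted status: the device can reject a submission when the shared work queue is full, and the submitter checks the status (ZF) and retries. Here is a small host-side simulation of that pattern; the bounded queue, the `swq_submit`/`submit_with_retry` names, and the queue depth are all inventions of this sketch, with the boolean return standing in for the ZF status the real instruction reports.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define SWQ_DEPTH 4                      /* arbitrary depth for the sketch */

/* A stand-in for a device's shared work queue. */
typedef struct {
    uint8_t slots[SWQ_DEPTH][64];        /* 64-byte command data per slot */
    int     used;
} shared_wq;

/* Returns true if the "device" accepted the command; false models the
 * queue-full case, which real ENQCMD reports via ZF=1. */
bool swq_submit(shared_wq *wq, const uint8_t cmd[64])
{
    if (wq->used >= SWQ_DEPTH)
        return false;
    memcpy(wq->slots[wq->used++], cmd, 64);
    return true;
}

/* The check-status-and-retry pattern the talk describes. */
bool submit_with_retry(shared_wq *wq, const uint8_t cmd[64], int max_tries)
{
    for (int i = 0; i < max_tries; i++)
        if (swq_submit(wq, cmd))
            return true;
    return false;                        /* caller can back off and retry later */
}
```

A real submitter would use the instruction (or a compiler intrinsic) against the device's MMIO enqueue register rather than a memory queue; the sketch only captures the accept/reject-and-retry control flow.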
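Finally, the two-level PASID translation walk described for non-root mode ENQCMD can be sketched as a directory-then-table lookup. The index widths and entry layout below are assumptions chosen for illustration (a 20-bit guest PASID split into a 10-bit directory index and a 10-bit table index), not the architectural encoding; a failed walk corresponds to the VM exit the talk covers.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define PASID_BITS   20
#define TBL_IDX_BITS 10                        /* low bits index the PASID table */
#define DIR_ENTRIES  (1u << (PASID_BITS - TBL_IDX_BITS))
#define TBL_ENTRIES  (1u << TBL_IDX_BITS)

typedef struct {
    uint32_t valid;                            /* entry holds a host PASID */
    uint32_t host_pasid;
} pasid_entry;

typedef struct {
    pasid_entry *tables[DIR_ENTRIES];          /* directory of PASID tables */
} pasid_dir;

/* Walk directory -> table; on success fill *host and return true.
 * Returning false models the translation-failure VM exit, after which
 * the VMM reports failure to the guest (ZF=1) and skips the instruction. */
bool translate_pasid(const pasid_dir *dir, uint32_t guest_pasid, uint32_t *host)
{
    if (guest_pasid >= (1u << PASID_BITS))
        return false;

    uint32_t di = guest_pasid >> TBL_IDX_BITS;     /* directory index */
    uint32_t ti = guest_pasid & (TBL_ENTRIES - 1); /* table index */

    const pasid_entry *tbl = dir->tables[di];
    if (tbl == NULL || !tbl[ti].valid)
        return false;                              /* no host PASID bound yet */

    *host = tbl[ti].host_pasid;
    return true;
}
```

In the real design this structure is walked by hardware via pointers in the VMCS, which is why, as the talk notes, KVM must update it only after bringing all CPUs back to root mode.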