Hello everyone, this is Jason Wang from Red Hat. Today I will be talking about how to achieve hyper-scalability with vDPA. Here is the outline: first we will review the vDPA architecture, then I will discuss the demand for hyper-scalability and its four major challenges, and at the end I will give a status update and a summary.

From the hardware perspective, a vDPA device contains the following parts. First, it should implement the virtio facilities: the features and the virtqueues. It may also implement some optional features. What's more important, the control path is done via a vendor-specific configuration interface, which also allows vendors to have some add-on features.

We have several vDPA devices right now. For example, we have the Intel N3000, which is built on a normal PCI device or virtual function; it can do both block and networking. We also have the Mellanox ConnectX-6, which is a vDPA implementation on top of their specific hardware architecture; it only supports networking so far. The third is virtio-pci: there we treat virtio itself as a vendor, so technically it supports whatever device type. We are also developing a user-space vDPA framework called VDUSE, which allows you to implement vDPA devices in user space; that work will start from block devices. What's more interesting is that we also have a vDPA simulator, which is currently used for development, and we have a plan to make it ready for the production environment; it supports block and networking. And we are pretty sure that there are more vDPA parents on the road.

From the virtio architecture, you can simply see vDPA as a kind of transport. As for the vDPA software architecture, the crucial concept is the vDPA bus, which allows several different vDPA drivers and vDPA parents to be attached. In order to support user-space drivers, we introduced the vhost-vDPA bus driver, which connects vDPA devices to the vhost subsystem and presents vhost devices to the user-space drivers. The major use case for this is virtualization, and it can also serve DPDK applications. The other part is the virtio-vDPA bus driver. It connects vDPA parents to the virtio kernel drivers, and it allows applications to use vDPA devices as if they were virtio devices. The main use case for this is containers and bare metal.
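To make the bus abstraction concrete, here is a trimmed-down sketch of the operations a vDPA parent implements, loosely modeled on struct vdpa_config_ops in the kernel's include/linux/vdpa.h; only a few representative ops are shown, so treat this as an illustration rather than the full kernel API.

```c
/* Kernel-style sketch: the contract between a vDPA parent driver and
 * the vDPA bus. The bus drivers (vhost-vDPA, virtio-vDPA) call these
 * ops; the parent maps them onto its vendor-specific control path. */
#include <linux/types.h>

struct vdpa_device;

struct vdpa_config_ops {
	/* Virtqueue setup: addresses come from the virtio/vhost driver. */
	int  (*set_vq_address)(struct vdpa_device *vdev, u16 idx,
			       u64 desc_area, u64 driver_area,
			       u64 device_area);
	void (*set_vq_num)(struct vdpa_device *vdev, u16 idx, u32 num);
	void (*kick_vq)(struct vdpa_device *vdev, u16 idx);

	/* Feature negotiation and device status, as defined by virtio. */
	u64  (*get_device_features)(struct vdpa_device *vdev);
	int  (*set_driver_features)(struct vdpa_device *vdev, u64 features);
	u8   (*get_status)(struct vdpa_device *vdev);
	void (*set_status)(struct vdpa_device *vdev, u8 status);

	/* Config space access, e.g. the MAC address of virtio-net. */
	void (*get_config)(struct vdpa_device *vdev, unsigned int offset,
			   void *buf, unsigned int len);
};
```

Because both bus drivers sit on top of the same ops, one parent driver can serve virtual machines through vhost-vDPA and containers or bare metal through virtio-vDPA without any change.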
As the density of workloads grows, we see the demand for hyper-scalability. For example, containerized workloads are becoming popular, and cloud vendors usually require scaling vDPA instances to 10K or even 100K. We also see requirements for fine-grained units of vDPA; for example, we may want to split or slice a VF into several vDPA instances. And then we also need flexibility: vDPA hardware instances should be provisioned on demand, and the hardware should have the ability to group its units dynamically. This can save a lot of resources.

So here are the main challenges. The first is how to achieve fine-grained instances on a lightweight basis. Then we need to provide a secure DMA context for each vDPA instance. We also need to think about how to scale the interrupts; the major limitation is PCIe, which only allows about 2K MSI-X entries. And finally, we need to provide an interface for the management layers to provision vDPA instances.

We will first discuss how to achieve lightweight vDPA instances. We have some basic methodology. The first point is that a vDPA instance should occupy as few resources as possible, which means both hardware resources and transport-specific resources. The second point is that we believe software can scale better, or at least more easily, than hardware, because we have much more flexibility in software; that usually means mediating some functionality in software is a must.

I will use virtio-net as an example. Let's see how it is implemented on PCIe. In this figure, you can see there is a guest operating system which has a virtio-net driver. It uses the transport driver to talk to the device model that is implemented in the VMM; in our case, that is the vDPA subsystem plus QEMU. The vDPA subsystem then talks to the real hardware; in this case, a virtio-net PCIe device, which implements the interface for the basic virtio facilities through PCIe BARs and capabilities.

If we try to scale virtio-net PCIe instances, it looks something like this: you simply have more virtio-net PCIe devices, where each device has its own PCIe BAR and capabilities, its own TX/RX queue pairs, and an optional control virtqueue (CVQ). Alternatively, we can do virtio-net via vDPA. It is similar to the virtio-net PCIe hardware, but the difference is that instead of using the PCIe transport, we use a vendor-specific control path as the transport, replacing the PCIe BAR and capabilities.

The first thing we can do is save the hardware control queue: we can present a software control queue in the vDPA subsystem, which means the CVQ feature is implemented in a vendor-specific way. vDPA still lets the guest assume that it has a control queue, but the vendor-specific driver translates the control queue commands to the vendor-specific interface. So we save one hardware control queue, and you can see that if we scale the number of vDPA instances, there is no need for a control queue in the hardware at all; those resources can be used for more TX/RX queue pairs.
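Here is a minimal sketch of that translation, assuming a hypothetical vendor hook vendor_set_mac(); the control header layout and the MAC constants are the ones defined by the virtio spec, but the mediation logic is purely illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Header of every virtio-net control-queue command (from the virtio spec). */
struct virtio_net_ctrl_hdr {
	uint8_t class; /* e.g. VIRTIO_NET_CTRL_MAC */
	uint8_t cmd;   /* e.g. VIRTIO_NET_CTRL_MAC_ADDR_SET */
};

#define VIRTIO_NET_CTRL_MAC          1
#define VIRTIO_NET_CTRL_MAC_ADDR_SET 1
#define VIRTIO_NET_OK                0
#define VIRTIO_NET_ERR               1

/* Hypothetical vendor hook: program the MAC through the vendor-specific
 * control path (registers, firmware mailbox, ...). Returns 0 on success. */
int vendor_set_mac(const uint8_t mac[6]);

/* Software CVQ: the guest still sees a standard control queue, but the
 * command is mediated here instead of by a hardware CVQ. */
uint8_t handle_cvq_command(const void *buf, size_t len)
{
	struct virtio_net_ctrl_hdr hdr;

	if (len < sizeof(hdr))
		return VIRTIO_NET_ERR;
	memcpy(&hdr, buf, sizeof(hdr));

	if (hdr.class == VIRTIO_NET_CTRL_MAC &&
	    hdr.cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET) {
		const uint8_t *mac = (const uint8_t *)buf + sizeof(hdr);

		if (len < sizeof(hdr) + 6)
			return VIRTIO_NET_ERR;
		return vendor_set_mac(mac) ? VIRTIO_NET_ERR : VIRTIO_NET_OK;
	}
	return VIRTIO_NET_ERR; /* other classes not mediated in this sketch */
}
```

Everything the guest sees is still standard virtio-net; only the last hop changes, from a hardware control queue to this mediation layer.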
Another approach is to introduce the concept of a management device. The management device provides the vendor-specific control path and transport on behalf of the managed devices. That means there is no direct vendor-specific control path for a managed vDPA device; instead, whenever the driver or the operating system tries to configure or probe a vDPA instance, it must be done via the management device. This helps keep each vDPA instance as minimal as possible, but it introduces complexity in the management device; as you can see, it's a balance. Another drawback is that it probably complicates the software path: we need some synchronization in the management device driver to serialize concurrent requests for configuring the managed vDPA devices, and that also means we probably need some QoS in the management interface.

You can see that in this way, as we scale to more vDPA instances, all the transport-specific commands are routed to the management device driver. The management device driver talks to the management function implemented in the vendor-specific management device, and the management device is in charge of implementing those functions or dispatching the commands to the managed vDPA devices. So no direct vendor-specific or transport-specific resources need to be allocated to each managed vDPA device, which helps to save resources.

Based on the previous discussion, it's not hard to infer that we can add this support to the virtio spec. Basically, we can introduce transport-specific support for a device command capability; or we can introduce the management device in a transport-specific way; or we can introduce the virtqueue as a transport, which is basically something like a management queue.

So here is the device-specific command capability. We can introduce a new capability which is used to accept commands from the driver. The actual commands and the command-specific data are device-specific, but for virtio-net it's not hard to imagine that this is a replacement for, or an alternative to, the control queue. If the hardware implements this capability, there is no need for it to present a hardware control queue. But you can see that it's just a partial transport, because a control queue can carry variable-length command-specific data, which cannot be done through a PCIe BAR. So we have less flexibility, but usually 256 bytes of command-specific data is sufficient, and it also means we save one virtqueue.

We can also introduce a managed device capability in the virtio spec. It's as simple as introducing a new capability containing a device selector, which selects the managed device that all the other PCIe capabilities refer to. For example, if we want to configure the virtqueue 0 address for managed device 1, first we write 1 to the device selector, then we write 0 to the queue selector, and then we write the actual queue address: the queue descriptor low and the queue descriptor high. This interface can also be used for configuring the management device itself: we just write 0 as the device selector, and then we can probe or configure the management device as device 0. As you can see, we would implement a management virtio-pci driver and route all the configuration and probing commands from the vDPA core to it, and the management virtio-pci driver would talk to the management capability implemented in the virtio PCIe device. This helps to save all the transport-specific resources.

Alternatively, we can have the virtqueue as a transport, which means a dedicated virtqueue for the management device. Here is the basic layout of the commands: the most important part is still the device selector, and then we have the class, the command, and the data that is specific to each command. The commands are basically used to transport the basic virtio facilities for the managed devices, while the management device itself is still probed over another transport such as PCIe. The advantage of this approach is that it is not specific to the PCIe transport, and it is more flexible than the BAR- or capability-based approach because the length of the data is variable. But it has several disadvantages: it's more complicated, and it may require quality of service, because it could be used by several thousands of vDPA instances in parallel.

Here are the basic commands that would be implemented in the management queue: commands for getting the features, getting and setting the device status and config space, reset and config generation, and the virtqueue configuration. What's also important is that we need to set and get the MSI-X entries per managed device, and we need the device to implement a notification area which can be mapped directly into user space. So if we adopt the management queue, all the device configuration requests are still routed from the vDPA core to the management virtio-pci driver, and the management virtio-pci driver then talks to the device through the management virtqueue.
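To illustrate, here is one hypothetical wire format for such a command, following the description above; the field names, widths, and class values are purely illustrative and are not taken from any final spec.

```c
#include <stdint.h>

/* Hypothetical layout of a command on the management virtqueue: a
 * device selector naming the managed device, a class/command pair,
 * and variable-length, command-specific data. */
struct mgmt_vq_cmd {
	uint16_t device_selector; /* managed device ID, 0 = the mgmt device */
	uint8_t  class;           /* command class, see below */
	uint8_t  command;         /* command within the class, e.g. get/set */
	uint8_t  data[];          /* variable-length command-specific data */
};

/* Illustrative command classes mirroring the list above. */
enum mgmt_cmd_class {
	MGMT_CLASS_FEATURES = 0, /* get device features */
	MGMT_CLASS_STATUS   = 1, /* get/set status, reset */
	MGMT_CLASS_CONFIG   = 2, /* config space access, config generation */
	MGMT_CLASS_VQ       = 3, /* virtqueue address/size/enable */
	MGMT_CLASS_MSIX     = 4, /* per-device MSI-X entry assignment */
};

/* Device-written completion status, returned in a separate buffer. */
struct mgmt_vq_status {
	uint8_t status; /* 0 = OK, non-zero = error */
};
```

The variable-length data field is exactly what gives this approach more flexibility than the BAR-based command capability, at the cost of the complexity and QoS concerns mentioned above.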
Okay, so now we can talk about the second challenge: how to provide a secure DMA context for vDPA. The main requirement is to isolate DMA among vDPA instances, and there are three methods. The first is to leverage a transport- or platform-specific method; the second is to reuse a vendor-specific facility; and the last is to try to standardize the isolation at the virtio level.

For the transport-specific method, we can take PCIe as an example: it's not hard to imagine using the Process Address Space ID (PASID). It means we probably need to assign a PASID per vDPA instance, or even a PASID per queue per vDPA instance. That requires the platform IOMMU to support PASID, and the vDPA vendor is in charge of implementing a vendor-specific way of configuring the PASID. This is not hard for the vendor, because it leverages platform features, but the vendor needs to wait for those features to be supported by the platform first. And it's also platform dependent; for example, PASID is tied to PCIe.

Another method is the vendor-specific method. An example is to have a device MMU, which means the device has its own MMU that is used for translating IO virtual addresses to transport-specific DMA addresses. DMA can then be isolated at the vDPA instance level: we can tag different vDPA instances as different address spaces. The isolation is done in the device itself instead of depending on the platform-specific IOMMU, and it can choose to work with or without any transport-specific method; for example, it can co-work with PASID or not. The advantage is that it's platform independent and much more flexible; the drawback is that it's a little bit complicated for the vendor to implement.

Actually, we can borrow those ideas and standardize them in the spec. For example, we can add spec support for the transport-specific DMA isolation method, or we can even introduce the DMA isolation at the virtio level. For the transport-specific support, we can add spec support for PASID assignment: it's as simple as introducing a PCIe PASID capability plus an interface for configuring the PASID per queue. In that case, we can assign a PASID to each virtio-net instance, or to each virtqueue, and if it is used with a PASID-capable IOMMU, we can achieve DMA isolation at the vDPA instance level. It's simple, because it leverages platform features, and it's also standard, because it is in the spec. The drawback is that it is platform dependent, but that's probably not a big issue. Or we can even consider implementing the device MMU in the spec. The DMA is then translated in two stages: in stage one, the translation is done by the device MMU, which translates the IO virtual address into an intermediate address; in stage two, the platform IOMMU translates the intermediate address into the physical address. We can also have two possible interfaces: queue-based or page-table-based. The advantage is that it's platform independent and can work with any transport, so it's more flexible, but it will be more complicated to implement.

Another interesting topic is how to scale the interrupts. We may hit some transport limitations: for example, PCIe only allows 2K MSI-X entries per PCIe device, and the MMIO transport does not even have MSI-X support. So what we want is either to introduce MSI-X support in a general virtio way, or to try to scale the number of MSI-X entries. For vDPA vendors, it's basically about sharing MSI-X entries in a vendor-specific way: we need to introduce an interface for the drivers to configure, mask, or unmask the MSI-X entries vendor-specifically, instead of using the standard PCIe MSI-X table. For the spec, it's as simple as introducing an MSI-X configuration capability, and in that capability we have the functions to program the MSI-X vectors per device and per virtqueue.

Okay, so the last requirement is vDPA provisioning. In the vDPA framework, we have already implemented these provisioning interfaces via netlink: the management layer is in charge of specifying some attributes during device creation, and the provisioning is done at the vDPA parent level. We can borrow those ideas and implement them in the virtio spec. Basically, it means we need to extend the current managed device capability by allowing device provisioning to be done via that interface: we can introduce some dedicated registers for storing the config, registers to create or destroy devices, and a dedicated register for reading back the command status. For example, if we want to create a vDPA virtio-net device, first we write the configuration to the config registers, and then we write 1 to the create register. If we then read 0 from the create register, it means the managed virtio-net device has been successfully created.
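To make that flow concrete, here is a minimal sketch in C, assuming made-up register offsets (REG_CFG_BASE, REG_CREATE) and environment-provided MMIO accessors; none of these names come from an actual proposal.

```c
#include <stdint.h>

/* Hypothetical register offsets inside the management device's BAR;
 * the real layout would be defined by the spec extension. */
#define REG_CFG_BASE 0x00 /* provisioning config, e.g. MAC, #queue pairs */
#define REG_CREATE   0x40 /* write 1 to create; reads back 0 on success */

/* MMIO accessors assumed to be provided by the environment. */
void     mmio_write32(uint32_t off, uint32_t val);
uint32_t mmio_read32(uint32_t off);

/* Create one managed virtio-net device: write the provisioning config
 * into the config registers, kick the create register, then poll it. */
int provision_vdpa_netdev(const uint32_t *cfg, unsigned int n_words)
{
	for (unsigned int i = 0; i < n_words; i++)
		mmio_write32(REG_CFG_BASE + 4 * i, cfg[i]);

	mmio_write32(REG_CREATE, 1);

	/* Reading 0 back means the managed device was created successfully. */
	while (mmio_read32(REG_CREATE) != 0)
		;
	return 0;
}
```

A real driver would of course bound the polling loop with a timeout and read a dedicated status register to distinguish errors from success.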
Okay, so here is the summary of these discussions. We discussed several approaches to scale the vDPA instances, the secure DMA contexts, the interrupts, and the provisioning. That does not mean these are the only approaches for achieving hyper-scalability; other technologies may also help. For example, we can introduce scheduling technologies, which means we can schedule several different guests or containers onto a single vDPA device. We can also leverage transport-specific techniques such as the shared virtqueue, which means a single virtqueue can be used by several different DMA address spaces. But none of those approaches comes for free, so it's up to the vendor to balance the pros and cons and choose the suitable way to achieve hyper-scalability.

Here are some references to the RFCs that have been posted on the list for the discussions in this talk; you are welcome to review and comment on those series. And we have prepared a GitLab website which contains all the necessary information for vDPA. Please visit the website for more information. Thanks.