Hi, I'm Maxime. I work in the networking team at Red Hat. I mainly contribute to DPDK, where I maintain the vhost library and the virtio driver. And Adrián, who is joining soon, has recently been working mainly on vDPA enablement in Kubernetes.

I will start this presentation by introducing the virtio and vDPA technologies and provide an overview of the kernel vDPA framework. Then, Adrián will present how vDPA support is being enabled in Kubernetes, which will include an end-to-end demonstration. Finally, we will provide an update on the current status and, if time permits, answer any questions you may have.

So, virtio is an open specification that standardizes different types of interfaces for virtual machines. It defines the layout of the device and all the interactions that happen between a device and its driver. First, virtio has a notion of features, which are negotiated between the device and the driver to enable backward compatibility. Features can either be generic to all device types, for example features about the ring layout or IOMMU support, or specific to a device type, like the MTU feature in the case of virtio-net devices. The specification also defines the notion of virtqueues, which are the rings used to exchange data between the device and its driver. And finally, there is a notion of transport, which is mainly PCI in our case, but other buses are supported, like MMIO. As of the virtio 1.1 revision, there are 24 different types of devices specified, and each one has its own specificities. In this talk, we will only focus on the virtio PCI networking device.

So here we have the layout of the virtio-net PCI device. We can split it into two main parts: the control path in red and the data path in green. The control path is composed of PCI BARs, which include several structures, such as the common config. This is a generic structure that contains fields used for feature negotiation, to specify the number of queues supported by the device, and where we can find the device status or the virtqueue addresses. Then we also have the notification-related config and the device config. The device config is device-type specific; in our case it is called the virtio-net config, and its fields contain information such as the device MAC address, the link state, or the maximum MTU. And finally, we have the ISR config, which is used to distinguish between normal virtqueue interrupts and device config interrupts. The virtio-net control path also includes an optional control virtqueue, which is used to configure different features related to the networking device, such as multiqueue, MAC address, or VLAN filtering.

Then in green, we have the data path. For virtio-net, it works with queue pairs, a pair being composed of one receive queue and one transmit queue. The specification mandates at least one queue pair, but multiqueue is possible if the requirements are met, both on the driver and the device side.

Now that we have a better idea of the layout of virtio-net devices, let's see how they are handled in software. On this slide, we can see two possible uses of virtio-net in the scope of virtualization. On the left, we have a full kernel solution where the kernel's virtio-net driver is used, providing a netdev to the guest. On the host side, we have the vhost-net backend, which implements the device handling of the data path and lives in the kernel.
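To make the device-specific config part concrete, here is an abridged sketch of the virtio-net device config layout as laid out by the Virtio specification; only a subset of fields is shown, and plain integer types stand in for the spec's little-endian types:

```c
/*
 * Abridged sketch of the virtio-net device-specific config space, following
 * the layout described in the Virtio specification (compare struct
 * virtio_net_config in the Linux uAPI headers). Only a subset of fields is
 * shown, and each field is only valid if the corresponding feature bit was
 * negotiated between the device and the driver.
 */
#include <stdint.h>

struct virtio_net_config_sketch {
    uint8_t  mac[6];              /* device MAC address (VIRTIO_NET_F_MAC) */
    uint16_t status;              /* link state (VIRTIO_NET_F_STATUS) */
    uint16_t max_virtqueue_pairs; /* RX/TX queue pairs (VIRTIO_NET_F_MQ) */
    uint16_t mtu;                 /* maximum MTU (VIRTIO_NET_F_MTU) */
};
```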
In this kernel solution, QEMU handles the control path, which covers both the PCI BARs and the control queue handling, and translates it into the vhost kernel protocol to configure the backend. On the right side, we have a full userspace solution with DPDK: the virtio-net PMD, which lives in the guest userspace, and the DPDK vhost-user backend in the host userspace. As in the previous solution, QEMU handles the control path, but this time it translates it into the vhost-user protocol.

Both solutions have their pros and cons. The kernel solution is the default and more generic one, while the userspace solution is more specialized for use cases requiring high throughput and low latency; it is commonly used, for example, in NFV. But in any case, while these software solutions have a lot of advantages, by providing standardized interfaces and features like live migration, they have a cost in terms of resource utilization on the host, leaving fewer resources for the end application. Also, even with a full userspace solution, the performance will be significantly lower than when assigning SR-IOV VFs directly to the guest or to the container. The question now is: how can we improve virtio performance while keeping its advantages? The answer is by offloading virtio device handling to the hardware.

There are two ways to offload virtio-net handling to the hardware. The first one is full virtio offload. It means that both the data path and the control path are implemented in hardware, for example in SR-IOV VFs. The clear advantage of such a solution, as you can see in the diagram, is the simplicity of the software side. You just do device assignment of the virtio device, like it is done today for regular SR-IOV VFs, which usually means binding the device to VFIO and passing it to QEMU. While this solution is simple from a software point of view, it has several limitations. First, device live migration is not easily possible, because the virtio specification does not cover dirty page tracking. Then, it implies that the hardware vendor must implement the virtio control path in its hardware, which might sometimes be difficult, as the control path may not be compatible when extending an existing device implementation. And more generally, it provides less flexibility. For example, imagine you find a hardware bug in the implementation of a given feature: you will have no other choice than implementing quirks in the guest virtio driver to disable it, which would break the standard-driver promise. If we look at the availability of such devices, we can already find them in some bare-metal instances of Alibaba Cloud, and also in the BlueField-2 SmartNIC from NVIDIA.

The alternative solution to overcome these limitations is vDPA, which stands for virtio data path acceleration. In this case, only the data path is offloaded to the hardware, which is enough to address the limitations of the full software implementation. The control path is handled by the host thanks to a dedicated framework, which is available both in the kernel and in DPDK; in this presentation, we will only focus on the kernel framework. This framework acts as a translator between the virtio specification control path and the hardware NIC control path, which is vendor specific. It aims at providing back the flexibility that we lose with a full hardware offload solution, by making it easier for the hardware designer, as the control path does not have to be fully compliant with the spec.
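That translation happens behind a table of callbacks that each vDPA device driver hands to the vDPA bus, as described in more detail just below. The following is a heavily abridged sketch modelled on struct vdpa_config_ops from the kernel's include/linux/vdpa.h; the callback names and signatures are approximate and have changed across kernel versions:

```c
/*
 * Heavily abridged sketch of the kind of callback table a vDPA device driver
 * provides to the vDPA bus (modelled on struct vdpa_config_ops in the
 * kernel's include/linux/vdpa.h; names and signatures are approximate and
 * vary between kernel versions). Each callback is where a generic virtio
 * control-path operation gets translated into vendor-specific programming.
 */
#include <stdbool.h>
#include <stdint.h>

struct vdpa_device; /* opaque handle, only for this sketch */

struct vdpa_ops_sketch {
    /* Feature negotiation */
    uint64_t (*get_features)(struct vdpa_device *vdev);
    int      (*set_features)(struct vdpa_device *vdev, uint64_t features);

    /* Virtqueue setup: addresses, size, enablement */
    int  (*set_vq_address)(struct vdpa_device *vdev, uint16_t idx,
                           uint64_t desc_area, uint64_t driver_area,
                           uint64_t device_area);
    void (*set_vq_num)(struct vdpa_device *vdev, uint16_t idx, uint32_t num);
    void (*set_vq_ready)(struct vdpa_device *vdev, uint16_t idx, bool ready);

    /* Device status and device-specific config space */
    uint8_t (*get_status)(struct vdpa_device *vdev);
    void    (*set_status)(struct vdpa_device *vdev, uint8_t status);
    void    (*get_config)(struct vdpa_device *vdev, unsigned int offset,
                          void *buf, unsigned int len);
};
```

Each vDPA parent driver, such as the mlx5 or IFC drivers mentioned below, supplies its own implementation of such callbacks for its hardware.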
The framework also enables live migration, either directly in hardware if the device implements a dirty page tracking mechanism, or with software-assisted live migration if such functionality is not supported by the device. And finally, in case a bug is found in the hardware design, it is possible to restrict features either at the vDPA driver level or via the QEMU command line, which both live in the host, so the guest driver is left unmodified.

Hardware supported today in the kernel includes the NVIDIA ConnectX-6 NICs and the BlueField-2 devices. We also have the Intel devices that use the IFC vDPA driver, and soon the full virtio offload devices I listed on the previous slide will also be supported, as work is ongoing to provide a vDPA driver for these devices.

Now let's have a look into the details of the kernel vDPA framework architecture. The core of the framework is the vDPA bus. The goal of this virtual bus is to provide a communication protocol to connect vDPA bus drivers and vDPA device drivers. The vDPA device drivers are registered on the vDPA bus by the parent device driver; for example, in the case of NVIDIA, the mlx5 vDPA devices are registered by the mlx5 core driver. These device drivers implement a set of operations called by the vDPA bus to configure the device, such as callbacks to set and get virtio features, callbacks to provide virtqueue addresses, virtqueue sizes, and so on. This is where the translation from generic virtio controls into vendor-specific controls happens. Among the available drivers, we find the drivers for the devices mentioned earlier and also a vDPA simulator. This driver is software-only and is used for testing purposes; basically, it loops back packets from its transmit queue to its receive queue.

On the vDPA bus driver side, we currently have two options. The first one is virtio-vDPA. The diagram is a bit simplified: in reality, the green box contains a virtio-vDPA driver that plugs into the vDPA bus, on top of which you will find the regular virtio-net driver that you usually find in guest kernels. These drivers enable providing a virtio-net netdev to the host. The other option is vhost-vDPA. Its goal is to provide a unified interface to userspace applications, such as QEMU or DPDK-based applications via the virtio-user PMD. This driver is very similar to vhost kernel, as it reuses most of its protocol, but it also adds a few protocol requests to set up things that are vDPA-specific, like DMA mapping or virtio-net config space configuration.
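To give a feel for that character-device interface, here is a minimal sketch of a userspace program opening a vhost-vDPA node and querying its features; the device node name is illustrative, and everything beyond feature negotiation (virtqueue setup, DMA mappings, device status) is left out:

```c
/*
 * Minimal sketch of touching a vhost-vDPA char device from userspace; the
 * device node name is illustrative. It only queries the device features,
 * which is enough to show that the interface reuses the vhost ioctls. A real
 * consumer such as QEMU or the virtio-user PMD also sets up the virtqueues,
 * the DMA mappings and the device status through the same file descriptor.
 */
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vhost.h>

int main(void)
{
    int fd = open("/dev/vhost-vdpa-0", O_RDWR);
    if (fd < 0) {
        perror("open /dev/vhost-vdpa-0");
        return 1;
    }

    uint64_t features = 0;
    if (ioctl(fd, VHOST_GET_FEATURES, &features) == 0)
        printf("device features: 0x%" PRIx64 "\n", features);
    else
        perror("VHOST_GET_FEATURES");

    close(fd);
    return 0;
}
```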
Adrián, do you want to switch to your presentation?

Okay, so now that we know how vDPA works, we're going to talk about how it is integrated into Kubernetes. Before we go into the details, let's ask ourselves: why? Why would we want to integrate vDPA into Kubernetes? Isn't virtio only about virtualization? Well, we see a number of use cases. First, as Kubernetes becomes the standard cloud orchestration technology, we're seeing more mixed environments where virtualized and containerized applications live side by side. For those cases, vDPA can provide a unified data plane solution, reducing operational complexity; KubeVirt is a good example of this. Also, vDPA can provide accelerated interfaces for virtualization-based container isolation technologies, such as Kata Containers. Using virtio instead of SR-IOV in this case enables smaller kernels, reducing memory footprint, boot time, and attack surface. And last but not least, vDPA can provide accelerated yet standard secondary interfaces to CNFs. That way, they can use the vendor-agnostic virtio-user PMD and only certify their application once, while keeping good performance.

So, let's look at this use case in more detail. CNFs typically use SR-IOV interfaces to enable high-speed networking applications, right? Let's start from there. Next slide, yes. This is a simplified view of an SR-IOV setup. We can see that SR-IOV devices can be consumed either directly by userspace applications such as DPDK, if we use VFIO, for example, or by standard Linux applications if we bind a netdev driver to the VF. Well, that is not different from what we would expect in the vDPA case: DPDK applications should be able to consume vhost-vDPA devices using the virtio-user PMD, and standard Linux applications should be able to consume virtio-net network interfaces using the kernel stack. The only difference is that there are more layers of drivers in the kernel, but from the pod's perspective, it looks pretty similar. However, from the Kubernetes network orchestration point of view, we have found some limitations, and in order to understand them, we are going to present the Kubernetes SR-IOV ecosystem in a bit more detail.

So, first, let's introduce the key players. First, we have the SR-IOV network operator. The operator creates VFs and binds the right drivers to them. Then we have the network device plugin. It discovers VFs, makes them available to pods, and later on, when a pod is allocated one, it adds the device specification into the pod's OCI spec. It also adds environment variables pointing to the PCI address that the pod has been allocated. Finally, we need to add network information to the pod, which is done through Multus or any other meta-plugin. These meta-plugins allow Kubernetes to call different CNIs for each network attachment, and in this case we would use the SR-IOV CNI, which configures the networking aspects of SR-IOV VFs, such as MAC address, VLAN, and so on. Also, if it is a netdev, it moves the interface into the pod's network namespace.

So, let's see these key players in a bit more detail. We start with a node with an SR-IOV capable PF. The first thing that happens is that the SR-IOV operator comes in and configures this PF: it creates VFs and binds the right drivers to each VF. For example, here we show one VF using the VFIO driver and another one using a netdev driver. Now the job of the operator is pretty much done, and it's the turn of the SR-IOV network device plugin. The network device plugin is deployed on each node and configured. The configuration tells the SR-IOV device plugin how to arrange SR-IOV resources into pools. For instance, in this slide, we can say VFs 0 to 3 go to pool 1 and VFs 4 to 7 go to pool 2. We can also filter based on the driver that is being used or on some other properties. The device plugin discovers the devices, creates the pools, and tells kubelet about them so that these resources can be allocated to pods. Now it's time to configure the SR-IOV CNI. We do this by creating a network attachment definition. In this network attachment definition we define a secondary network, and we can refer to an existing resource pool. For instance, here we can create a network called extranet1 that requires a device from, for example, pool 1. This way, kubelet knows that when a pod requests to be attached to extranet1, it will have to allocate a device from pool 1. So now the system is ready to create new pods.
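As a side note on the first step of that walkthrough, "creating VFs and binding drivers" ultimately comes down to a few sysfs writes, which the SR-IOV operator automates. A minimal sketch, with the PF interface name, VF count and VF PCI address purely illustrative:

```c
/*
 * Minimal sketch of what "create VFs and bind drivers" means at the sysfs
 * level; this is what the SR-IOV network operator automates. The PF interface
 * name, VF count and VF PCI address below are purely illustrative, and a VF
 * already claimed by its default driver would first need to be unbound.
 */
#include <stdio.h>

static int write_sysfs(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s", value);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Create 4 VFs on the physical function. */
    write_sysfs("/sys/class/net/enp1s0f0/device/sriov_numvfs", "4");

    /* Steer one VF towards vfio-pci and ask the PCI core to probe it. */
    write_sysfs("/sys/bus/pci/devices/0000:01:00.2/driver_override", "vfio-pci");
    write_sysfs("/sys/bus/pci/drivers_probe", "0000:01:00.2");

    return 0;
}
```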
When a new pod comes in, the SR-IOV device plugin is asked to allocate a device. The response contains information that is attached to the OCI runtime spec of the pod. This way, the pod is able to access some of the node's devices, for example the VFIO device of our VF. Additionally, it adds environment variables specifying the PCI addresses of the devices that the pod has been allocated. Now it's time to configure the network. This is done by Multus. Multus first configures the default network and then calls the SR-IOV CNI with the PCI address of the device it has to configure. The SR-IOV CNI can then set things like the MAC address or the VLAN tag on the VF and move the netdev interface into the pod's network namespace. So now the pod has a network interface called net1. Finally, Multus gets all the information from the CNI and writes it into the annotations of the pod, so that the container, when it starts, can access the annotations through the Downward API. In this figure, we show the use of a library called app-netutil, which we'll talk about later; what it basically does is parse the content of the network status annotation and offer an API to workloads. Easy, right? It's a nice diagram.

However, we have found that this approach has several limitations. First, the pod has very limited information: it only has the PCI address. We need more than that for vDPA; we need to know which vhost-vDPA device to use, for example. Also, the network status annotation does not have device information, so the pod does not know which VF is associated with which network interface, with which network attachment. The CNI has very little information as well. Finally, vDPA provisioning has some extra steps, and some of those steps might actually change, because there are efforts still going on in the kernel community to develop management tools.

So, in order to solve these limitations, we created the Device Info Specification. This spec defines a way to share device information in a standardized JSON file. The file is created by the entity that creates the device; in most cases, this is the device plugin. Multus understands this file and adds it to the network status annotation, thus binding the network information with the device resource information and making it all available for the pod to consume. This spec is hosted by the Kubernetes Network Plumbing Working Group, and you have the link down there. Also, we have improved the network device plugin and the CNI by adding support for the Device Info Specification, adding support for vhost-vDPA and virtio-vDPA devices, and adding a new selector that can be used to filter devices based on the vDPA driver; the new selector is called vdpaType. Finally, just to mention that the low-level vDPA management has been moved to an external library, so that we can keep pace with the work that is being done in the kernel. We have also improved the library that I mentioned before, app-netutil. It provides a native API, in C and Golang, to workloads, and it now supports the Device Info Spec and vDPA.

Let's go back to our diagram. This was how SR-IOV device assignment worked in Kubernetes; when we add the Device Info Spec and vDPA, it turns into this other diagram. Now, the device plugin configuration tells the device plugin what kind of vDPA device it has to deal with. The device plugin discovers and filters vDPA devices and adds them to the pools.
Also, the device plugin writes the device information into the Device Info file, and this file is then read by Multus and added to the network status annotation. The rest of the system behaves in a similar way. So, pods can access, in this case, vhost-vDPA char devices and virtio-net network interfaces. This is what we're going to demo now. If you can play the demo for me... Thanks.

In this demo, we are running on a cluster with essentially two nodes, and we're running these commands on one of the nodes. We're using ConnectX-6 Dx NICs with pre-configured, pre-created VFs. Now, we're going to inspect the vDPA devices. The vDPA bus is very similar to other buses in the kernel, so it can be inspected through the sysfs API. We can also inspect which drivers these devices are bound to by using this same sysfs API. We see that we have two devices bound to the virtio-vDPA driver and another two bound to the vhost-vDPA driver. Now we'll inspect the SR-IOV device plugin config through its ConfigMap. It has three pools configured, but only two will be used in the demo. Basically, they use the new selector to pick the vDPA devices: vhost-vDPA devices will go to one pool and virtio-vDPA devices will go to another pool. Now, let's make sure that the device plugin is running on both nodes and that the device plugins have detected these devices and have added them to kubelet. Using this command, we list the allocatable resources known to kubelet, and we can see there that we have our two resources. Great.

The device plugin should also have created the device information files for us, so we can check that, because they should live under a standardized file system path. We can see our four device info files there. We can check the content of one of them, for example, and see that it is a JSON file containing the necessary information that will be needed to consume this device; in this case, it's a vhost device. Now we're going to configure the CNI, and we're going to do that by creating a network attachment definition. One thing to note here is that the network attachment definition is a completely standard one; it has no changes required to support vDPA. So, we're configuring a VLAN tag, a trusted mode, et cetera. We'll create two of those, and now we are ready to deploy our DPDK app.

The DPDK app we're going to run in this demo is a sample DPDK application: testpmd transmitting packets and testpmd receiving packets. So, let's deploy it and check the logs of the generator and of the receiving end. We see that packets are flowing from the generator, and packets should also be getting into the testpmd that is receiving them. We will show in just a second what the DPDK command line looks like on one of the nodes, just to show how app-netutil was able to generate this command line. We see that it is using the virtio-user PMD, and, again, this command line was generated just from information that was inside the pod. To verify that, we can describe the pod and look at its annotations. There we see that the annotations have all the needed information: the networking and the device information, nicely tied together and available for the pod.
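For reference, the generated testpmd command line essentially attaches a virtio-user port backed by the pod's vhost-vDPA device. Here is a minimal sketch of the same thing done programmatically; the vdev name and device path are illustrative, and in the demo the path comes from the pod's annotations via app-netutil:

```c
/*
 * Minimal sketch (not the demo's actual code) of what the generated command
 * line boils down to: a DPDK application attaching a virtio-user port backed
 * by the pod's vhost-vDPA character device. The vdev name and device path
 * are illustrative.
 */
#include <rte_eal.h>

int main(void)
{
    char *argv[] = {
        "dpdk-app",
        "--no-pci",
        "--vdev=net_virtio_user0,path=/dev/vhost-vdpa-0",
    };
    int argc = sizeof(argv) / sizeof(argv[0]);

    if (rte_eal_init(argc, argv) < 0)
        return 1;

    /*
     * From here, port 0 is the virtio-user port and can be configured and
     * polled with the usual rte_eth_dev_* / rte_eth_rx_burst() calls,
     * much as testpmd does.
     */
    return 0;
}
```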
So, we have five minutes left. Maxime, if you want, we can skip the last part of the demo, which will be available in the slides, and go straight to the last slide, where we can talk about the current status and next steps. So, back to you, Maxime.

I'm not muted. So, about the current status: for the lower layers, we have an end-to-end solution available upstream, from the kernel to both QEMU and DPDK applications. But there are still some gaps, and work is ongoing or planned to address them. There is a vDPA management tool, based on devlink, that is being discussed upstream. There is also the control virtqueue support that I mentioned at the beginning of the talk, which has to be added to the kernel; when that is done, we will be able to implement it both in QEMU and in the virtio-user PMD. And finally, on the DPDK side, we will add the config space read request of the vhost-vDPA protocol that we are missing; it will be added in the virtio-user PMD so that the container is able to get the MAC address directly from the hardware instead of receiving it via the annotations.

Right, and on the Kubernetes angle, we are currently working on the SR-IOV network operator integration. Also, some of the vendors will support vDPA in switchdev mode, so we are working on other CNIs, like the upcoming Accelerated Bridge CNI, in order to support switchdev cards on secondary interfaces, on top of which we will add vDPA support. And there is also OpenStack support coming; that will be it. I see we don't have a lot of time remaining. Thank you very much.

We also have two questions from Thomas. The first question is: can you do the same with sub-functions instead of VFs? And Thomas is also wondering if Kata is a must for vDPA. For sub-functions: people are working on it, so it will be supported in the future. And for Kata: no, Kata is not a must; it's just that vDPA makes a lot of sense in the scope of Kata Containers. All right.