Welcome everyone to this presentation, where we will talk about vDPA block: unified hardware and software offload for virtio-blk. I'm Stefano Garzarella, a senior software engineer in the Red Hat virtualization team. During this presentation we will talk about vDPA block, starting from its goals and benefits. Then we'll take a look at the standard path followed by a virtio-blk request, and at two possible ways to accelerate that path. At that point we'll take a quick look at vDPA, since there have already been several talks about it, and we will focus on virtio-blk device accelerators. Finally we'll look at QEMU, especially at the block layer features that are bypassed when we use accelerators, and we will propose a mechanism to automatically switch between the fast and the slow path depending on whether we need QEMU storage features or not.

The main goal, and also a benefit, of vDPA block is to have a unified software stack. This is really useful from the user perspective, since we have the same software stack for virtual machines, containers and bare-metal applications. But it is also useful from the vendor perspective, since the vDPA abstraction allows us to reuse a lot of code regardless of the hardware vendor, and also to provide accelerators in software. The QEMU auto-switch feature will allow us to take advantage of QEMU storage features, such as image file formats and block jobs. We will focus on a high-performance implementation suitable for modern, fast NVMe drives. So if you are developing new accelerator devices, join us and take advantage of the vDPA block software stack. You can find more information and useful links about vDPA at vdpa-dev.gitlab.io.

Before talking about accelerators, let's see what the standard path followed by a virtio-blk request looks like. Going from the guest kernel to the host kernel, the request must cross several layers, it must be translated into different formats, and it is queued in multiple queues. The request starts in the Linux block layer inside the guest, then it is queued in the virtqueue by the virtio-blk driver. In QEMU, the device emulator receives the request and queues it in the QEMU block layer, where in the end it is handled by the asynchronous I/O engine, which can be Linux AIO or io_uring. At this point QEMU needs to do a system call to send a batch of requests to the host kernel, where Linux AIO or io_uring handles them and forwards them to the virtual file system and the Linux block layer before reaching the device driver.

One possibility to shorten the path that a request has to follow is to move the device emulation from QEMU into the host kernel, using the vhost framework to implement an in-kernel virtio-blk device emulation. In this way we bypass QEMU, so we reduce the overhead since there are fewer layers to cross, but we cannot use the QEMU storage features anymore, so this approach is fine only for raw files or block devices. Several vhost-blk implementations have been proposed over the years, but none has been merged upstream because they didn't show impressive performance gains.

A similar approach, which we presented in a talk at last year's KVM Forum, is based on io_uring, the Linux interface between user space and the kernel for asynchronous I/O. The interface consists of a pair of rings, a submission queue and a completion queue, allocated by the kernel and shared with user space.
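To make that last point concrete, here is a minimal sketch using liburing (the file name and sizes are placeholders): the submission and completion rings are set up once by the kernel and shared with user space, so a batch of requests can be submitted with a single system call. This is the plain, non-polled flow; the polling features come next.

```c
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[4096];
    int fd, ret;

    fd = open("/tmp/test.img", O_RDONLY);       /* placeholder backend file */
    if (fd < 0) { perror("open"); return 1; }

    ret = io_uring_queue_init(64, &ring, 0);    /* 64-entry SQ/CQ rings, no flags */
    if (ret < 0) { fprintf(stderr, "queue_init: %s\n", strerror(-ret)); return 1; }

    sqe = io_uring_get_sqe(&ring);              /* take a submission queue entry */
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);

    io_uring_submit(&ring);                     /* one syscall submits the whole batch */

    io_uring_wait_cqe(&ring, &cqe);             /* reap the completion from the CQ ring */
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```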
What we tried last year was to map these queues directly into the guest memory, and we modified the guest virtio-blk driver to use the host io_uring queues directly. We also took advantage of io_uring's polling features: we enabled SQPOLL to avoid notifications from the guest, and IOPOLL to avoid interrupts from the device.

The performance of the io_uring passthrough proof of concept that we implemented last year was very promising. On the vertical axis we have kIOPS, higher is better, and we compare fio running on bare metal (the last column on the right) with io_uring passthrough and with vhost-blk with several optimizations. The initial performance of vhost-blk was very low, but adding some polling of the virtqueue and of the device in the host kernel increased it considerably. With a vDPA block software device, which is very similar to vhost as we will see, we expect to be between vhost-blk and io_uring passthrough.

io_uring passthrough looks very promising, but it requires changes in the guest. vhost-blk was never merged and requires some work to optimize it. Since vDPA, the new framework to support virtio device accelerators, was recently introduced, we decided to try to implement an in-kernel software virtio-blk device accelerator based on vDPA, because it should allow us to reuse a lot of code, even with hardware accelerators.

Before we look at the details, let's take a very quick look at vDPA. vDPA is the acronym of virtio data path acceleration. A vDPA device must provide a data path fully compliant with the virtio specification, while the control path can be vendor specific, so a small vDPA driver in the host kernel is required for the control path. vDPA was mainly designed for hardware accelerators, but its design and abstraction also allow software devices emulated in the host kernel. The current implementation pins the guest memory, so memory overcommit is not supported yet; on the other hand, it should be faster, since we don't need to handle page faults while accessing guest memory.

The main advantage of using vDPA is the unified software stack for all vDPA devices. The red boxes in the picture represent the parts of the code that can be reused regardless of the device, whether it is implemented in software or by different hardware vendors. vDPA provides a vhost interface for user-space or guest virtio drivers, like a VM running in QEMU. It also provides a virtio interface, so the device behaves like a standard virtio device and can be attached directly to the host I/O subsystem using the standard virtio drivers that we normally use in guests. In this way, a vDPA device can be used by bare-metal or containerized applications running on the host. vDPA also provides a management API, through netlink, to instantiate or destroy devices and to configure virtio parameters.

So, based on the vDPA framework, we can develop a unified software stack to support software and hardware virtio-blk devices. Guests don't need any changes, since the exported interface is fully compliant with the virtio-blk specification. The QEMU code and the vDPA framework code in the host kernel can be reused for both software and hardware devices. The custom code that we need is a small vDPA parent driver for each hardware device, such as custom hardware from a vendor, SmartNICs or FPGAs; this driver is used only for the control path. For the software device, of course, we need to implement the device emulator in the host kernel.
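The vhost interface mentioned above is exposed through a character device. As a hedged illustration (the device node name and index depend on your setup, and it assumes a vDPA device is already bound to the vhost_vdpa bus driver), a user-space program can query it with the standard vhost ioctls:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vhost.h>    /* VHOST_GET_FEATURES, VHOST_VDPA_GET_DEVICE_ID */

int main(void)
{
    /* Assumes a vDPA device bound to vhost_vdpa; the index may differ. */
    int fd = open("/dev/vhost-vdpa-0", O_RDWR);
    if (fd < 0) { perror("open /dev/vhost-vdpa-0"); return 1; }

    uint32_t device_id;
    uint64_t features;

    /* Virtio device type; a virtio-blk device reports 2 (VIRTIO_ID_BLOCK). */
    if (ioctl(fd, VHOST_VDPA_GET_DEVICE_ID, &device_id) == 0)
        printf("virtio device id: %u\n", device_id);

    /* Feature bits offered by the device, negotiated via the vhost interface. */
    if (ioctl(fd, VHOST_GET_FEATURES, &features) == 0)
        printf("device features: 0x%llx\n", (unsigned long long)features);

    close(fd);
    return 0;
}
```

This is the same interface a VM managed by QEMU would use; the alternative virtio interface instead exposes the device to the host like a regular virtio device, as described above.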
We will base the implementation of the vDPA block software device on our experience with io_uring passthrough and on the vhost-blk optimizations we have seen. The device will interface directly with the Linux virtual file system, as Linux AIO and io_uring do. This allows both block devices, for example a physical NVMe drive, and raw files stored on a file system to be used as backend. Eventually it can also be used as a fallback when hardware accelerators are not available.

We will also support the dynamic polling strategies that, as we have seen, helped a lot to increase the performance of vhost-blk. We will have virtqueue polling, similar to io_uring SQPOLL: the device will poll the virtqueue for a while to check whether there are new requests, so the guest can potentially submit I/O without any vmexit. And we will support the I/O polling feature of the Linux block layer (IOPOLL). This feature must be supported by the device and by the file system, and it allows busy-waiting for I/O completions, avoiding asynchronous interrupts from the device.

The vDPA management API will provide a standard way to create and destroy the device and to set the virtio parameters defined by the specification, such as the virtqueue parameters (for example, the queue size) and the virtio-blk configuration. We will also need to provide a custom API for the vDPA software device, to attach it to a block device or a raw file and to set parameters related to the implementation, for example to control the polling mechanisms.

All the accelerators we have seen so far (vhost-blk, io_uring passthrough and the vDPA block software or hardware devices) bypass the QEMU block layer. This can be fine when we want to take full advantage of the performance of a hardware or software accelerator: as we have seen, in these cases the accelerator has direct access to the guest virtqueue for the best possible performance, avoiding several layers and translations in QEMU. But often we may need QEMU to process the requests, because we need the storage virtualization features provided by the QEMU block layer, such as image file formats, I/O throttling, snapshots, encryption, incremental backup and other useful features.

So what we would like to do in QEMU is an automatic switching mechanism. We will use the fast path, bypassing QEMU, when we don't need the QEMU storage features and we want the best possible performance, using a vDPA block hardware accelerator or the software device attached to raw files or block devices. As we have seen, in these cases the queues are directly exposed to the accelerator. But we may need QEMU to process the requests because, for example, we need a QEMU storage feature, or the guest requests something that is not supported for now by vDPA devices, or we need to live-migrate the VM. At that point we will switch from the fast path to the slow path, where QEMU starts to process the guest block virtqueue. We are already able to do this, because we have the virtio-blk device emulation in QEMU; what we need is the interface with the vDPA block device.

A very interesting approach was already proposed by Eugenio, primarily to solve the problem of live-migrating VMs with vDPA devices. In this case, QEMU allocates a new virtqueue, which we call the shadow virtqueue, and exposes it to the vDPA device. This gives QEMU complete control of the guest virtqueue to perform the emulation, but it also allows QEMU to intercept requests and process them in the QEMU block layer, applying the functionality requested by the user.
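Going back to the polling strategies mentioned above, here is a minimal sketch (assuming liburing is available) of how a user-space program opts into the analogous io_uring features. The in-kernel software device would use the equivalent in-kernel mechanisms, so this is only meant to illustrate the idea.

```c
#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct io_uring_params p = {0};
    struct io_uring ring;
    int ret;

    /* SQPOLL: a kernel thread polls the submission queue, so I/O can be
     * submitted without a system call (similar to the virtqueue polling
     * described above, which lets the guest submit I/O without vmexits).
     * IOPOLL: completions are busy-waited on the block device instead of
     * being delivered by interrupts; it requires O_DIRECT files on a
     * device/file system that supports polled I/O. */
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL;
    p.sq_thread_idle = 2000;    /* ms the SQ thread keeps polling when idle */

    ret = io_uring_queue_init_params(64, &ring, &p);
    if (ret < 0) {
        /* SQPOLL may require CAP_SYS_NICE (or root) on older kernels. */
        fprintf(stderr, "io_uring_queue_init_params: %s\n", strerror(-ret));
        return 1;
    }

    io_uring_queue_exit(&ring);
    return 0;
}
```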
In some cases, the choice between the fast and the slow path can be made at start time, but the most interesting case is the runtime switch: when we need, for example, to live-migrate the VM, or when a storage feature is requested while the VM is running, for example because the user wants to set an I/O throttling limit.

About the runtime switch, let's follow an example. Initially we start using the fast path, because we don't need any QEMU storage feature, so the virtqueue is processed directly by the vDPA device. At some point an operation is requested by the user for which we need QEMU to process the virtqueue, so we need to switch to the slow path, where QEMU takes control of the virtqueue. Before doing this we must freeze the device state: we must stop the guest driver from queuing new requests and wait for the vDPA device to complete all in-flight requests. When all requests are completed, QEMU takes control of the virtqueue, allocates the new shadow virtqueue and exposes it to the vDPA device. At this point we are ready and we can restart the guest driver, allowing it to queue new requests.

After some time the requested operation ends, for example because we have successfully migrated the VM or the feature is no longer required, so we can switch back to the fast path, doing the opposite of what we did before. We stop the guest driver again and wait for the in-flight requests to complete. Then QEMU passes control of the guest virtqueue back to the vDPA device, and at this point we are ready and we can restart the guest driver, taking advantage of the fast path for the best performance.

This was an overview of what we would like to do, leveraging the vDPA capabilities. Currently we have merged upstream a vDPA block simulator in the Linux kernel: it is a simple RAM disk that can be used to develop and test the software stack. In the coming months we would like to develop a first implementation of the in-kernel software device, especially to see if our performance expectations are correct. Then we will focus on QEMU, adding the vDPA block support and the auto-switching mechanism between the fast and the slow path. We also expect new hardware based on vDPA block to be released: custom accelerators from vendors, but also devices based on SmartNICs and FPGAs. All these things can go in parallel thanks to the simulator already merged, so if you are interested in collaborating you are really welcome, both software developers and hardware vendors. Join us and take advantage of the vDPA block software stack. Thank you very much for attending this talk, and now it's time for questions.