Hello everyone, and thank you for attending this talk. I'm Stefano Garzarella, a senior software engineer in the Red Hat Virtualization team. Today we'll take a look at how to speed up a VM's I/O by sharing the host's io_uring queues with the guest.

This is the agenda of the talk. First of all, we'll take an overview of io_uring, looking at the system calls, how the queues are organized, and some interesting features like resource registration and polling. Then we'll look at QEMU and how it uses io_uring. At that point, we'll see how to speed up a VM by sharing io_uring queues directly with the guest. We'll also see some alternatives to this approach, like vhost-blk and vDPA block, and we'll compare them. Finally, we'll talk about next steps.

io_uring is a new Linux interface between user space and the kernel for doing asynchronous I/O. It's not only oriented to block operations; it has evolved into a generic interface for doing asynchronous system calls. The interface consists of a pair of rings allocated by the kernel and shared with user space. One ring is used by the application to submit new requests, and it's called the submission queue (SQ). The other ring, called the completion queue (CQ), is used by the kernel to return to the user the results of the submitted requests.

There are three system calls exposed by io_uring that we are going to explore. The first one is io_uring_setup. It's the first system call to invoke, to set up the context for performing asynchronous I/O. Several flags and parameters can be specified, such as the ring size. It returns a file descriptor that identifies the context, and it must be used with the other system calls.

The second one is io_uring_register. This system call is also used during the initialization phase of the rings, or even afterwards to change registered resources, but it's not used in the critical path. We'll talk more about it in a few slides.

The last one is io_uring_enter. It's the most used system call during the lifecycle of the context, because it's used to initiate and/or to complete asynchronous I/O. So with a single system call, we can submit new operations and reap completed operations using the rings that we are going to see in the next slides.

The submission queue is used by the application to submit new requests, producing a new SQ entry (SQE) that contains the operation to perform and its parameters, like file descriptor, buffer address, offset, etc. When the application has one or more SQEs ready, it increases the tail and calls io_uring_enter to pass control to the kernel. At this point, the kernel consumes the SQEs, updates the head, and schedules the work to execute the requested operations.

The kernel will process the scheduled operations, and when they are done it prepares a new CQ entry (CQE) for each submitted request, containing the result of the requested operation. The CQE also carries the same user_data value specified by the application in the corresponding SQE; it's an opaque value for the kernel, and the application can use it to match the result with the submitted operation. The kernel increases the tail of the completion queue when it adds new CQEs, and the application consumes CQEs, moving the head and checking the result of the operation submitted with the same user_data.

For each request, the kernel must take an internal reference to the file pointed to by the file descriptor and release it when the operation is done.
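To make the ring mechanics concrete, here is a minimal sketch of one submit/reap cycle using liburing, the userspace helper library that wraps these three system calls. The file path, queue depth, and buffer size are placeholders chosen for illustration:

```c
/* Minimal io_uring read with liburing: one SQE in, one CQE out.
 * Build with: gcc -o uring_read uring_read.c -luring */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[4096];

    int fd = open("/etc/hostname", O_RDONLY);       /* placeholder file */
    if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
        return EXIT_FAILURE;

    /* Produce an SQE: read 4 KiB from offset 0. user_data is the
     * opaque value the kernel copies into the matching CQE. */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_sqe_set_data(sqe, (void *)0x42);

    /* Bumps the SQ tail and calls io_uring_enter() for us. */
    io_uring_submit(&ring);

    /* Reap the completion: res is the number of bytes read,
     * or a negative errno on failure. */
    if (io_uring_wait_cqe(&ring, &cqe) == 0) {
        printf("user_data=%p res=%d\n", io_uring_cqe_get_data(cqe), cqe->res);
        io_uring_cqe_seen(&ring, cqe);              /* advance the CQ head */
    }

    io_uring_queue_exit(&ring);
    return 0;
}
```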
It also needs to map and unmap the user buffer in kernel virtual memory every time, for each request. In order to reduce this per-request overhead, if the application has a set of file descriptors and user buffers that are used very often, it can pre-register them with the io_uring_register system call and use an index in the SQE. In this way the kernel already has the reference, and it uses that index to get it (a short sketch of this appears at the end of this section).

This system call can also be used to register other resources, like an eventfd to receive a notification when some request is completed. Or it can be used to probe io_uring, to get information about the opcodes supported by the running kernel. It can also be used to register personalities, to issue SQEs with certain credentials. Or, as we will see later, to register restrictions and enable ring processing.

Another cool feature provided by io_uring is polling. We have the possibility to enable SQ polling and I/O polling. In the first case, a kernel thread is created to poll the submission queue, avoiding the need for a system call to pass control to the kernel. An idle time is configurable: if the kernel thread is idle for more than the configured time, it goes to sleep, and the application must call io_uring_enter with a special flag to wake it up. When this feature is enabled, the application can potentially submit and reap requests without doing a single system call. We can also enable I/O polling, doing a busy wait for I/O completions instead of waiting for an asynchronous notification, such as an interrupt from the device. This feature can be used only if the device or the file system supports block I/O polling.

Starting from QEMU 5.0, released in April this year, io_uring is available in the QEMU asynchronous I/O subsystem. Thanks to Aarushi, Julia, and Stefan, we have a new AIO engine that we can select with the -drive option (aio=io_uring). The engine does the standard block I/O operations (read, write, fsync) in an asynchronous way, using the io_uring queues and operations.

Now let's go into the main topic of the talk: how to speed up block I/O in QEMU. This is the starting point. We have a virtio-blk device emulated in QEMU that uses the io_uring AIO engine to do the block operations. So we have two communication channels: a virtqueue between the guest kernel and QEMU, and the io_uring queues between QEMU and the host kernel. So there is a kind of translation made by QEMU, from virtqueue descriptors to io_uring queue entries and vice versa.

If we don't need the features of the QEMU block layer, for example if we are using raw files or devices, we can bypass it and pass the io_uring queues directly through into guest kernel memory. To realize io_uring passthrough, the submission and completion queues are mapped into the guest memory, and we modify the virtio-blk driver to use this new short path instead of the virtqueues: it submits and reaps requests directly on the SQ and CQ rings. We use an eventfd registered in io_uring to inject an interrupt into the guest when there are new CQEs available, even though we are implementing polling strategies where we disable these notifications.

About polling, we used a set of patches developed by Stefan and Oxy to enable device polling through the Linux block I/O poll interface in the virtio-blk driver. We modified it to poll the completion queue in order to avoid interrupts.
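To make the registration story concrete, here is a minimal liburing sketch that registers one file and one buffer up front and then submits a fixed read. The path and sizes are placeholders; the point is that the file reference and the buffer mapping are taken once, at registration time, instead of on every request:

```c
/* Sketch: pre-registering files and buffers with io_uring_register(),
 * then issuing a read that refers to them by index.
 * Build with: gcc -o fixed fixed.c -luring */
#include <liburing.h>
#include <fcntl.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    static char buf[4096];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

    int fd = open("/etc/hostname", O_RDONLY);   /* placeholder file */
    io_uring_queue_init(8, &ring, 0);

    /* One-time registration: the kernel takes the file reference and
     * maps the buffer now, instead of once per request. */
    io_uring_register_files(&ring, &fd, 1);
    io_uring_register_buffers(&ring, &iov, 1);

    /* The fd argument is now an index into the registered file table,
     * selected by the IOSQE_FIXED_FILE flag. */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read_fixed(sqe, 0 /* file index */, buf, sizeof(buf),
                             0 /* offset */, 0 /* buffer index */);
    sqe->flags |= IOSQE_FIXED_FILE;

    io_uring_submit(&ring);
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```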
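Registering an eventfd works the same way, and it's the mechanism we rely on later to inject interrupts into the guest when new CQEs arrive. A small sketch, again with liburing; setup_cq_notification and wait_for_cqes are hypothetical helper names:

```c
/* Sketch: register an eventfd so the kernel signals it whenever
 * completions are posted to the CQ. */
#include <liburing.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <stdint.h>

int setup_cq_notification(struct io_uring *ring)
{
    int efd = eventfd(0, EFD_CLOEXEC);
    if (efd < 0)
        return -1;

    /* IORING_REGISTER_EVENTFD under the hood. */
    if (io_uring_register_eventfd(ring, efd) < 0) {
        close(efd);
        return -1;
    }
    return efd;  /* poll/read this fd to learn about new CQEs */
}

void wait_for_cqes(int efd)
{
    uint64_t count;
    /* Blocks until at least one completion has been posted. */
    read(efd, &count, sizeof(count));
}
```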
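Finally, polling is requested with setup flags. A minimal sketch, assuming a kernel with SQPOLL support; note that on older kernels SQPOLL requires elevated privileges, and IOPOLL additionally requires O_DIRECT files on a device or file system that supports block I/O polling:

```c
/* Sketch: enable SQ polling (a kernel thread polls the submission
 * queue) and, optionally, I/O polling (busy-wait on completions). */
#include <liburing.h>
#include <string.h>

int init_polled_ring(struct io_uring *ring)
{
    struct io_uring_params p;

    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL;  /* kernel thread polls the SQ */
    p.sq_thread_idle = 2000;        /* ms of idleness before it sleeps */

    /* p.flags |= IORING_SETUP_IOPOLL;  would add completion polling,
     * but needs O_DIRECT and block-layer poll support underneath. */

    /* liburing's submit path checks IORING_SQ_NEED_WAKEUP and calls
     * io_uring_enter() with IORING_ENTER_SQ_WAKEUP if the polling
     * thread has gone to sleep, so the fast path stays syscall-free. */
    return io_uring_queue_init_params(64, ring, &p);
}
```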
On the host we enabled SQ polling to avoid notifications from the guest, reducing VM exits, and we also enabled I/O polling to avoid the interrupts from the hardware device.

In order to share the submission and completion queues with the guest, we needed some changes in io_uring. The first one was a way to enable and disable eventfd notifications at runtime. We use this feature to disable interrupts in the guest when we are polling the completion queue. The second change is the most important part: we needed a way to restrict the operations allowed in an io_uring context, to safely share the rings with untrusted processes or guests. I put a link to a good article on LWN.net about this feature, which we will discuss in the next slides. The last change concerns memory translation, because io_uring expects host virtual addresses, but the driver in the guest uses guest physical addresses. So we need a mechanism to register the memory mapping, allowing io_uring to translate these addresses. Unfortunately this feature is not yet available.

As we saw, we needed the possibility to restrict the io_uring queues in order to share them with the guest. For example, we don't want to allow a guest to use all the file descriptors opened by QEMU, or to do any kind of operation. We want to enable only some operations, like read, write, and fsync, on a subset of file descriptors. So with the io_uring restriction feature we can install an allow-list on an io_uring context, and only the operations defined in that list can be executed. This also prevents new io_uring features from accidentally becoming available to the guest.

The allow-list can be installed using the io_uring_register system call, but the rings must start disabled, using the R_DISABLED flag during setup. In this state no operation can be submitted. When the restrictions are installed, we can enable ring processing using the enable-rings opcode with the io_uring_register system call. This allows us to avoid races between the creation of the rings and the installation of the restrictions.

With the allow-list we can restrict the io_uring_register opcodes, for example disabling the possibility to register new buffers or file descriptors; in this way the guest can use only the file descriptors that we already registered. We can also limit the SQE opcodes, allowing only a subset of operations, and we can specify which SQE flags are allowed or required for each operation. For example, if we want each SQE to use only registered file descriptors, we need to require that the fixed-file flag be set in each SQE. With this mechanism implemented in io_uring, we can safely share the submission and completion queues with the guest (a sketch of this flow appears at the end of this section).

We implemented a proof of concept to analyze the performance, and we compared it with bare metal and with the virtio-blk device emulation in QEMU that we saw some slides ago. In our test we ran fio with the io_uring engine and a 4 KiB block size. We measured the number of I/O operations per second, which we put on the vertical axis; the unit is kIOPS, so 1000 I/O operations per second. The results are really encouraging: in the worst case, where the I/O depth is one, there is only one request in flight, and in this case the gap between io_uring passthrough and bare metal is less than 13%. This gap is caused by the fact that we have to cross the Linux block layer twice: one time in the guest and another one in the host. As we can see, increasing the number of operations in flight, the gap goes to zero.
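As a concrete illustration of the restriction flow described above (rings created disabled, allow-list installed, rings enabled), here is a hedged liburing sketch. The allowed set shown, readv/writev/fsync on fixed files only, mirrors the example from the talk; init_restricted_ring is a hypothetical helper name:

```c
/* Sketch: create a restricted ring that only allows readv/writev/fsync
 * on pre-registered (fixed) files. */
#include <liburing.h>
#include <string.h>

int init_restricted_ring(struct io_uring *ring)
{
    struct io_uring_params p;
    struct io_uring_restriction res[4];
    int ret;

    memset(&p, 0, sizeof(p));
    memset(res, 0, sizeof(res));

    /* Rings start disabled: no SQE can be submitted yet. */
    p.flags = IORING_SETUP_R_DISABLED;
    ret = io_uring_queue_init_params(8, ring, &p);
    if (ret < 0)
        return ret;

    /* Allow-list: only these SQE opcodes... */
    res[0].opcode = IORING_RESTRICTION_SQE_OP;
    res[0].sqe_op = IORING_OP_READV;
    res[1].opcode = IORING_RESTRICTION_SQE_OP;
    res[1].sqe_op = IORING_OP_WRITEV;
    res[2].opcode = IORING_RESTRICTION_SQE_OP;
    res[2].sqe_op = IORING_OP_FSYNC;
    /* ...and every SQE must carry IOSQE_FIXED_FILE, so only the
     * file descriptors we registered can be used. */
    res[3].opcode = IORING_RESTRICTION_SQE_FLAGS_REQUIRED;
    res[3].sqe_flags = IOSQE_FIXED_FILE;

    ret = io_uring_register_restrictions(ring, res, 4);
    if (ret < 0)
        return ret;

    /* (register files, buffers, and the eventfd here, while the
     * ring is still disabled) */

    /* From now on, only the allow-listed operations are accepted. */
    return io_uring_enable_rings(ring);
}
```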
Compared with the virtio-blk device emulation in QEMU, the first bar in the graph, we are always faster, since we skip a big piece of the software stack and we avoid the translation from virtqueues to io_uring queues.

An alternative to io_uring passthrough is to move the device emulation into the kernel using vhost. Also in this case we would have a single communication channel, since the virtqueue is shared between the guest and host kernels. Over the last years, some vhost-blk implementations were published upstream but never merged. The first version, from Asias, used the bio API, the lowest-level API, just on top of the block drivers. The second version, from Vitaly, moved to the VFS API. This allows us to also use raw files stored in a file system. VFS adds some overhead, of course, but it's negligible, and it's also the same interface used by io_uring.

So I compared this last version with io_uring passthrough, improving the implementation a bit by adding virtqueue and block device polling. I ran the same fio configuration that we saw before. The first bar is vhost-blk without polling, so it's the original version. In the second bar I added virtqueue polling to the vhost-blk emulation, so we avoid the notifications from the guest. In the yellow bar we enabled block I/O polling, so we do a busy wait in the host kernel, avoiding interrupts from the device. And in the green bar we enabled SQ polling in fio running in the guest.

As you can see, even with polling there is still a gap with io_uring passthrough, and it should be related to the ring allocation. With vhost-blk, the virtqueues and the descriptors are allocated in the guest memory, so the host kernel must call copy_from_user() and copy_to_user() for each request. This is not needed with io_uring passthrough, where the submission and completion queues are allocated in the host kernel and then mapped into the guest memory.

As I said, vhost-blk was never merged upstream, but recently a new framework has been developed, especially to offload virtqueue processing to the hardware. This framework is called vDPA: virtio data path acceleration. Our idea is to implement a vDPA block software device, very similar to vhost-blk but using this new framework. This allows us to unify the software stack and reuse the same code, for example in QEMU, when hardware implementations become available. In addition, with vDPA we have more control than with vhost over the device lifecycle. And the guest pages are pinned in memory, so we don't need to memory-map user buffers or do copy_from_user()/copy_to_user() for each descriptor. This should allow us to fill the gap with io_uring passthrough. On the other hand, the pinned pages don't allow us to overcommit the guest memory, and we also need to implement the polling strategies and VFS integration that are already available in io_uring.

Concluding, in the next months we will implement a proof of concept of the vDPA block software device, starting from a vDPA block simulator in the kernel; then we will add the support for vDPA block in QEMU, and we will develop the Linux vDPA driver with the device emulation and VFS integration. We will also work on the block I/O poll optimization for the virtio-blk driver, and we will try to add the missing features to io_uring to complete the io_uring passthrough implementation.

So thank you very much, and now it's time for questions.