Okay, so welcome back everyone. Let's talk a bit about the QEMU storage daemon, the block layer, and the cool things that you can do with them. My name is Kevin Wolf, and presenting with me is Stefano Garzarella. We're both working in the virtualization team at Red Hat.

I'll start off by talking about what we have today, the traditional setup with QEMU that probably everybody here knows. The typical thing is that you have one QEMU process that emulates one guest, some of the devices that the guest uses are disks, and the QEMU process just accesses an image file to expose it to the guest. You can of course have multiple VMs on the same host, which means a second QEMU process that opens its own image file.

Over time, we got a slightly less traditional use case where not only devices access the image file, but we also do more advanced things with images inside of QEMU, such as the background block jobs that we introduced, used for example for live storage migration, for backups, for managing snapshots. The NBD client is also used in the context of storage migration. But what's still the same here is that you have one QEMU process, and that QEMU process owns the image file that it accesses. Everything is in one process, and it serves a single VM. Obviously, it's only available while the VM is running, so you can't run a block job without a running VM, because the job is tied to the same QEMU process. When you want to share images between two VMs, for example because you have a template that both VMs are based on, you can do that, but only as long as both QEMU processes access it read-only, because otherwise, if you modify a shared file, the other process wouldn't know that it has changed, and that would cause corruption.

So that's what we used to have. The new thing that we got is actually not that new anymore per se; the QEMU storage daemon has been in QEMU for a while, but it's only now approaching the stage where you can do the really cool things with it. So what is the QEMU storage daemon? Basically, you take the block layer from QEMU and move it out into a separate process. That's the basic idea behind the QEMU storage daemon.

If you want to describe what the QEMU storage daemon is, you could approach it in two ways. You could say it's QEMU without the VM: you have just the storage parts, and everything that can be used in conjunction with the storage functionality, such as the QMP monitor, is also present in the storage daemon. Or you could approach it from the other side and say it's like an advanced qemu-nbd with an added monitor and added functionality: with qemu-nbd, for example, you can export only a single image, whereas the QEMU storage daemon can have as many images open as you want, just like the real QEMU. As I said, it supports the relevant subset of QMP. Everything that is related to VMs obviously doesn't work in the QEMU storage daemon, but it has all of the QMP commands to manage block devices, to start block jobs, all of these things.

You can see on the slide an example invocation of the QEMU storage daemon. It has the --blockdev option that you might know from QEMU, and it works exactly the same way.
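[The slide itself isn't reproduced in the transcript; the following is a minimal sketch of what such an invocation can look like, with made-up file names, node names, and socket path. The --export and --nbd-server parts are what's described next.]

```sh
# Open a qcow2 image: a file node for the raw bytes, a qcow2 node on top.
# --blockdev works exactly like in QEMU itself.
qemu-storage-daemon \
  --blockdev driver=file,node-name=file0,filename=disk.qcow2 \
  --blockdev driver=qcow2,node-name=disk0,file=file0 \
  --nbd-server addr.type=unix,addr.path=/run/qsd-nbd.sock \
  --export type=nbd,id=export0,node-name=disk0,writable=on
```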
And the --export option is something that we don't have in QEMU, because exporting block backends is not the primary use case of QEMU; its primary use case is running VMs. But one of the primary use cases of the QEMU storage daemon is actually to export a block backend to be used by some other process, for example by a QEMU process. So it gets its own command line option.

So we can now change the traditional use case and do something like this instead. It's basically the same as before, but the whole block layer has moved into a separate process. Now, why should we do that? I have several reasons why you might want to.

The first one is quite simply isolation. For security reasons, you might want more separation between the code accessing images and the code running VMs. Both sides need fewer privileges now: the storage side doesn't need all of the functionality that is needed to actually run guests, it doesn't need access to KVM, for example; and the QEMU side doesn't need to access images anymore. So if you ever have a security issue there, it will be much more contained. So isolation is an interesting use case.

Somewhat similar, but different, is separation of concerns. You could say that managing storage is one thing and running VMs is a different thing, and you might not want the same component to do both. That is exactly the stance that KubeVirt takes: they want to manage VMs, but storage is basically someone else's problem. And that someone else could be the QEMU storage daemon. It does all of that and serves a storage backend, with all of the functionality that QEMU can provide, to something that cares only about the VM part.

Here's another use case. Imagine you start a block job, maybe a backup, while the VM is running, and now you're shutting down the VM. Traditionally this means that you have to cancel the block job, and your backup just isn't completed, so you have to restart it the next time. That's not what you want. But if you have the storage daemon running in a separate process, you can just shut off the VM and the backup job keeps running. In the same way, if you start a backup while the VM is down, you can start the VM later and attach it to the storage daemon that is already running for the backup job. So we can have some interesting features there.

Okay, after the use cases where no VM is involved at all, or at least not at all times, what we can also do is have a single QEMU storage daemon that serves multiple VMs. Where before each QEMU process accessed its own file, we can now have multiple VMs that access the same storage daemon, and that brings a few interesting advantages.

In the beginning I mentioned the case where you share a backing chain, like a single template that multiple VMs are based on, and where the backing chain essentially becomes read-only, because otherwise you get corruption. If you have a QEMU storage daemon in between that opens it, then you can modify the backing chain as you want, for example delete a snapshot from it, and it just works.

You can also share a CPU for polling, which is something you might want to do in a low-latency use case where you want high performance. Essentially you set a CPU aside per I/O thread to poll for event completion, and now you can have a single I/O thread that serves multiple VMs.
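[To make the decoupled-backup idea concrete, here is a sketch of running a backup job inside the daemon itself. The monitor is set up with the same --chardev/--monitor options QEMU uses; node names, file names, and the socket path are made up for illustration.]

```sh
# Hypothetical setup: source image, backup target, and a QMP monitor socket
# all live in the storage daemon, independently of any VM.
qemu-storage-daemon \
  --blockdev driver=file,node-name=file0,filename=disk.qcow2 \
  --blockdev driver=qcow2,node-name=disk0,file=file0 \
  --blockdev driver=file,node-name=tgt-file,filename=backup.qcow2 \
  --blockdev driver=qcow2,node-name=target0,file=tgt-file \
  --chardev socket,id=qmp0,path=/run/qsd-qmp.sock,server=on,wait=off \
  --monitor chardev=qmp0

# Over the QMP socket, start a backup job. Because the job runs in the
# daemon, it keeps going even if the VM using disk0 shuts down:
# {"execute": "blockdev-backup",
#  "arguments": {"job-id": "backup0", "device": "disk0",
#                "target": "target0", "sync": "full"}}
```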
So you don't have to set aside a whole CPU for each VM; you can use one for multiple of them.

You can also share a single hardware device among several VMs. For example, say you have a single NVMe disk and you want to use the userspace NVMe driver in QEMU, but you can only open that device once. If you had to attach it to the QEMU process directly, only one VM could have it. But if you open it in a QEMU storage daemon and then let all of the QEMU processes access the storage daemon, this becomes possible.

And as a final use case, we can also attach things that are not virtual machines. We have different export types, and I'll come to that in a second, but for example we have a FUSE export type which lets you just mount an image on the host and use it from the host. Or you can directly attach an application that natively speaks the vhost-user protocol, things like that. Obviously this also applies to containers, because if you can do it on the host, you can do it in containers too.

So, let me give you a short introduction to the different export types that we support in the QEMU storage daemon.

The oldest one, which probably most people know, is NBD, the network block device. It exports the storage over the network, over a TCP connection or over a UNIX domain socket, that's possible too. It has existed in QEMU for many years, and it's actually the usual thing that you use when you do live storage migration without shared storage. It's not really a high-performance protocol, because it involves copying the data around, sometimes even over the network. It's definitely not zero-copy, so it comes with some performance cost, but it has been in wide use, especially in the context of storage migration.

Then we recently, I think in QEMU 6.0, got the FUSE export type, which takes an image and mounts it on the host as a single image file. So for example, you can open a qcow2 file in the storage daemon and export it over FUSE, and that file essentially looks like a raw image with the contents of the qcow2 image that you opened in the storage daemon. This is the export type that you might want to use with anything that doesn't know about disk images or about VMs. We haven't really optimized this yet; it works, but it's still fully synchronous, so there is still some work to be done there.

Then we have vhost-user-blk, and that's basically the high-performance option that you can use to connect a guest to the QEMU storage daemon, because the guest device talks directly to the QEMU storage daemon using the shared memory of the virtqueue. The QEMU process is only involved in setting everything up, and then it's just shared memory between the storage daemon and the guest, with QEMU not involved at all. That's the best case that you can have for performance. When you use this and you have virtio-blk in the guest, you essentially don't lose any performance by moving things out into a separate process, because whether a request is handled by the QEMU process or by the QEMU storage daemon doesn't really matter, as long as you don't add another step in the path.

And finally we have a fourth export type, which is VDUSE, and that exports a QEMU block backend as a vDPA device. Stefano will talk a bit more about vDPA in general, what it is and how it works. And I think with that I'll hand over to you. Okay. Thanks.
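[Before Stefano's part, a sketch of the FUSE and vhost-user-blk export types just described, with made-up file names and socket paths; the export syntax follows the qemu-storage-daemon documentation.]

```sh
# FUSE export: present the qcow2 image's contents as a raw file on the
# host. The mountpoint must be an existing regular file.
touch /tmp/disk.raw
qemu-storage-daemon \
  --blockdev driver=file,node-name=file0,filename=disk.qcow2 \
  --blockdev driver=qcow2,node-name=disk0,file=file0 \
  --export type=fuse,id=fuse0,node-name=disk0,mountpoint=/tmp/disk.raw,writable=on

# vhost-user-blk export: a guest connects over a UNIX socket, then does
# its I/O through shared-memory virtqueues, bypassing the QEMU process.
qemu-storage-daemon \
  --blockdev driver=file,node-name=file0,filename=disk.qcow2 \
  --blockdev driver=qcow2,node-name=disk0,file=file0 \
  --export type=vhost-user-blk,id=vub0,node-name=disk0,addr.type=unix,addr.path=/run/qsd-vhost.sock

# On the QEMU side (vhost-user needs shareable guest memory):
qemu-system-x86_64 \
  -object memory-backend-memfd,id=mem0,size=4G,share=on \
  -machine q35,memory-backend=mem0 \
  -chardev socket,id=char0,path=/run/qsd-vhost.sock \
  -device vhost-user-blk-pci,chardev=char0
```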
As Kevin just presented, the QEMU storage daemon has several export types; let's focus now on vhost-user-blk and VDUSE. In both cases, the QEMU storage daemon emulates a virtio-blk device, so the guest must have a virtio-blk driver. In this scenario, the QEMU block layer is completely bypassed, since the guest talks directly with the device. But if the guest does not support virtio-blk, QEMU could emulate an IDE disk, for example; to do that, though, QEMU itself needs to access the device. So the QEMU block layer needs a virtio-blk driver of its own to access the virtio-blk device emulated by the QEMU storage daemon.

This was one of the use cases that motivated us to develop a new library called libblkio. I will not go into details, because next Wednesday Stefan and Alberto will talk more about the libblkio API, its use cases, and the supported drivers; I really suggest following that talk to understand libblkio better.

As a high-level overview, libblkio is a new library that provides a single API for efficiently accessing block devices. It's written in Rust, but it also exposes a C API for C applications like QEMU. Among the supported drivers we have virtio-blk, which is exactly what we need to let the QEMU block layer talk to QSD when it emulates a virtio-blk device. An application can use the libblkio API to access virtio-blk devices; the configuration and the data path are completely abstracted by the library, and the library handles the virtqueues. Every request queued through the libblkio API is queued directly into the virtio-blk device. So the application does not need to implement a virtio-blk driver itself; it can use the library to do that.

libblkio supports several transports for the virtio-blk driver. We have, for example, the virtio-blk vhost-user driver, which implements a vhost-user front-end to communicate with a vhost-user back-end like the QEMU storage daemon. And it also provides a virtio-blk vhost-vdpa driver to access vDPA devices using the vhost-vdpa interface, which we will see later.

So, going back to the example we saw before: now QEMU can use libblkio to talk with the QEMU storage daemon, and then we can use the QEMU block layer and, for example, emulate an IDE disk for a guest that does not support virtio-blk.

We already mentioned vDPA when we talked about VDUSE. Now let's take a closer look to understand how libblkio helped us use vDPA devices with the QEMU block layer. vDPA means virtio data path acceleration. It was originally designed to accelerate virtio devices in hardware, where the data path must be fully compliant with the virtio specification, while the control path can be vendor-specific. On top of the vDPA framework in the host kernel, we have two ways to access vDPA devices; we call them vDPA bus drivers. The first one is vhost-vdpa, which is based on the vhost kernel interface, providing additional ioctls to allow full control of the device. This way the whole device is under the control of the userspace application, and it's the best interface for VM workloads. The second bus driver is called virtio-vdpa, and it allows attaching a vDPA device directly to the host kernel. For example, in the case of a vDPA block device, a virtio-blk driver is loaded inside the host kernel to handle the data path and connect the vDPA device with the Linux block layer.
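[Going back to that libblkio example: here is what it can look like on the QEMU command line, assuming the libblkio-based blockdev drivers as they exist in recent QEMU (virtio-blk-vhost-user and virtio-blk-vhost-vdpa, both of which require cache.direct=on); paths are made up.]

```sh
# QEMU consumes the daemon's vhost-user-blk export via libblkio, then
# re-exports it to the guest through the QEMU block layer, here as an
# emulated IDE disk for a guest without virtio-blk support:
qemu-system-x86_64 \
  -blockdev driver=virtio-blk-vhost-user,node-name=drive0,path=/run/qsd-vhost.sock,cache.direct=on \
  -device ide-hd,drive=drive0

# The same works against a vhost-vdpa device node:
qemu-system-x86_64 \
  -blockdev driver=virtio-blk-vhost-vdpa,node-name=drive0,path=/dev/vhost-vdpa-0,cache.direct=on \
  -device virtio-blk-pci,drive=drive0
```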
This way, applications running on the host or inside a container can access the vDPA device through the block device exposed by the Linux kernel, for example /dev/vda. You can find more information at vdpa-dev.gitlab.io.

As we already saw, vDPA was designed for hardware accelerators. In this case the virtqueues are processed directly by the hardware, providing the best possible performance. Examples are SmartNICs, where in addition to providing network offloads for the TCP/IP stack, they are now also able to emulate a virtio-blk device to accelerate network block protocols such as Ceph RBD, iSCSI, or others, depending on the vendor. The only required driver in this case is a small driver in the host kernel for the control path. For the data path, everything should already be there, since the data path must be fully compliant with the virtio specification.

vDPA also allows us to develop software devices in userspace, as Kevin mentioned, using VDUSE. VDUSE is an additional kernel module that provides an API to emulate devices in userspace; the device is then attached to the vDPA framework and exposed as a regular vDPA device. So it's very similar to vhost-user. The main difference is that, thanks to the vDPA bus drivers, we can use VDUSE devices with both VM and container workloads.

The last type of device that we can develop with vDPA is an in-kernel software device, which is very similar to vhost. The advantage here is that we can reuse the whole software stack we are developing for vDPA across all the kinds of accelerators we have seen: hardware, software in userspace, and in-kernel. The entire stack, from the kernel to libblkio, QEMU, and libvirt, stays the same. We can use the in-kernel software device when we want high performance but the hardware does not support vDPA.

So, going back to libblkio: it allows us to use vDPA devices in the same way we saw for vhost-user devices. All the QEMU storage features become available for vDPA devices, and QEMU can, for example, emulate any block device. But when QEMU emulates a virtio-blk device, there are two virtqueues involved: one between the guest driver and the QEMU device, and one between libblkio and the vDPA device. That is why we call this scenario the slow path, as opposed to the fast path, where a single virtqueue is shared directly by the guest driver and the vDPA device.

The slow path is almost ready, thanks to libblkio, while the fast path is among our future plans. Our idea is to extend the libblkio API to enable virtqueue passthrough: if the QEMU block layer is not needed, for example because the VM is not using any of its features, the guest virtqueues can be exposed directly to the device. This will work for both vDPA and vhost-user devices. We would also like to provide an automatic mechanism in QEMU to switch between the fast and the slow path at runtime, because several features, like I/O throttling or live migration, may be required while the VM is running.

So, summarizing our future plans: we will shortly try to implement the virtqueue passthrough in libblkio and QEMU.
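[A sketch of the VDUSE flow end to end, assuming the vduse-blk export type present in recent qemu-storage-daemon and the iproute2 vdpa tool; the device name is made up, and which bus driver binds the device depends on which modules are loaded.]

```sh
# Export a block node as a VDUSE device (needs the vduse kernel module).
qemu-storage-daemon \
  --blockdev driver=file,node-name=file0,filename=disk.qcow2 \
  --blockdev driver=qcow2,node-name=disk0,file=file0 \
  --export type=vduse-blk,id=vduse0,node-name=disk0,name=vduse0

# Instantiate it on the vDPA bus with the iproute2 vdpa tool.
vdpa dev add name vduse0 mgmtdev vduse

# With the virtio-vdpa bus driver loaded, the host kernel's virtio-blk
# driver takes over and a regular block device (e.g. /dev/vda) appears,
# usable by host applications and containers:
modprobe virtio-vdpa

# Alternatively, load vhost-vdpa instead and hand the resulting
# /dev/vhost-vdpa-N node to a VM.
modprobe vhost-vdpa
```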
About the storage daemon, we would like to support it in libvirt, and we will also explore a potential alternative implementation, maybe based on Rust, using the native Rust API provided by libblkio. And about the vDPA in-kernel software device, we built a proof of concept that showed very good results; it was based on the vDPA simulator, so it still requires a bit of work to be completed. So that's all from us; do we have some time for questions?

Yes, there are obviously security implications: the VMs share the same state, so if something goes wrong, the user of one VM could access the storage of another one, if you have a security problem there. That's something you need to take into account. I think in many cases it doesn't really matter that much; if the VMs logically belong to the same user, it's maybe not that bad. Also, these use cases are often performance scenarios, where the isolation aspect may not be that important. But it's definitely something to take into account when you're setting up your configuration.

One question there. No, we haven't actually done anything with seccomp yet. Usually restricting QEMU is something that happens at the level of libvirt. This is something that enables libvirt to do it in more specific ways than we currently do, but we're not actually exploiting that opportunity yet.

So this question was about file systems, how we map users and things. We don't, because this is just the block layer; we don't actually have files, we have block devices. But I think you're probably thinking about sharing the same file system between multiple VMs. That's basically a setup where you would use something like a cluster file system, and obviously that requires some way to synchronize. You could in theory run a cluster file system on this, and you actually could do that before, because the cluster file system does all of the work of synchronizing things. So you could already have a single image and let different QEMU processes access it; that was possible before.

Okay, so the question was how this compares to vfio-user and moving devices out that way. They're different in that vfio-user moves the whole PCI device out, right? And we're moving the backend out. So I wouldn't say they are alternative approaches to the same problem. Maybe you could even combine them: have the separate PCI device process access the storage daemon, which in this case could serve multiple devices instead of multiple VMs. So I think they just combine well together.

The question was how vhost devices compare with vDPA in-kernel devices, right? We expect very similar performance. For example, for block, vhost-blk was never merged, so now it makes sense to use vDPA instead, because we can reuse the whole stack. In theory the performance should be very similar, and the big advantage is reusing the software that we are developing for both hardware accelerators and software accelerators. And from the perspective of the user it just looks like a regular device, so we want to make sure that everything works right.

So, time is pretty much up. If you have any more questions, just catch us outside in the hallway. Thank you. Thank you.