So good morning, everyone. I'm Stefano Garzarella, and today, with my colleague German Maglione, we will talk about ublk: virtual block devices in user space.

These are the main topics we will cover today. We will start by understanding what virtual block devices are and what they are used for. Then we will give a brief introduction to ublk and io_uring, which is the main facility that ublk uses. After that, we will take a closer look at the ublk driver, the user-space libraries, and some use cases and examples. Speaking of virtual disks, we will also cover the qcow2 image format, and we will see how to reuse the QEMU storage features without virtual machines, using the QEMU storage daemon. Finally, we will also talk a bit about the evolution of QSD: a Rust storage daemon.

Block devices are usually the devices we use to store persistent information. They can be, for example, NVMe or magnetic disks, and they expose fixed-size blocks that we call sectors. In this case, though, we are talking about virtual block devices, and by "virtual" we mean devices simulated by software.

So why do we need this? The first use case, of course, is obvious: virtual machines. But we can also use these devices without VMs, because the disk image formats we have for virtual machines, like qcow2, offer really nice features (growing images, snapshots, backing files) that we can reuse in other use cases where isolation is needed, for example with containers. Another reason to use virtual block devices is to mount virtual disks on the host in order to do some kind of manipulation on them. And finally, virtual block devices allow us to use network and distributed storage, like NBD, iSCSI, and Ceph RBD.

Linux already provides several in-kernel implementations of virtual block devices: for example, the loop device, NBD, RBD, and some others.
But, for example with Ceph RBD, the entire protocol is implemented in the kernel. So a bug in the protocol implementation can become a big issue, propagating to the entire kernel and causing a panic. For this reason, it could be interesting to move at least the protocol implementation to user space. Of course, we still need a small piece of kernel code, a module, to interface the Linux block layer with the daemon that is emulating the device. But the entire implementation of the protocol can be moved to user space.

Just to summarize why we want to do this in user space: the usual reasons are safety, isolation, maintainability, and portability. Of course, there is a trade-off to pay, which is performance. And this is the main reason for ublk: the idea is to try to fill the performance gap with the in-kernel virtual block devices, because ublk is based on io_uring and tries to take advantage of io_uring's high performance while moving the device emulation to user space.

ublk was introduced by Ming Lei pretty recently, in Linux 6.0. Essentially, it is a kernel module, the ublk driver, that exposes a regular Linux block device to user space and forwards all the requests to a daemon using io_uring queues and shared memory, as we will see later.

The ublk driver provides several interfaces. The first one is the ublk control character device, which is used to do the setup, that is, to allocate and configure a device. When we have created the device, we get two new interfaces: /dev/ublkb0 in this example, which is a regular Linux block device, and /dev/ublkc0, which is the character device exposed to the daemon so that it can handle the requests.
So essentially, what ublk does is this: every request that an application or a file system issues on the regular Linux block device is forwarded to the daemon through the ublkc interface, using io_uring queues and shared memory. The shared memory is used to forward all the request information, and the io_uring queues are used to exchange notifications between the kernel and the application. We will see more detail in a bit.

Now we can do a brief introduction to io_uring. Most of you probably already know it, but it will be useful as a refresher. We already covered this set of slides in a previous talk, and I put the link there, so today I will go really quickly; if you want more information, follow the link.

So, what is io_uring? io_uring is a Linux interface for doing asynchronous I/O. It was introduced by Jens Axboe in Linux 5.1 and was initially focused on block requests, but then it evolved to support more and more system calls, so now it has become a generic framework for invoking system calls in an asynchronous way.

The interface is pretty simple: there are two ring queues shared in memory between user space and the kernel, and three system calls. We have two queues essentially to avoid contention. One is the submission queue, where the producer is the user-space application and the consumer is the kernel; the other is the completion queue, which works the other way around: the kernel is the producer and the application is the consumer.

Talking about the system calls, the first one to be called is io_uring_setup(), which allocates the memory and sets up the context. Then we have io_uring_register(), which is also mostly used in the configuration phase: it allows the application to register resources that are often used during the data path.
This way, the kernel doesn't need, for example, to remap user buffers every time; it can also be used to register file descriptors, eventfds, and other things. Finally, the last one is io_uring_enter(). This is the main system call used during the data path: the application uses it to notify the kernel when there are new operations in the submission queue, and also to reap the completions of those operations from the completion queue. Anyway, there is a library, liburing, that provides a convenient API hiding all of these system calls.

Now let's take a closer look at how the io_uring queues work and how an application can submit operations. The first thing the application needs to do when it has a new operation, for example a write, is to produce a new SQE (submission queue entry) and fill in all the fields: the opcode, in this case the write opcode, and then all the parameters we would usually pass to the write() system call, so the file descriptor, the address of the buffer, the offset, and so on. Of course, the application can queue multiple SQEs before calling the system call. As the application puts SQEs into the submission queue, it updates the tail of the submission queue, and when it is ready, it invokes io_uring_enter() to notify the kernel.

At this point, the kernel starts to consume the submission queue entries, updating the head of the submission queue and storing internally the information about all the operations it needs to perform. Then the kernel starts to process the operations, and it can do so in any order. The only way to link an SQE with its CQE (completion queue entry) is the user_data field: a 64-bit value that is completely opaque to the kernel, so the application can put anything in it.
When the kernel completes an operation, it produces a CQE and essentially copies the user_data from the SQE into it. In this way, the application can link a CQE back to its SQE. So, when the operation is completed, the kernel produces a CQE, copies the user_data, sets the result and flags, and updates the tail of the completion queue. Also in this case, the kernel can produce more than one completion in one call. When it is finished, it returns to user space, and the application can consume the CQEs, updating the head of the completion queue. This is essentially how io_uring works.

There are a lot of other features and details we could cover, but one feature that I want to show you is the io_uring pass-through command. It was introduced pretty recently, in Linux 5.19, and consists of a new opcode, IORING_OP_URING_CMD, and two new setup flags (IORING_SETUP_SQE128 and IORING_SETUP_CQE32) that essentially allow doubling the size of the SQEs and CQEs.

So why is it useful? The pass-through command is essentially a way to issue asynchronous commands to special files, like character devices, that can be exposed by a driver, a file system, or any kernel component. It is mainly an asynchronous alternative to ioctls: both of them allow defining new special commands in kernel code without implementing new system calls. It is already used in the kernel; the first user was the NVMe subsystem, which used it to replace ioctls. So it can be used as an alternative to those, and another user is ublk, as we will see.

In very short words: user space uses the submission queue to send arbitrary commands to the kernel, and the new flags allow sending up to 80 bytes of command data; the kernel then uses a completion queue entry to send back the result, in a completely asynchronous way.
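The submit/complete cycle just described can be sketched with a toy model. This is not the real io_uring ABI, just the bookkeeping it implies: the application tags each SQE with an opaque user_data value, the "kernel" may complete operations in any order, and the application matches each CQE back to its SQE purely via user_data.

```python
# Toy model of the io_uring submission/completion flow (not the real ABI).
import random

class ToyRing:
    def __init__(self):
        self.sq = []   # submission queue (application -> kernel)
        self.cq = []   # completion queue (kernel -> application)

    def prep(self, opcode, user_data, **fields):
        # Like filling an SQE: opcode plus operation fields, tagged
        # with an opaque 64-bit user_data value chosen by the app.
        self.sq.append({"opcode": opcode, "user_data": user_data, **fields})

    def enter(self):
        # Like io_uring_enter(): the kernel consumes all pending SQEs,
        # processes them in ANY order, and posts one CQE per SQE with
        # user_data copied over so the app can link CQE back to SQE.
        pending, self.sq = self.sq, []
        random.shuffle(pending)            # completions may be out of order
        for sqe in pending:
            self.cq.append({"user_data": sqe["user_data"], "res": 0})

ring = ToyRing()
ring.prep("write", user_data=1, fd=3, len=4096)
ring.prep("read",  user_data=2, fd=3, len=512)
ring.enter()
completed = sorted(cqe["user_data"] for cqe in ring.cq)
print(completed)   # -> [1, 2], whatever order the kernel finished them in
```

Note that nothing except user_data ties a completion to its submission; that is exactly why ublk can reuse the same mechanism with its own meaning attached to each entry, as we will see next.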
Now let's take a look at how ublk uses this io_uring pass-through command. The first device we saw was the ublk control device, and it accepts several commands through the pass-through interface. The first one is ADD_DEV, the command used to allocate the device and set most of the parameters from the daemon's point of view: the number of queues, the queue size, features. When this command completes successfully, /dev/ublkc0 (in this example) is created, and it will be used later for the data path.

Another command issued on the ublk control device is SET_PARAMS. This command is essentially used to set the parameters of the standard Linux block device: the number of sectors, the block size, and other attributes. When the application has configured everything and is ready to start processing requests, it issues the START_DEV command; after this command, /dev/ublkb0, the regular Linux block device, is created. Of course, there are other commands, but we don't have time to cover them.

The last thing the application needs to do before starting to process requests is to map the shared memory of the /dev/ublkc0 character device. This shared memory contains one descriptor for each queue element, identified by a queue index and a tag. The tag comes from the multi-queue Linux block layer, and it is essentially used as an index into the queue. This shared memory is used by the kernel to write all the information about the requests that the user-space daemon then needs to handle.

At this point, the daemon is able to start serving the requests coming from the block device. What a ublk daemon usually does is fetch and commit requests using io_uring pass-through commands on the /dev/ublkc0 character device, as shown in this picture.
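The control-plane sequence above can be sketched as a small state machine. The command names mirror the kernel's ublk control commands (ADD_DEV, SET_PARAMS, START_DEV), but the behavior here is only a toy model of the ordering: which device nodes appear, and when.

```python
# Toy state machine for the ublk control-plane setup sequence.
class ToyUblkControl:
    def __init__(self):
        self.devices = {}

    def add_dev(self, dev_id, nr_queues, queue_depth):
        # ADD_DEV: allocate the device, set daemon-side parameters
        # (number of queues, queue size, ...); /dev/ublkcN appears.
        self.devices[dev_id] = {
            "nr_queues": nr_queues,
            "queue_depth": queue_depth,
            "params": None,
            "char_dev": f"/dev/ublkc{dev_id}",  # created on success
            "block_dev": None,                  # not yet
        }

    def set_params(self, dev_id, nr_sectors, block_size):
        # SET_PARAMS: parameters of the regular Linux block device.
        self.devices[dev_id]["params"] = (nr_sectors, block_size)

    def start_dev(self, dev_id):
        # START_DEV: only now does /dev/ublkbN appear.
        dev = self.devices[dev_id]
        assert dev["params"] is not None, "SET_PARAMS must come first"
        dev["block_dev"] = f"/dev/ublkb{dev_id}"

ctrl = ToyUblkControl()
ctrl.add_dev(0, nr_queues=1, queue_depth=128)
ctrl.set_params(0, nr_sectors=2048, block_size=512)
ctrl.start_dev(0)
print(ctrl.devices[0]["block_dev"])   # -> /dev/ublkb0
```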
The SQEs in this case are submitted by the daemon to notify the kernel, for example, that a request has been completed. And the CQEs are used by the kernel to notify the user-space daemon that there is a new request to handle. As we saw, to link a submission queue entry with a completion queue entry, we have user_data; this value is opaque to io_uring, but also to the ublk driver, so it is only used by the daemon to link an SQE with a CQE.

Now let's take a closer look at the io_uring commands used on the /dev/ublkc0 interface. The first one is the FETCH_REQ command. This command is issued only once per queue element, at daemon startup, and is essentially used by the daemon to tell the kernel: "I'm ready to process a new request for this slot," where a slot is identified by the queue ID and the tag. The daemon must also provide a pointer to the user-space buffer that will be used to handle requests for that slot, so that the kernel can map it and, for example, put the data there.

At this point, the daemon has told the kernel that it is ready for all the slots. When the kernel receives a new request coming from the Linux block device, it generates a new CQE and puts it into the completion queue to notify the daemon: "we have a new request to handle." Now we need another command for the daemon to tell the kernel: "I completed the request." That is the COMMIT_AND_FETCH_REQ command. It is essentially a kind of optimization: a single command that allows the daemon to tell the kernel both that it completed a request and that it is now able to handle a new request for that slot. It is very similar to the previous command; the only difference is that now the result field contains a valid value, which is the result of the completed request.
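The per-slot fetch/commit cycle can be sketched as another toy model. The method names echo the FETCH_REQ and COMMIT_AND_FETCH_REQ commands, but everything below is just the bookkeeping: a slot is identified by (queue ID, tag), FETCH_REQ arms it once at startup, and COMMIT_AND_FETCH_REQ completes the current request and re-arms the slot in a single step.

```python
# Toy model of the ublk data-path commands on one queue slot.
class ToySlot:
    def __init__(self, qid, tag, buf_size=65536):
        self.qid, self.tag = qid, tag
        self.buf = bytearray(buf_size)  # daemon-provided per-slot buffer
        self.armed = False              # may the kernel use this slot?
        self.request = None             # request descriptor (shared memory)

    def fetch_req(self):
        # FETCH_REQ: "I'm ready to process a request for this slot."
        self.armed = True

    def kernel_new_request(self, op):
        # Kernel side: write the descriptor into shared memory and
        # post a CQE to notify the daemon of the new request.
        assert self.armed, "slot not armed"
        self.request, self.armed = op, False

    def commit_and_fetch_req(self, result):
        # COMMIT_AND_FETCH_REQ: complete the request AND re-arm the
        # slot for the next one, in a single command.
        done, self.request = self.request, None
        self.armed = True
        return (done, result)

slot = ToySlot(qid=0, tag=0)
slot.fetch_req()                        # once, at daemon startup
slot.kernel_new_request("read")         # kernel: new request for (0, 0)
print(slot.commit_and_fetch_req(4096))  # -> ('read', 4096), slot re-armed
```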
The rest is pretty much the same as the previous command; the CQEs are always used by the kernel to notify the daemon that there is another request to handle. That was the overview of the kernel side, io_uring and ublk, and now German will talk about the user space.

[German] Okay. Well, how can we use all this? Currently, it is very experimental, so not only has the kernel API changed a lot, but the user-space APIs have changed a lot too. One of the options you have is to use ubdsrv, the server from Ming Lei. It is also a very fast-moving target, with a lot of development going on, and currently it is mainly meant to be used as a server directly, providing some block device targets out of the box. That includes, for example, qcow2, which is one of the motivations for developing ublk in the first place: there have been many previous attempts to put qcow2 inside the Linux kernel, but it is a quite complicated format and a big target, so that's the idea behind doing it in user space instead.

Inside ubdsrv there is a library, libublksrv, but it is also still very experimental and its API changes a lot; there is a plan, in the near future, to split that library out and make it a proper library that can be used directly. SPDK also supports ublk, but it has its own internal library. So basically, currently one of the easiest options is to write your own library to interface directly with the kernel if you need to implement a device, and people do that.

There is also a Rust version, developed by this fine gentleman here. Currently, since the kernel especially is a moving target, we are waiting until at least the kernel API stabilizes a little bit; also, I think Ming is interested in continuing to work on that. The idea of that crate is to provide a Rust API that is safe and difficult to misuse.
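As a rough sketch of what using ubdsrv looks like in practice: it ships a command-line `ublk` tool that drives the whole control-plane sequence for you. The exact flags may differ by version, since it is a fast-moving target, and the file name here is just an example; this needs the ublk_drv module loaded and (at the time of this talk) root privileges.

```shell
# Load the ublk kernel driver
modprobe ublk_drv

# Create a ublk device backed by a file, using the loop target;
# on success a regular block device /dev/ublkb0 appears.
ublk add -t loop -f ./disk.img

# List running ublk devices, then tear down device 0
ublk list
ublk del -n 0
```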
And maybe small and very simple C bindings, so the idea is that it is usable from both languages, basically.

Another option, developed by Richard Jones, one of our colleagues, is nbdublk. He took the NBD client from the kernel to user space using ublk, so it is also quite a nice example of how to use these libraries; he is using Ming's library currently.

There is another very interesting project from Richard, which is the ublk support around nbdkit. nbdkit is an NBD server with a plugin interface. The idea is that you have a plugin, say a file plugin, and you can say: expose this file as an NBD device, or just memory, and so on, or a file over SSH. So the idea is to reuse all that framework, that infrastructure of plugins, with ublk. If someone wants to experiment with ublk, I think this is the easiest project to start with, also because you can write your plugin in Python, for example. So I would sincerely recommend this path if you want to do some experiments with ublk; it is the easiest one currently.

As I said before, qcow2 was one of the main motivations behind ublk. The ublk qcow2 target implemented by Ming is basic, not fully complete; as I said, it is not an easy format, it has a lot of features. The idea is that qcow2 has a really nice feature set and we want to provide those features outside of a virtual machine, for example especially for container users, or simply as a better loop device, with all the features that qcow2 gives us. And given that, we thought: okay, we want qcow2 with the full feature set, but all the external implementations of qcow2 are, for obvious reasons, very simple, basically just read and write and pretty much that's it. So why don't we use the QEMU block layer, the QEMU storage code that already has all the features, the full qcow2 feature set, and just export that implementation using ublk?
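To make the qcow2 features concrete, here is a sketch of what the QEMU tooling already gives you today: creating an overlay on a backing file with qemu-img, and exporting the image through the QEMU storage daemon using one of its existing export types (NBD over a Unix socket; the ublk export discussed here is still a proof of concept). File and socket paths are examples.

```shell
# Create a qcow2 overlay on top of a backing file: writes go to the
# overlay, unmodified data is read from base.qcow2 (snapshots, thin
# provisioning, backing chains all come for free with qcow2).
qemu-img create -f qcow2 -b base.qcow2 -F qcow2 overlay.qcow2

# Export the image with the QEMU storage daemon, reusing the full
# QEMU block layer (qcow2, throttling, backup, ...) without a VM.
qemu-storage-daemon \
  --blockdev driver=file,node-name=file0,filename=overlay.qcow2 \
  --blockdev driver=qcow2,node-name=disk0,file=file0 \
  --nbd-server addr.type=unix,addr.path=/tmp/qsd.sock \
  --export type=nbd,id=exp0,node-name=disk0,writable=on
```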
So that is the project that Stefano and I, together with an intern, have been working on over the last few months: to export the whole QEMU block layer using ublk. QSD currently supports NBD and FUSE as exports, but those two are not designed for performance; vhost-user-blk, which is restricted to VMs; and VDUSE, which is a similar idea to ublk, not the same implementation, but the same idea of having a block device implemented in user space. The project I mentioned is to extend the QEMU storage daemon to export the block layer using ublk as well. We have a proof of concept; it is still in development, we are working on it, so if someone is looking for a very interesting project to collaborate on, this is one of those.

And there is also the Rust crate I mentioned. One of the motivations for the Rust crate, by the way, is for the new Rust storage daemon to also have this capability of exporting the QEMU-style block layer using ublk. The main motivation behind the Rust storage daemon is a bit more involved; it has to do with ongoing development to bring multi-queue support to QSD. Recently, version 0.1 was released, and there are a couple of blog posts that are really interesting to read, so I also recommend taking a look. It is a very interesting project if someone wants to dive into Rust, and especially async Rust. So this is the next thing we want to work on. And that is basically all I have, so thank you.

[Q&A] The question is whether we have compared the performance of ublk in user space with the other current options, for example NBD. Not yet: we have some preliminary numbers, but no proper benchmarks. We expect that, compared for example to NBD or FUSE, ublk will be faster, because we don't need to go through the whole network stack; of course, NBD can also be used over a Unix socket, but those exports were not designed for performance.
Also, we still think we can tune io_uring more, because the thing with all the io_uring-based stuff is that if you want performance, you cannot just throw io_uring into a program and expect it to be faster; you need to design the program from scratch around it. So the idea is to complete the ublk export and then do proper benchmarks, especially against VDUSE, which is probably the main competition.

The next question is whether we can use this unprivileged, without root, basically. Currently, yes; it is a recent development. We didn't talk about it because unprivileged user support was a feature added only recently. [Stefano] Yeah, I saw some patches on the mailing list, but they are not merged yet; anyway, ublk will gain the possibility to be used by unprivileged users. [German] In user space it is already done, but the kernel part is still missing; of course, you can compile those patches and try it, but yeah, that's in the near future. The original motivation was also to provide this for containers.

[Stefano] Excellent. The question is: we submit requests and reap completions with io_uring commands, so if we have requests in flight, do we take new requests only when we issue a new io_uring command? Is that the question? Yeah, you issue a new command to fetch the next request to handle, but in the meantime you can still serve the other requests, and you don't need to complete all of them first. You can also ask for a new request even if you haven't completed the current one: the COMMIT_AND_FETCH_REQ command is per tag, per queue. So if you have, for example, 100 elements in the queue and you have used only the first 10, you can still ask for the rest; you don't need to complete the first 10 first. Right.
So in this way you can fetch new elements even if you have not completed the previous ones. Of course, each element, I mean each tag, needs to be completed eventually, because the tag is the one generated by the Linux multi-queue block layer: while a request is in flight, its tag cannot be reused for another request.

The question is whether, when we use ublk, the resulting Linux block device is namespaced. [German] No, currently the device is not namespaced; the security works differently. Basically, especially in the unprivileged case, only the process tree of the process that created the device can access that device, but it is not namespaced. At least, that is my knowledge as of today; I don't know if that has changed, because Ming is pretty fast at doing stuff, so it might be namespaced as of this morning. Okay, well, the device is outside the namespace, but you can restrict access to the processes that are inside the namespace. But yeah, I get it, namespacing would be preferable; that is an idea we can work on.

If you have any more questions, you can ask us personally. Okay, thank you again.