Okay, hi everyone, I'm Alberto Faria. I'm here with Stefan Hajnoczi; we're both from Red Hat, and this is a presentation about libblkio, a high-performance block I/O API. To get started, let's do a brief recap of what block devices are. In this presentation, by block device we really just mean the general storage abstraction, which is essentially an array of fixed-size blocks. The number of blocks is also fixed, so the device's capacity is fixed. Most NVMe devices, SCSI devices, and virtio-blk devices follow this block device model. The way a user interacts with a block device is by submitting requests. The simplest kinds of requests are reads and writes: read one or more blocks, or write one or more blocks, at once. Sub-block accesses aren't possible with block devices; you can't read or write just part of a block, only one or more whole blocks at a time. Other kinds of requests are flush, which ensures that previously written data is persistently stored, and there's also discard, write zeroes, and perhaps a couple more on some devices. And block devices are used pretty much everywhere, although the direct users are usually infrastructure: things like databases, file systems, and hypervisors. Applications can also use block devices directly, and many do, but most commonly they use a file system or some other high-level storage abstraction, which in turn uses a block device.

libblkio, the library we're presenting here, actually came out of QEMU. QEMU has block drivers, and over time it accumulated a lot of different ones, from simple io_uring bindings to actual user-space NVMe drivers. And still more drivers were needed, in particular, recently, virtio-blk vhost-vdpa and virtio-blk vhost-user. Although the code for these drivers is QEMU-specific as written, most of it could be made generic and used by other applications. So we decided to develop these new drivers in a separate library, and that's where libblkio comes in.

Today there are quite a few different block I/O interfaces for accessing block devices, but they all differ from each other in certain respects and have different advantages and different applicability. Plain old POSIX read and write system calls are very simple and device-agnostic, and can access a lot of different block devices, as long as the kernel supports them. io_uring, on Linux, is similar but asynchronous, and achieves lower system call overhead. More recently, io_uring also supports uring_cmd, which allows us to submit NVMe commands more directly to an NVMe device, bypassing the VFS. It's also possible to implement user-space drivers on top of VFIO, both for NVMe and for virtio-blk devices. And there are things like vhost and vdpa to access a slice of a physical or virtual block device. (Stefan takes over.) I'll move on to the next interface I wanted to mention: we also have vhost-user-blk, which is used, for example, to connect SPDK processes to QEMU VMs. So what we've got here is a lot of different storage interfaces, and they keep growing; there are more and more of them. But what's the same? What do they all have in common?
Well, they all have the same basic types of I/O requests, like reads, writes, flush, discard, write zeroes. They all have the concept of queues. But the problem is, if you're developing an application that uses block I/O, implementing support for one interface takes some amount of effort, and when you implement the next one, all the little details are different. If it's a descriptor ring, the layout is different. Does it support polling? How do you integrate it into your event loop? Are there any I/O memory buffer constraints? (That's something we'll get into later.) All these little things differ between these interfaces, which means the overhead of adding support for another one is relatively high: it's going to take you time to implement.

So that's where libblkio comes in. The idea is that we're seeing a lot of duplicated effort, say in implementing io_uring support, and we'd like to take that and put it in a library, so that applications other than QEMU can use it too. Going back to our initial reason for doing this: yes, we can invest the time and effort to implement these block drivers in QEMU. But then when some other program also wants to access the same kind of storage, to get at the disk images we want to share with our VMs, it can't without duplicating that significant effort again. So we've got libblkio. It can be used for all these use cases: databases, file systems, obviously all the emulators and hypervisors doing block I/O, and also I/O frameworks, as well as backup, forensics, and disk imaging tools.

So what is libblkio? It's a C API. The library itself is implemented in Rust, but it exposes a C API because we want it to be easy to integrate into applications written in any language. And it provides the different drivers we mentioned, so you get access to all these block I/O interfaces through one unified API. We've just made the libblkio 1.0 release, and the drivers included at the moment are: an io_uring driver; the NVMe uring_cmd driver that Alberto wrote; a virtio-blk VFIO PCI driver; a virtio-blk vhost-vdpa driver; and a virtio-blk vhost-user driver. So you already get quite a few drivers, and we're hoping that in the future, when we develop new drivers, we can do it here in libblkio so they're easy to share and reuse.

To give you an idea of how you would use libblkio in your own application: you start by creating a blkio instance and telling it which driver you want, like io_uring here. At that point you've created the instance but you haven't set it up or configured it yet. So, for example, with the io_uring driver you can set a path property to tell it which file to open; there are API calls for that. Whatever the type of driver or storage, this is where you set up your connection, and then you call connect to get a connected instance. At that point you can do last-minute setup, like how many queues you would like, and then you start the instance. Once it's started you can get the queues and submit I/O. So that's the lifecycle.
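In code, the lifecycle looks roughly like the following. This is a minimal sketch against the libblkio 1.0 C API as presented here; the device path is just a placeholder and error handling is trimmed, so check blkio.h for the exact details.

```c
#include <blkio.h>
#include <stdio.h>

int main(void)
{
    struct blkio *b;

    /* Pick a driver at creation time. */
    if (blkio_create("io_uring", &b) < 0) {
        fprintf(stderr, "blkio_create failed\n");
        return 1;
    }

    /* Pre-connect configuration: tell the driver which file to open.
     * "/dev/nvme0n1" is just a placeholder path. */
    blkio_set_str(b, "path", "/dev/nvme0n1");

    if (blkio_connect(b) < 0) {
        fprintf(stderr, "blkio_connect failed\n");
        goto out;
    }

    /* Post-connect, pre-start configuration. */
    blkio_set_int(b, "num-queues", 1);

    if (blkio_start(b) < 0) {
        fprintf(stderr, "blkio_start failed\n");
        goto out;
    }

    /* The instance is now started: queues are available and I/O can
     * be submitted on them. */
    struct blkioq *q = blkio_get_queue(b, 0);
    (void)q;

out:
    blkio_destroy(&b);  /* tears down the instance and NULLs b */
    return 0;
}
```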
And the queues themselves: the typical model for queues in storage and block devices is that they're all independent. Some applications might implement their own thread safety if they're multi-threaded, but the typical approach is for each thread to have its own queue. Especially if you do thread-per-core, it's very straightforward: you don't need to worry about locking, every thread just has its own queue. The library follows the typical queue semantics that block devices have, which means there's no request ordering, either within or between queues. And that's normal, because applications that do block I/O have to handle ordering themselves: if they want to make sure one request completes before another, they wait for the first request to complete rather than submitting both simultaneously.

One thing libblkio abstracts, which can be a bit of a pain when working with low-level devices, is that the device typically has some limit on the number of requests you can put in a queue, because the queue has finite resources like descriptors. If you're writing a relatively high-level program that isn't trying to do low-level things or squeeze out every last bit of performance, it can be tedious to build your own flow control mechanism, like back pressure where you stop submitting when the queue is full. So libblkio has built-in queuing: you can actually submit as many requests as you want, which makes writing applications convenient. It adds a little on top of what you'd get accessing a raw device.

The API offers three different I/O modes for integrating it into your application. The simple one, of course, is blocking mode, where you issue a read and block until the read is done. The limitation of that approach is that you can only do one request at a time per thread, and that's why event-driven applications would not use it. They take a more async approach instead: they submit as many requests as they want, and then use their event loop to wait for completions to come in and process them. For that event-driven mode, libblkio provides an eventfd-style completion fd that you can easily integrate into your event loop. The third mode is for applications that really want to minimize latency: polled I/O. For polling, we actually ended up implementing two different approaches, because it depends on your needs. The first is what we call application-level polling, where the application itself makes repeated calls into libblkio to find out whether the I/O it's waiting for is done. The problem with this is that, with io_uring for example, you're not truly polling all the way down the stack: with application-level polling you're making system calls to ask io_uring whether requests have finished, so you still have IRQs, and that means you still have the CPU overhead of the IRQs coming in. So we also have what we call driver-level polling mode, where the application gives up control: it calls into libblkio and libblkio does the polling. That means we can go all the way down; we can use the Linux I/O polling feature, setting the HIPRI flag on I/O requests, which tells the Linux block layer to poll internally in the kernel. Then the NVMe driver, for example, can sit there and spin, submitting requests on a queue that has no interrupts, so you have no interrupt overhead and don't waste CPU on it. You might be wondering: why implement two modes if driver-level polling is obviously better? The answer is that with application-level polling, since the application itself keeps calling into libblkio, the thread can also do other things in between; if it also needs to wait on a socket or something else, that's the more usable approach. When you let the driver poll, you're stuck doing just that one thing, monitoring that one device, so if you're waiting for anything else, you need another thread to do it.
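To make those modes concrete, here's a sketch of blocking submission plus the event-driven setup on an already-started queue (see the lifecycle sketch above). Again this follows the libblkio 1.0 C API as we understand it, with error handling trimmed; the helper names are ours.

```c
#include <blkio.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Blocking mode: submit one read and wait for its completion. */
static int blocking_read(struct blkioq *q, uint64_t offset, void *buf,
                         size_t len)
{
    struct blkio_completion c;

    blkioq_read(q, offset, buf, len, NULL /* user_data */, 0 /* flags */);

    /* Wait for at least (and at most) one completion. */
    if (blkioq_do_io(q, &c, 1, 1, NULL) < 0)
        return -1;
    return c.ret;  /* per-request result */
}

/* Event-driven mode: enable the completion fd and hand it to your
 * event loop; when it becomes readable, drain completions without
 * blocking by calling blkioq_do_io() with min_completions = 0.
 * Application-level polling is the same non-blocking call, made
 * repeatedly in a busy loop instead of waiting on the fd. */
static int event_driven_fd(struct blkioq *q)
{
    blkioq_set_completion_fd_enabled(q, true);
    return blkioq_get_completion_fd(q); /* register with epoll/poll */
}
```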
We saw before that part of the device lifecycle, when creating your blkio instance, is setting properties like the path. The drivers actually export a bunch of other properties too: you can read all of them, and write some of them. There are things like the maximum transfer size of an I/O request, which basically tells you that this hardware won't accept requests bigger than that limit, and the memory buffer alignment; all this kind of information can be read from the blkio instance. The number of queues is one of the most obvious properties you'd want to set. So the drivers have these properties that the application can use to configure them.

Okay, so that's the basic stuff; this is pretty standard for any block I/O. One of the more interesting things is that some of the drivers we support, like vhost-user-blk and VFIO PCI-based devices such as virtio-blk over VFIO PCI, don't have the ability to directly access your memory, not without some preparation. You can't just take an arbitrary pointer into your virtual address space and do I/O to and from it, the way you can with, say, a pread system call; we need to register memory regions with the blkio instance first. The reason is that, with VFIO for example, you need to do DMA mapping: you need to program the IOMMU so it knows which pages the hardware may transfer to and from. So we have an API for memory regions, and it abstracts this. It doesn't matter whether you're using vhost-user or VFIO: the memory region API lets the application say, "this is where my I/O buffers are going to be, I want to register it", and from then on that memory works with VFIO devices and so on. Some drivers don't need this. Plain io_uring by itself, for example, doesn't, unless you enable fixed buffers, which is one of its optimizations. But the API is there, and an application that wants to work with all drivers has to use it. The basic idea with memory regions is simply that the application registers its I/O buffers, its memory, ahead of time before issuing requests, and can unmap them later if it wants to. There can be a limit on the number of regions: vhost-user devices, for example, don't support an unlimited number of memory regions, so that's also something the application has to take care of.
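Here's a sketch of what registering an I/O buffer looks like, assuming the libblkio 1.0 memory region API. Letting the library allocate the region means the memory already satisfies the driver's requirements (alignment, fd-backed for vhost-user, and so on); drivers also advertise whether registration is mandatory via a property ("needs-mem-regions"). The helper name is ours.

```c
#include <blkio.h>
#include <stddef.h>

/* Allocate one I/O buffer and make it visible to the driver
 * (DMA-mapping it or sharing its fd with the device process, as the
 * driver requires). Must run on a started blkio instance. */
static void *setup_io_buffer(struct blkio *b,
                             struct blkio_mem_region *region, size_t len)
{
    if (blkio_alloc_mem_region(b, region, len) < 0)
        return NULL;

    if (blkio_map_mem_region(b, region) < 0) {
        blkio_free_mem_region(b, region);
        return NULL;
    }

    /* region->addr is now usable with blkioq_read()/blkioq_write()
     * on any driver, including those that need registered memory. */
    return region->addr;
}
```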
So that's the basic summary of what libblkio is and what it can do: it's a unified multi-queue block I/O API, and it takes a lot of the functionality we've been developing in the QEMU community and puts it somewhere other programs can reuse.

Okay, so next let's move on to a case study. I'll explain how we integrated libblkio into QEMU as a QEMU block driver, along with some performance evaluation. QEMU has a full block layer, because QEMU is an emulator and it emulates all these storage controllers: IDE controllers, SCSI HBAs, and so on. So it's a little different from some of the other programs we showed before, like databases, which would probably need a more limited set of I/O request types and API calls to get their job done, because they just want to store data. QEMU, since it needs to emulate disks, needs to represent everything that a disk can do. And so QEMU is actually a pretty good proof of concept for integrating libblkio into an application: if QEMU can emulate disks using libblkio, exercising basically every API call libblkio has to offer, then we can be pretty confident that more limited use cases will work as well. The drivers we expose in QEMU via libblkio are io_uring, virtio-blk vhost-user, and virtio-blk vhost-vdpa. This takes around 700 lines of code, but as I mentioned, if you want to implement libblkio support in something else, say a database, you'll probably need less code, because you won't be calling every single API and enumerating everything a disk can do. We expect the driver to land in QEMU 7.2, so in the next release; the patches are on the mailing list.

Just to recap where this fits in QEMU's architecture: when you're running a virtual machine, the guest submits I/O requests, QEMU's device emulation code picks them up, and it passes them to the QEMU block layer. The QEMU block layer has these different block drivers, and what we've added to QEMU is the blkio block driver, which is the bridge that connects to libblkio. If you want to know more about block drivers and developing them, there's actually a presentation that goes into the details; here I'm just going to focus on the most interesting part of this integration. Most of it was calling the libblkio API in a straightforward way, but the one area where we had a challenge is how to deal with the memory where the I/O buffers reside in QEMU. In QEMU there are roughly three types of memory used for I/O requests. First, guest RAM, which is pretty long-lived and more or less static; every once in a while you might hot-plug some RAM, so it can change, but in general it's straightforward. Then there are QEMU block drivers like the qcow2 image file format or the crypto driver, which have internal buffers they need, either for metadata or for intermediate data; the crypto driver, for instance, first reads the encrypted data and then decrypts it into guest RAM, so it wants a buffer in between. And finally there are places that do small amounts of I/O and just use a buffer that's a variable somewhere on the stack, or heap-allocated, so that memory can be kind of anywhere and isn't set up in a very organized way. So how do we deal with memory regions?
This is a problem you won't have if you're developing a new application from scratch for libblkio, because then you'll come up with a simple strategy for where to allocate I/O buffers. But if you're integrating it into an existing application like QEMU, you're going to have to figure out: where is my memory located, and how can I register it? Registering the guest RAM in QEMU is easy, because as I said it's pretty static, so we can just enumerate it and register it, and if it changes, we have a callback and we go and register the new part. The harder part is those intermediate buffers I mentioned, which come from within QEMU itself and are not in guest RAM. There are a few strategies people take to solve this kind of problem. One approach is to allocate a buffer temporarily, map it, do the memcpy between this new bounce buffer and the real data, and then unmap and free the buffer after the request is done; that's the temporary-mappings approach, and it's the most expensive one. Another approach is to have a memory pool, map the whole thing once, and take bounce buffers from the pool whenever you need one; that way you at least avoid doing the mapping and unmapping all the time. The final approach would be to go into the source code, find all the places these I/O buffers come from, and change all of them, maybe to use a new API that fetches memory from a pool that's already mapped. But that's very invasive, and it's very hard to do for the stack and heap buffers: if it's a stack buffer, you'd need to change the code significantly to take that memory from a pool and release it. So we didn't do that. For the blkio driver in QEMU, we decided to have a pool, so we can avoid the cost of mapping. This is definitely something to consider when integrating libblkio into other existing applications.

In case you're wondering what the big deal with mapping is, why it's expensive: vhost-user-blk devices may be among the most expensive ones, because the device runs in a separate process from yours. QEMU runs in one process and the vhost-user-blk device in another, so how does that other process even access the memory? They have a mechanism for sharing memory: over a Unix domain socket, the front end sends an add-region request and passes a file descriptor to that shared memory, and the vhost-user device can then mmap it; to unmap, you do the reverse. You can see that several system calls are involved, plus IPC and so on, so this is going to be expensive and slow. To look at the cost of mapping and unmapping, and why a good strategy matters, I ran a small benchmark with the QEMU storage daemon acting as the vhost-user-blk device and QEMU talking to it through libblkio's virtio-blk vhost-user driver. The difference between mapping and unmapping for every single request versus having a permanently mapped buffer is pretty significant: about a three-times difference between the two. So the mapping strategy is very important to optimize in order to get good performance.
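As an illustration of the memory pool strategy, here's a toy fixed-size bounce-buffer pool: one large region is allocated and mapped once at startup, and buffers are handed out from a free list, so no map/unmap traffic happens on the I/O path. This is our sketch, not QEMU's actual implementation; the pool and buffer sizes are arbitrary, and the same libblkio 1.0 API assumptions as above apply.

```c
#include <blkio.h>
#include <stddef.h>

#define POOL_BUFS     64
#define POOL_BUF_SIZE (64 * 1024)

static struct blkio_mem_region pool_region;
static void *pool_free[POOL_BUFS];
static int pool_free_count;

/* Allocate and map the whole pool exactly once, up front. */
int pool_init(struct blkio *b)
{
    if (blkio_alloc_mem_region(b, &pool_region,
                               (size_t)POOL_BUFS * POOL_BUF_SIZE) < 0)
        return -1;

    if (blkio_map_mem_region(b, &pool_region) < 0)
        return -1;

    /* Carve the mapped region into fixed-size bounce buffers. */
    for (int i = 0; i < POOL_BUFS; i++)
        pool_free[pool_free_count++] =
            (char *)pool_region.addr + (size_t)i * POOL_BUF_SIZE;

    return 0;
}

/* Hand out / take back already-mapped buffers: no syscalls, no IPC. */
void *pool_get(void)
{
    return pool_free_count ? pool_free[--pool_free_count] : NULL;
}

void pool_put(void *buf)
{
    pool_free[pool_free_count++] = buf;
}
```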
But what about the performance of libblkio itself? I want to show that it has low overhead, because if it had high overhead you'd probably say, nah, I'd rather integrate, say, io_uring directly and not use libblkio. That's pretty easy to measure, because QEMU already has native io_uring support, so we can compare QEMU's native io_uring against libblkio's and see what the overhead is. Here are benchmark results for random reads, and you can see the two bars are pretty close. There are differences, and now that libblkio 1.0 is released we'll definitely take a look at them and focus more on performance, but this is in microseconds, so the differences are relatively small. So native QEMU versus libblkio doesn't add too much overhead. But that's just virtualization. What about libblkio on its own? Let's get rid of QEMU and look at just the bare-metal performance. Alberto implemented an I/O engine for FIO that drives libblkio, so you can compare FIO's built-in io_uring engine against libblkio directly, without running a VM, as a bare-metal application. He also implemented the polling support, so we can see what that looks like. Here you can see, first of all, that if you care about latency you're going to use polling, because its latency is a lot lower. And again you get similar results: even without the VM, the bars are pretty close. We'll keep looking at how to optimize this, but it already looks reasonable to say that libblkio has low overhead.

Okay, next I want to move on to some future work: what's the direction for libblkio, and what do we still want to do? In the beginning I mentioned we have this C API, and that's primarily what the library is, but the library is written in Rust, new things are being written in Rust, and we want to have a native Rust API. The reason is that with a native API we can actually design the API to be safe. The C API obviously involves passing a bunch of pointers and assuming the caller honors the lifetimes of those buffers, iovecs, and so on; that responsibility is on the caller and isn't checked by anything. Whereas if we design a native Rust API, we can actually make sure the lifecycle is correct, and the compiler can help us check it.
But it turns out we're not there yet; this is work in progress at the moment. The Rust side of the API is experimental and already exists: it's on crates.io, so if you grab the blkio crate you can check it out and get access to this stuff. But you'll find the methods still just pass raw iovec pointers and assume you'll hold on to those iovecs until the request is done. We haven't found a nice mechanism for this yet, and I think Rust async and threading libraries have the same kind of problem, because you hand them something and it goes off and does work; they've also thought about how to introduce the concept of scopes, so that when we have, say, an I/O buffer we want to read into, we can scope it and make sure that buffer stays alive as long as the request does. We haven't gotten there yet, so if there are Rust people here who are interested in this kind of thing, it would be great to have more discussion, and also contributions or collaboration. Hanna Reitz has been experimenting with using libblkio from Rust and with integrating it into async Rust code.

Another big one: libblkio is primarily for local storage, because that's where the really low-latency, sub-10-microsecond things we've been optimizing are. But of course there's always the possibility of extending it to remote block devices, block devices over a network or some kind of fabric. If you have NVMe over TCP, why not, you could add a libblkio driver for that too. But be aware that libblkio's control path is blocking: when I showed you the lifecycle, where you create a device, call a connect function, and call a start function, those functions are blocking, they're not async. So if you imagine connecting to an NVMe/TCP target when the network is down, or the server just doesn't respond for a long time, it will hang your thread, because libblkio will be waiting. So at some point, if we want to go there and do network storage drivers, we'll probably need an async API as well.

And another thing, which Stefano Garzarella is working on, is queue passthrough. This is not a general feature you'd use in most applications, but for emulators and virtualization it's a very interesting optimization. If your virtual machine is doing NVMe and the underlying device is NVMe, or your virtual machine is doing virtio-blk and the underlying device is also virtio-blk, why do we have all these layers in between? Why is the emulator in between, why is libblkio in between? What if we could take the queue from the actual device, map it into the guest, and have the VM's drivers or application access it directly, bypassing those layers? Stefano is currently working on the queue passthrough APIs. We're going to start with virtio-blk, and the idea is to take the vring the guest has configured and pass it down to, typically, either a vhost-user-blk device or a vhost-vdpa block device. So that's also something we're working on. And that's it. Are there any questions?

Question: On the last slide, have you thought about something like crypto? If you want to use dm-crypt or one of those layers in the guest, on top of the block I/O, and not in the host, how would that map?
Because this is definitely interesting; it would definitely reduce the layers.

Answer: I think it definitely depends on your use case, on what configuration you have. If you pass a device through into a guest, say a vhost-vdpa block device, basically a hardware virtio-blk device, then of course the guest would be the one to do the encryption, because now you're bypassing all the layers. That's the straightforward way of doing it. But maybe you were thinking of a case where you can't do the encryption in the guest?

Question: I hadn't really thought about passthrough at all; I was thinking more in the direction of the block I/O being emulated by device-mapper.

Answer: If you bypass all these layers, you lose their functionality. So either you do it in the guest, or, in the design Stefan has been working on, I think the goal is to have a dynamic mechanism: when you're not using any features that need to intercept I/O, like I/O throttling done by the hypervisor, or maybe encryption, you take the fast path; and when you enable those features, some of which are temporary (during live migration, for example, you may need to do storage migration), you stop doing passthrough temporarily. So the queue passthrough API is not a static thing that can only be set once; you can flip back and forth at runtime. That answers some of what you said, because you don't actually lose the functionality; you just take the fast path only when you can.

Question: Do you have scripting bindings for this, like Python?

Answer: No, we don't have anything at the moment, just the C API and the Rust API, which is a work in progress.

Question: You mentioned the problem with the async bindings in Rust, and how other libraries had similar issues. How do you deal with things like lifetimes in Rust in this context? Do you expose something that is strictly C-like, where everything is considered static, or is it possible to take advantage of Rust's lifetime management?

Answer: At the moment the Rust API, for the data buffers, is pretty similar to the C API, with these raw pointers. This is a difficult issue to solve in the general case. In special cases, where the user can ensure the buffer is static, or that the request will complete before the end of the buffer's lifetime, sure. But in general there might not be a fully general solution to avoiding unsafe with this kind of thing, especially once we bring memory regions into the scenario.

Question: Have you looked at Tokio and things like that?

Answer: Yes, we started looking at that. It seems like one of the accepted (though not 100% general) solutions is a scoping API, where you declare a scope and within that scope you can do I/O with libblkio, but outside it you can't; that way all your I/O buffers have a lifetime within the scope, and we can guarantee they won't go away too early. Rust async in particular will help a bit, because then we can ensure that once the call actually completes, the buffer no longer needs to stay alive; so that solves part of it. But if the user doesn't want to use Rust async, or uses Rust async together with memory regions, we'll probably still need some unsafe in there.

If there are no more questions: have a nice day!