Hi everyone, my name is Ben Walker and I'll be presenting along with Changpeng Liu on high-performance NVMe virtualization with SPDK and the new vfio-user protocol. We'll be covering four sections today. The first is standardization of the protocol itself, which is new; it's being standardized through QEMU and in a library called libvfio-user. Then we'll talk about how SPDK is using this new standard protocol to emulate NVMe devices into guests. And then we'll close with my colleague Changpeng talking about how we've implemented an NVMe client library, followed by some benchmarks.

So I'll start with standardization, but first I want to mention some upcoming talks later at this conference. Importantly, there is a talk on the libvfio-user library, which is open source and available, and which implements both a client and a server using the new vfio-user protocol. That talk is at the very end of the conference, and I know this one is at the very beginning, so you'll have to wait. But it will cover all of the details about how the protocol works, the current status of the enabling library, libvfio-user, and a lot of the lower-level details that we'll have to skip over here. There's also a talk on live migration support, which touches on vfio-user as well, tomorrow on Thursday. That will cover the details of how to migrate VMs when you're using vfio-user to present disks. We are going to talk here about using vfio-user to present NVMe devices into the guest.

So with that, since I'm going before the other talks, some brief background on what vfio-user is. There is a need to emulate storage devices, or really many kinds of devices, but storage in our case, outside of the VMM. The reasons we want to do this are performance and security, but also stability and resilience: if the target process crashes, we need it to be able to restart quickly and not take down the VM. We may even want to run the emulation, the target itself, in a separate VM. So the concept of vfio-user was initially conceived as a way to emulate devices in a separate process, outside of the guest, for disks. The original goal was to present virtualized NVMe devices into guests, but of course vfio-user has become much broader than that use case; now it can emulate any device type.

The actual vfio-user protocol, and again there's much more detail on this in later talks, is modeled after the VFIO ioctls, because we basically want to do the same thing, except instead of issuing ioctls we send commands over a Unix domain socket. Another way to think about it is that it's like vhost-user, but not tied to virtio: the protocol looks more like VFIO, and you can do arbitrary operations to emulate a broader range of devices instead of just virtio devices. The protocol itself is agnostic to the VMM; it is not tied to QEMU, and anything can implement it. We'll show a demonstration in this talk of a full implementation that doesn't use QEMU at all.

OK, so emulating NVMe devices. The initial problem statement was that we want to present emulated NVMe devices into guests, because guests all have NVMe drivers built into them: Windows does, the BIOS has an NVMe driver, and of course Linux and FreeBSD do too.
Any operating system you'd likely run in a guest has an NVMe driver in it, whereas they don't all have virtio drivers, virtio-blk in particular. So the original idea was to present fully virtualized NVMe devices into the guest, and we needed a way to do that. The folks at Nutanix and elsewhere went off and found a way to emulate any PCIe device. They came up with several strategies, and it has morphed over time to become vfio-user. Then they said, OK, we've built a little prototype, maybe emulating a GPIO device or something simple; now we need code that knows how to act like NVMe. So they came to me and the broader SPDK team and said, hey, you folks know a lot about NVMe, what code can we use to pretend to be an NVMe device?

My first thought was that NVMe over Fabrics, which is remotely connected NVMe over transports like TCP or RDMA, already requires the software to essentially emulate an NVMe device. There are a couple of important differences, which is what we'll cover mostly in this talk. And the target, at least in SPDK and also in the Linux kernel, is already designed around the concept of a pluggable transport layer. NVMe over Fabrics lets you export disks over the network using TCP, RDMA, or Fibre Channel, so the software already has a common layer and a plug-in system for different fabrics transports. The idea was, maybe we can reuse all of this.

So that's exactly what we did. We set out to make a new NVMe over Fabrics transport where the transport is shared memory with vfio-user. The transport here is really shared memory between the guest and a backing process, plus a Unix domain socket for control messages. That's what we'll cover for the rest of this talk: the problems with that approach, and also the success we've had with it. NVMe over Fabrics, unfortunately, is slightly different from local NVMe; in particular, some of the initialization flow is reversed, and I'll get into that. So the first question we had to answer was whether we could generalize the NVMe over Fabrics transport plug-in system that was already in SPDK to handle these differences. Fortunately, the answer is yes.

Here's what it looks like today. This is all checked into SPDK and available for anybody to test out. You have QEMU with a guest running its regular stock NVMe driver; whatever the guest operating system is, Windows or Linux or anything else, it thinks it has a local NVMe device and loads its NVMe driver. The device itself is emulated by the SPDK NVMe over Fabrics target running on the same host system, just in a separate process. It communicates with QEMU for the initial setup, discovery of the device, and so on using libvfio-user, which again will be covered in that later talk, on both the client and the server side; SPDK uses it on the server side here. The actual data transfers between the guest and the target, and the descriptors that describe, say, a read or a write to the disk, are also placed in shared memory rings. So once setup is done, it's all just a shared memory system; there are no system calls or anything like that, and it's very fast. This is all checked in, it's available for people to go try out in SPDK today, and it works.
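To make the plug-in idea a little more concrete, here is a minimal sketch of the pattern: a table of function pointers per transport that the generic target layer calls through. This is a simplified stand-in; the structure, field names, and socket path are illustrative, not SPDK's actual spdk_nvmf_transport_ops interface, which has many more callbacks.

```c
/*
 * Simplified stand-in for SPDK's pluggable NVMe-oF transport mechanism.
 * The real interface (spdk_nvmf_transport_ops) has many more callbacks;
 * this only illustrates the shape of the idea: the generic target layer
 * calls through function pointers, and each transport (TCP, RDMA, FC,
 * vfio-user) supplies its own implementations. The socket path is made up.
 */
#include <stdio.h>

struct transport_ops_sketch {
    const char *name;
    int  (*create_endpoint)(const char *addr); /* e.g. open a Unix socket */
    void (*poll)(void);                        /* make forward progress   */
};

static int vfio_user_create_endpoint(const char *addr)
{
    /* A real implementation would create the socket here and hand it to
     * libvfio-user; this stub just reports what would happen. */
    printf("vfio-user: would create endpoint at %s\n", addr);
    return 0;
}

static void vfio_user_poll(void)
{
    /* Poll the shared-memory submission queues and run libvfio-user's
     * event handling; nothing to do in this stub. */
}

static const struct transport_ops_sketch vfio_user_ops = {
    .name            = "VFIOUSER",
    .create_endpoint = vfio_user_create_endpoint,
    .poll            = vfio_user_poll,
};

int main(void)
{
    /* The generic layer stays unaware of transport details. */
    vfio_user_ops.create_endpoint("/var/run/vfio-user/cntrl0");
    vfio_user_ops.poll();
    return 0;
}
```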
Okay, so let's go into some of the details about what we struggled with to get this working, because that's the interesting part. In NVMe over Fabrics, traditionally, there is the concept of a listener, because it's a fabrics-based thing. Something listens on an endpoint, say an IP address for TCP, and periodically attempts to accept new connections after it has started listening on that address. Those connections become the NVMe over Fabrics queue pairs. For a local NVMe device, none of this exists. And this was all baked into the generic layer in the SPDK NVMe over Fabrics target: it originally assumed there was some listener, not necessarily a socket, that occasionally had to be polled with an accept to find new connections. Of course that doesn't happen with a vfio-user device. There's only one Unix domain socket and it's not a listener; you just connect to it, assume the other side is there, and it sends you messages. You don't have to poll for accept. It's point to point: there's only one guest connected to your Unix domain socket per device, so you're not doing multiplexing or multipath or any of those things; they're all disabled, effectively like a local NVMe device. So the first thing I did was sit down and generalize the concept of a listener into an endpoint. We pushed the idea of having a listener, and having to poll it to accept new connections, down into the individual transports, because now they behave differently, and each transport does whatever it thinks is right. TCP and RDMA are fairly similar here, since they're both IP-based; Fibre Channel does its own thing; and vfio-user is very simple, it just opens the domain socket and assumes somebody is on the other side.

Okay, so some of the other challenges. With a local NVMe device, there are register reads and writes that the driver is going to make. NVMe by spec defines a set of registers in a PCIe BAR, and these are read and written with MMIO during initialization. The main difference is that with a fabrics device there's obviously no MMIO; there's no PCIe BAR in NVMe over Fabrics. These are replaced by fabrics Property Get and Property Set commands. So the NVMe over Fabrics target in SPDK understands how to process Property Get and Set commands, but it doesn't know what an MMIO is. Fortunately, the way libvfio-user handles MMIO is that we can create a thread on the target side blocked on a file descriptor. That file descriptor becomes ready and wakes our thread up any time the client does an MMIO to the memory-mapped region. It wakes us up and says, hey, they did a write, or they're trying to read from this memory range. We then take that, generate a fabrics Property Get or Set command as needed, based on whether it's an MMIO read or write, and send it to the SPDK NVMe over Fabrics target. So we basically translate an MMIO-style operation into a command. These commands are inherently asynchronous, whereas MMIO, at least on the read side, is synchronous. So for an MMIO read we have to have this thread block: it sends the command to SPDK, which processes it asynchronously and completes it sometime later, and the thread that handles MMIO has to sit blocked, just that thread, not the whole target, until it gets a completion back.
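Concretely, the translation we perform for each guest register access looks roughly like the sketch below. The fabrics opcode and Property Get/Set function types come from the NVMe over Fabrics specification, but the struct layout and function name here are simplified stand-ins, not SPDK's internal definitions.

```c
/*
 * Illustrative sketch of turning a guest MMIO access to the NVMe register
 * BAR into an NVMe-oF fabrics Property Get/Set command. The opcode and
 * function-type values come from the NVMe over Fabrics specification, but
 * this struct layout and function are simplified stand-ins, not SPDK's
 * internal definitions.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NVME_OPC_FABRICS     0x7f  /* Fabrics command opcode     */
#define NVMF_FCTYPE_PROP_SET 0x00  /* Property Set function type */
#define NVMF_FCTYPE_PROP_GET 0x04  /* Property Get function type */

struct prop_cmd_sketch {
    uint8_t  opcode;   /* always NVME_OPC_FABRICS                      */
    uint8_t  fctype;   /* Property Get or Property Set                 */
    uint8_t  attrib;   /* 0 = 4-byte access, 1 = 8-byte access         */
    uint32_t offset;   /* register offset, e.g. 0x14 = CC, 0x1c = CSTS */
    uint64_t value;    /* value being written (set) or returned (get)  */
};

/* Called when libvfio-user reports a guest access to the register region. */
struct prop_cmd_sketch
translate_bar0_access(uint32_t offset, uint64_t value, size_t len, bool is_write)
{
    struct prop_cmd_sketch cmd = {
        .opcode = NVME_OPC_FABRICS,
        .fctype = is_write ? NVMF_FCTYPE_PROP_SET : NVMF_FCTYPE_PROP_GET,
        .attrib = (len == 8) ? 1 : 0,
        .offset = offset,
        .value  = is_write ? value : 0,
    };
    /* The target submits cmd on the (internal) admin queue pair; for a read,
     * the MMIO thread then blocks until the completion carries the value. */
    return cmd;
}
```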
Again, that blocking is only for the read path; once the completion comes back, control returns to libvfio-user, which completes the read request. This is not particularly fast, but fortunately NVMe only does MMIO reads in the configuration path, during initial setup. For writes, it's a little simpler because MMIO writes are posted, so you don't have to wait for a response. We basically say, oh, you wrote four bytes to this offset in the BAR; okay, we'll generate a Property Set command, send it, and forget it. The other challenge was that the set of registers you're allowed to read or write with MMIO is a little different from the set of things you're allowed to do a Property Get or Set on, simply because local devices differ from remote devices. So we had to add some additional emulation; I think it was just two or three registers to get this fully supported.

All right. Another major difference between fabric-connected devices and local NVMe devices is the order in which they create what's called the admin queue pair. Every NVMe controller has a single admin queue pair where you do administrative operations, and then a set of I/O queue pairs, usually one per core, which is how people typically allocate them; those do the reads and writes to the disk. For a fabrics device, you have to create the admin queue pair first, because it's a TCP connection and you can't talk to the remote side until you have at least one connection. You create that first, and then you read or write the "properties", which are like registers, using that connection. For a local NVMe device, it's the other way around: you read and write registers in the BAR first in order to create the admin queue pair, and then you can do other admin operations with commands after that. This presented a bit of a challenge, because of course we're handling MMIO by sending commands, these Property Get and Set commands, on the admin queue pair, and the guest is doing MMIO before it has created the admin queue.

So what we do in SPDK on the back end is this: as soon as the endpoint is created, that is, as soon as we decide we're going to create a Unix domain socket and you should open it, we create the admin queue pair internally in SPDK. It's not mapped to any shared memory in the guest yet, so the guest can't send admin commands on it right away, but we can still send our internally generated Property Get and Set commands at the beginning. So we just create one right away; we know we're going to get a bunch of MMIO to do configuration, and we can send those on that internal admin queue. Later in the initialization process, when the guest tries to create an actual admin queue, it says: here's my memory in the guest, here's my descriptor ring in shared memory that I'd like you to look at when I send you admin commands. Then we take our internal admin queue and point it at that newly mapped memory and say, this is really where the commands are, instead of the temporary internal thing. That works around the order-of-operations difference between NVMe over Fabrics devices and local NVMe devices, so we can still use the NVMe over Fabrics target in SPDK to emulate these local devices. That was probably the trickiest part of the whole thing, but it actually worked out in a relatively small amount of code. With those things in place, it just started working.
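Here is a rough sketch of that two-phase admin queue setup, assuming hypothetical structure and function names rather than SPDK's real internals: create the queue pair against private scratch memory as soon as the endpoint exists, then rebind it to guest shared memory once the guest programs the AQA/ASQ/ACQ registers.

```c
/*
 * Illustrative sketch of the two-phase admin queue setup described above.
 * Structure and function names are hypothetical; SPDK's real implementation
 * lives in the vfio-user transport plugin inside the NVMe-oF target.
 */
#include <stdint.h>

struct admin_qpair_sketch {
    void    *sq_ring;       /* submission queue memory                    */
    void    *cq_ring;       /* completion queue memory                    */
    uint16_t entries;       /* queue depth                                */
    int      guest_backed;  /* 0 = internal scratch, 1 = guest shared mem */
};

/* Phase 1: called as soon as the endpoint (Unix domain socket) is created,
 * before any guest has connected, so internally generated Property Get/Set
 * commands have a queue to travel on. */
void admin_qpair_init_internal(struct admin_qpair_sketch *aq,
                               void *scratch_sq, void *scratch_cq)
{
    aq->sq_ring = scratch_sq;   /* private buffers the guest never sees */
    aq->cq_ring = scratch_cq;
    aq->entries = 32;
    aq->guest_backed = 0;
}

/* Phase 2: the guest has programmed AQA (0x24), ASQ (0x28) and ACQ (0x30);
 * repoint the existing admin queue pair at the guest's shared-memory rings. */
void admin_qpair_bind_to_guest(struct admin_qpair_sketch *aq,
                               void *guest_sq, void *guest_cq,
                               uint16_t aqa_entries)
{
    aq->sq_ring = guest_sq;
    aq->cq_ring = guest_cq;
    aq->entries = aqa_entries;
    aq->guest_backed = 1;
}
```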
We got everything together and pushed the patches to SPDK to generalize the transport interfaces ahead of time, before vfio-user was actually ready. When it came down to it, the final patch that went into SPDK to support vfio-user only added a new plugin; it didn't change anything else in SPDK. It was just adding a file and touching some of the build infrastructure so it gets compiled. This was great, exactly what we wanted: we wrote a new transport plugin and nothing else had to change. We also think this generalization will be useful to us in the future as SPDK starts to be used in more places, in particular as firmware on various accelerator devices. Those will have other pieces of hardware that behave in various NVMe-related ways. We may add another transport at some point that behaves more like TCP and RDMA, and this work should make that easier, but it may also need aspects of what we've done here. Ultimately, SPDK becomes a great NVMe emulator, and we have since used this to quickly prototype possible NVMe features that we're thinking about adding to real SSD products: just hack it up in SPDK, pretend you have the feature, and see what it does, how it behaves, and whether we can use it. So it's pretty neat.

Okay, from here I'm going to hand it off to my colleague, Changpeng, who's going to talk about how we used vfio-user on the client side, so that we have an alternate client to QEMU that can connect to our target, primarily for testing purposes. He'll close with some performance benchmarks, which is the exciting part.

Thanks, Ben. Let's continue with the rest of this presentation. The first topic is the NVMe client library. As you already know, QEMU as the client is one usage scenario for the vfio-user NVMe target, but we can also use the SPDK NVMe client with the vfio-user NVMe target. We have a client-side vfio-user PCI library that provides abstract APIs to access emulated PCI devices, and on top of this library we added a vfio-user transport to the NVMe client library, so that users can use the existing NVMe APIs to connect to a remote vfio-user NVMe target.
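As a rough sketch of what that looks like from an application's point of view, the existing host APIs are used with a vfio-user transport ID. This assumes the SPDK environment has already been initialized, the socket path is made up, and the exact transport string may differ by SPDK version, so check the SPDK documentation for your release.

```c
/*
 * Sketch of connecting SPDK's NVMe client (host) library to a vfio-user
 * target. The socket path is hypothetical and the transport string spelling
 * may vary between SPDK versions.
 */
#include <stdio.h>
#include <string.h>
#include "spdk/nvme.h"

int connect_vfio_user_example(void)
{
    struct spdk_nvme_transport_id trid;
    struct spdk_nvme_ctrlr *ctrlr;

    memset(&trid, 0, sizeof(trid));
    /* Hypothetical socket path created by the target; adjust to your setup.
     * Assumes the application has already initialized the SPDK environment. */
    if (spdk_nvme_transport_id_parse(&trid,
            "trtype:VFIOUSER traddr:/var/run/vfio-user/cntrl0") != 0) {
        fprintf(stderr, "failed to parse transport ID\n");
        return -1;
    }

    /* The same call used to attach PCIe, TCP, or RDMA controllers. */
    ctrlr = spdk_nvme_connect(&trid, NULL, 0);
    if (ctrlr == NULL) {
        fprintf(stderr, "failed to connect to the vfio-user controller\n");
        return -1;
    }

    /* From here, the normal NVMe client APIs (namespaces, queue pairs, I/O)
     * work exactly as they do for any other transport. */
    spdk_nvme_detach(ctrlr);
    return 0;
}
```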
Okay, let's look at the performance numbers for the vfio-user NVMe target compared with the vhost-user block target. The vhost target and the vfio-user target actually have different threading models. For the vhost target, I/O is processed per vhost controller; that means for multiple I/O queues in one controller, all the I/O queues are processed in the same core context. For the vfio-user NVMe target, each submission queue can be processed on a different core. Here we have an example to evaluate the efficiency of vfio-user: we use a null bdev, 16 vCPUs, and 16 I/O queues. The performance of vfio-user NVMe increases with the number of cores used on the host, while vhost-user-blk stays almost the same even when you use four cores on the host.

Here are another two test cases. The first chart tests four VMs using one, two, and four cores respectively; you can see the performance increases roughly linearly as we go from one to two to four cores on the host side. Another test case is shown in the second chart, where we compare performance with vhost-user-blk. For this comparison, we don't expect vfio-user NVMe to improve performance much, because the mechanisms are quite similar. The result shows that vfio-user NVMe is still a little slower than vhost-user-blk, so that means we still have some work to do to optimize the I/O path for vfio-user NVMe.

And here are several more test cases using a physical NVMe SSD as the backend. First, we tested the physical SSD from the host using the SPDK NVMe perf tool. That performance number hits the hardware's limit, and we use it as the baseline to compare against the other test cases. Then we ran the SPDK NVMe perf tool inside the VM to test the emulated NVMe SSD, and thanks to the polling mechanism on both the VM and the host side, the performance inside the VM is almost the same as the physical NVMe SSD. Finally, we ran fio test cases inside the VM using the native NVMe driver and the virtio-blk driver, respectively. With vfio-user NVMe we get 725K IOPS, while vhost-user-blk gets 786K IOPS; vfio-user NVMe reaches about 84% of the physical NVMe baseline, while vhost-user-blk reaches about 91% of the baseline. So it's a really high-performance, efficient solution for full virtualization.

That's all for today's presentation. For any questions or issues, feel free to contact us in the SPDK Slack channel; you can find the information on the spdk.io website. Thanks for watching.