Hi, my name is Stefan Hajnoczi and I'm going to talk about optimizing KVM for NVMe drives.

So what are NVMe drives? NVMe is a standard interface for solid-state disks: a PCI storage controller interface with an open specification, and the standard Linux driver can talk to any NVMe-compliant device, which are made by multiple vendors. What's interesting about NVMe drives is that some of them have extremely good latency, much lower than anything we've had in the past. I've quoted two figures here. One is for an enterprise SSD, something for data centers and servers: an Intel Optane drive quoted at 10 microseconds for reads and writes. But even consumer SSDs can be very low latency; the Samsung 970 EVO Plus is quoted at around 17 microseconds write latency. Now, just because a drive uses the NVMe standard doesn't necessarily mean it's one of these low-latency drives. There are significant differences, so you should check the datasheet before purchasing drives.

In 2012 Jeff Dean published a slide that became very popular, called "Latency numbers every programmer should know", probably because it's really interesting to see how long different operations in computer systems take. The reason I'm showing this table is that if you look at other storage, at spinning hard disks that store data magnetically on platters, the drive head needs to move to the correct location before it can read a block of data, so random accesses involve a seek, and that takes time. The time quoted for a disk seek is two milliseconds. Comparing the two, a hard disk at 2,000 microseconds for a random read versus an NVMe solid-state disk at 10 microseconds, there's a huge difference, and you can see why optimizing for NVMe drives is very different from optimizing for traditional disks.

So let's have a look at the relationship between IOPS and latency. This is an important one; in fact it's the critical thing that drives the rest of this presentation and the optimization work we're going to look into. IOPS is the number of I/O operations per second, and here I'm showing a simple model of doing one operation at a time, where IOPS is just the total runtime divided by the per-request latency, in other words the number of operations we can complete in that runtime. The first thing you notice is that this relationship is non-linear: it's not a straight line, it's a curve. To start investigating it, imagine we are at the right-hand side of the curve, at 20 microseconds of latency, meaning it takes us 20 microseconds to complete an operation. If we identify an optimization that shaves off two microseconds and brings us down to 18 microseconds, the graph shows how many IOPS we gain by moving from 20 to 18 microseconds. The slope is pretty flat over there at 20 microseconds, so the number of IOPS doesn't actually increase that much.
But imagine for a second that we were at four microseconds of latency, on the left-hand side of that graph, and we shave off the same two microseconds. Now it brings us down to two microseconds and we go from 250K IOPS to 500K IOPS. This might seem like an obvious relationship: at 20 microseconds, removing two only optimized away 10% of the total latency, but going from four down to two optimized away 50%, so it makes sense that the jump is bigger because it's a bigger proportional improvement. But it still leads to interesting consequences.

First of all, think about having two independent optimizations, A and B, that you want to make, each reducing the total latency by two microseconds. If you apply optimization A first and it gains you 10K IOPS, then optimization B will gain you more than 10K IOPS, because it was applied after the total latency had already been reduced. And because they're independent, if we apply them in the other order, B first and then A, we get the reverse: B gives us the 10K IOPS improvement and A gives us more than that. So this can be misleading. Depending on the order in which we apply optimizations, we can fool ourselves into thinking that one boosts IOPS a lot and the other doesn't, when in terms of latency they save the same amount of time. That's why a statement like "IOPS increased by 10K" doesn't convey enough information; you need to know at least the baseline IOPS level you started from in order to understand how the latency changed.

The other important thing about this graph is that NVMe drives, at 10 microseconds of latency or less, sit on the left-hand side, where the curve becomes very non-linear and the slope gets very steep. We need to keep in mind that any small change to latency, increasing it or reducing it, can have a big effect on IOPS there.

So we need to re-examine the guest and host software stack because the hardware is faster, and we need to rethink the architecture of QEMU, of KVM, and of the Linux drivers. When I say rethink the architecture, that's not a buzzword. What I mean is: think back to the IOPS-versus-latency curve. Optimizations that we considered in the past and that had no measurable effect, that weren't significant, that we decided were too complex, or that we didn't implement for whatever reason, might now be relevant again, because even shaving off a little bit of time on the left-hand side of that curve can boost IOPS a lot and allow applications to get better performance. That's what I mean by rethinking the architecture: we really can reconsider things that didn't make sense in the past. And that's the 10-microsecond challenge we're going to talk about.
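To recap the arithmetic behind that curve: at queue depth 1, IOPS is simply one second divided by the per-request latency, so the gain from shaving off the same two microseconds depends entirely on where you start. A quick sanity check in shell, using the illustrative latencies from the graph:

    # IOPS = 1 second / latency when one request is in flight at a time
    for lat_us in 20 18 4 2; do
        awk -v l="$lat_us" 'BEGIN { printf "%2d us -> %6.0f IOPS\n", l, 1000000 / l }'
    done
    # 20 us ->  50000 IOPS
    # 18 us ->  55556 IOPS   (2 us saved gains ~5.6K IOPS)
    #  4 us -> 250000 IOPS
    #  2 us -> 500000 IOPS   (2 us saved gains 250K IOPS)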
Okay, so let's begin by looking at the I/O request life cycle. The I/O request life cycle is at the core of what we need to understand in order to optimize this. Here is a high-level model. It doesn't show the specifics of virtualization: you don't see QEMU, you don't see KVM, emulated devices, or the operating system. What it shows is an application running on the vCPU that decides to submit an I/O request, in this case a read. The request needs to be prepared, then a message is sent to the device notifying it that a request is available, and the device can process that request. Once that message has been sent, once we've submitted our I/O, the vCPU might have no more work to do, or it might have other tasks it can run in the meantime while we're waiting for the request to finish. Either way, at this point the critical path is in the hardware: the device, with its latency of say 10 microseconds, is processing the request, and when it finishes it sends back a completion notification to the vCPU. The vCPU then processes that completion and resumes the application, because the I/O has finished. That's the life cycle of a single request in isolation. The important thing to understand is that if we're trying to optimize the software layers, we need to study the mechanism through which requests are submitted, the mechanism through which they are completed, and of course the code paths and layers of code that parse requests or create them in memory, because that's what we can optimize; that's under our control.

You might have noticed that I've been mentioning latency all the time. In this presentation we are going to focus on latency, but latency is just one performance factor; there are others. Request parallelism is another thing that really boosts performance, and batching is one of the techniques for benefiting from it. They can hide poor latency: if you can issue a lot of requests at once, you can get a lot of work done even when the latency of a single request is relatively long. But that's not what we're looking at here. The reason we're looking purely at latency is that it's the fundamental thing; if we optimize latency first, we can consider those other factors later. And there are latency-sensitive applications from which you can't hide poor latency, because they need a specific request to complete before they can continue. Even if plenty of parallelism is available to them, they are bounded by the latency of that one request they are waiting for. Those are the applications we're trying to optimize for, and they suffer the most when run virtualized on extremely fast hardware, because that's where the overheads become apparent.

So what we're going to do is look at queue depth 1 benchmarks, which means submitting only one request at a time, and we're going to focus primarily on small block sizes. I will show graphs that include the larger block sizes as a reference, but what we'll see is that other factors come into play there: the data transfer time and so on become more relevant and dominate, and we see less of the submission and completion latency that affects applications using small block sizes.
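As an illustration of what such a benchmark looks like (a hedged sketch, not the exact job file used for the results in this talk; /dev/nvme0n1 is a placeholder for a scratch NVMe namespace):

    # Queue depth 1, 4 KB random reads against a raw NVMe namespace
    fio --name=qd1-randread --filename=/dev/nvme0n1 --direct=1 \
        --rw=randread --bs=4k --iodepth=1 --ioengine=libaio \
        --runtime=30 --time_based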
If you think about networking performance, the analogous approach there is benchmarking small packet sizes, which tends to expose the per-packet cost, and it's the same idea here; that's what we're focusing on. But you can get other perspectives: if latency isn't the only thing you're interested in, take a look at these talks. There is another talk at KVM Forum this year that investigates NVMe performance, and last year there was a great talk at KVM Forum comparing storage performance between hypervisors, NVMe, and various other configurations.

Okay, I mentioned that the mechanisms through which we submit and complete requests are important because they can determine the latency. So let's look at the mechanisms used in Linux and KVM today. These aren't the only ones, but they're the main ones.

First there is eventfd, which is a counter exposed as a file descriptor. When you read from the file descriptor, the counter is reset to zero; if it's already zero, the read blocks. If the counter is incremented multiple times, you don't need to read it multiple times, because a single read already resets it to zero, so multiple notifications are coalesced into a single read, which is nice for performance. Because it's a file descriptor, it relies on the kernel scheduler to wake threads: if you're reading from it, or waiting in a select-style system call for the file descriptor to become ready, your thread may be descheduled, and the physical CPU may even be halted and put into a low-power state. Waking up a halted CPU and resuming a descheduled application thread has a latency cost, so eventfd is not necessarily the lowest-latency approach, but it is widely used: by VFIO interrupts, by KVM's ioeventfd and irqfd mechanisms, and by the Linux AIO and io_uring completion mechanisms.

The other approach, polling or busy waiting, is also popular. This simply means looping and continuously checking for the event you're waiting for. Of course, spinning in a tight loop consumes CPU cycles and no other task can run while you're doing it, so it's not very power efficient and it does hog the CPU. But remember how eventfd relies on the scheduler and the CPU could be halted, with the latency associated with that: polling doesn't have that problem, because while you're polling you are running on the CPU, you're in control, and you simply read the completion value and see that the request is ready. So it has great latency, and that's why it's used: by QEMU's AioContext, by KVM's halt_poll_ns, by cpuidle-haltpoll, by Linux block layer I/O polling, and by DPDK and SPDK. They use it in different ways and in various flavors, but busy waiting is used to reduce latency. The reason I covered these mechanisms is that the optimizations I've tried out and am presenting here relate to them and to how to use them effectively.

The starting point for achieving good NVMe performance in VMs is PCI device assignment. PCI device assignment achieves the highest performance; you can get bare-metal performance by doing this.
The reason this is so efficient is that when you take a PCI device and pass it through to the guest, the host is not involved in the critical path of processing I/O requests. This works because the hardware registers of the NVMe drive are memory-mapped into the guest, so accessing them doesn't require the hypervisor to be involved. The interrupts raised by the NVMe drive can be injected directly into a running guest if its vCPU is currently scheduled on a physical CPU, thanks to posted interrupts, a feature available on some CPUs. So again the hypervisor doesn't need an interrupt handler and doesn't need to forward the interrupt into the VM; the hardware does it directly, which is good for latency. And finally, for guest RAM access, the IOMMU (the I/O memory management unit) allows the physical NVMe drive to access guest memory directly, so no software layer in the hypervisor is involved in the data transfer either. That's why it's fast, and it's a great approach for high performance.

It does have some limitations that keep it from being as widely deployed as its performance would suggest. It does not support live migration in most cases, because the device is a black box to the hypervisor: QEMU doesn't know what's going on inside it and can't migrate it, since only the guest has the driver and the state associated with the device. Software features like backups and snapshots that QEMU can offer when you're using disk images are also unavailable, again because the hypervisor is bypassed. Another thing to consider is that exposing PCI devices to your guests may be inconvenient or may have security implications. If you have varying hardware and want to live-migrate guests around, or simply want to upgrade some hardware in your infrastructure, the guests will see those changes: they need driver support and might need reconfiguration if the new device requires a different setup. That can be prohibitive in environments where you don't control the guests, which may be very old, so you don't have the freedom to change the hardware. You might also be concerned about the guest being able to, say, perform a firmware update on the device. And the final issue is simply cost, because you need to dedicate one PCI device to a particular guest; other guests can't share it, since only one driver can own the device at a time. You can use SR-IOV, where some devices can be split into virtual PCI functions, but that also has its limits.

So here's the configuration. I'm not going to go into libvirt domain XML details in this presentation; I just want to show the slides so that if you're watching or reading this later, you have links to the documentation and the keywords to look up in order to apply this configuration.
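As a rough sketch of what the assignment involves (the PCI address 0000:3b:00.0 is a placeholder; the libvirt documentation linked on the slide describes the full hostdev syntax):

    # Detach the NVMe drive from the host driver so vfio-pci can own it
    virsh nodedev-detach pci_0000_3b_00_0
    # In the libvirt domain XML, add a <hostdev mode='subsystem' type='pci'>
    # element referencing 0000:3b:00.0; the equivalent plain QEMU option is
    #   -device vfio-pci,host=0000:3b:00.0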
Okay, so we've said that for starters PCI device assignment is the way to get good performance. But you don't necessarily get the best performance right away with PCI device assignment unless you also consider the NUMA topology of your host. On modern systems the machine is typically divided into multiple domains called NUMA nodes, and processors, memory, and PCI devices are each associated with a particular node. Accesses to resources within the same node are local and cheaper than accesses to resources on other nodes, so for performance it's important to be aware of locality and to keep operations within a NUMA node where possible.

The tools you can use to investigate this are numactl and lstopo, which give you an overview of your machine's topology. There are also performance counters you can use if you suspect your application is making cross-node memory accesses. For more information, check out the talk Dario is giving at KVM Forum this year about topology and NUMA.

In terms of setup, libvirt gives you the ability to pin vCPU threads to physical CPUs. You can also pin QEMU's own threads, the IOThreads and the emulator thread, to physical CPUs. In addition, you can control which host NUMA node portions of guest RAM are allocated from, and you can expose a virtual NUMA topology, a topology that the guest will see. The goal is to align the guest's virtual NUMA topology with the host's topology, so that it reflects what the resources on the host actually look like. That way the guest kernel, as well as NUMA-aware applications running inside the guest, can make good scheduling and allocation decisions, because they have that locality information: they know what is cheap and what is expensive and can choose the best configuration.

Here's a small example. It's trivial, but it demonstrates how quickly we hit limitations and trade-offs when we do NUMA tuning. Say we have a one-vCPU guest that is going to do I/O to an NVMe PCI adapter, and you can see the topology in the diagram on the left side of the slide. Where should the vCPU run, on which node? Since it's going to be using the NVMe drive, let's place it on node 0, where the NVMe drive is local, instead of node 1, where every access would have to cross nodes. So as a starting point we pin the vCPU thread to processor 0. If we're using an IOThread in QEMU (we'll get into later why IOThreads are advantageous), it performs I/O on behalf of the guest, so it also needs to be where the NVMe PCI adapter is; we pin it to processor 1 on the same node. And since we're on node 0, we want to use that node's RAM, so hopefully the guest RAM fits into node 0's memory. That's the setup, but you can already see some of the challenges: what if we wanted a guest with more RAM than is available on node 0? Then we might have to define a virtual NUMA topology and use some memory from node 1 as well, and hope the guest makes smart decisions about what to place into which of the two virtual NUMA nodes.
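Here is a hedged sketch of how that one-vCPU example could be inspected and pinned; the domain name guest1 and the CPU and node numbers follow the diagram, not any particular machine:

    numactl --hardware                  # list NUMA nodes with their CPUs and memory
    lstopo                              # full topology, including PCI device locality
    virsh vcpupin guest1 0 0            # vCPU 0 -> host CPU 0 (node 0, local to the NVMe drive)
    virsh iothreadpin guest1 1 1        # IOThread 1 -> host CPU 1 on the same node
    virsh numatune guest1 --mode strict --nodeset 0   # allocate guest RAM from node 0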
If we add more guests to this picture it becomes even harder, because at that point we may need to make sacrifices: decide whether to share resources like processors across guests, or whether to assign vCPUs to processors that turn out to be on the wrong node. Today, NUMA tuning is something that pays off for performance-critical VMs; doing it manually is worth it, and hopefully in the future we'll see more automatic NUMA tuning support in the management tools that use KVM, so that things are set up automatically and we don't need to tune them by hand. It gets especially hard with many VMs or with live migration, where the situation is dynamic and it's no longer easy to come up with a static pinning that makes sense.

Okay, so we've covered the importance of NUMA, a bit about how to tune it, and where to look. Next up is cpuidle-haltpoll. This goes back to the I/O request life cycle we looked at: the two important mechanisms are submitting requests and completing them. If you have passed through an NVMe PCI device, completions arrive as interrupts. Now, halting a vCPU involves a VM exit, and if there's no further work to do on the host either, the physical CPU may halt too and go into a low-power state. When the NVMe drive completes the request, it fires the interrupt, the CPU comes out of the low-power state, which has a latency cost, and then we can re-enter the vCPU. You can see this becomes a chain of several steps with a latency cost, and we want to avoid that. What the cpuidle-haltpoll driver does is run a busy-wait loop inside the guest: at the point where the vCPU decides it has no more work to do, instead of halting immediately it busy-waits for a while, up to a timeout, to see if something becomes runnable. That way the vCPU is still active when the completion interrupt comes in, so the interrupt can be delivered and the application scheduled again quickly, without going all the way down to a halted physical CPU and back out. That decreases latency.

This mechanism makes sense when you are pinning vCPUs, because while you're polling you're wasting CPU cycles until there's work to do. If you had lots of VMs sharing CPUs, you wouldn't want them all to poll, so this is something to enable for a high-performance or performance-critical VM that has been given a dedicated CPU. Okay, now here's the tuning; this is just the syntax, I won't go into detail.
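As a rough guest-side illustration (paths and parameter names may differ between kernel versions, so treat this as an assumption to verify rather than the exact slide contents):

    # Check that the guest selected the haltpoll cpuidle driver and governor;
    # it can be forced with cpuidle_haltpoll.force=1 on the guest kernel command
    # line if the hypervisor does not advertise the dedicated-CPU hint.
    cat /sys/devices/system/cpu/cpuidle/current_driver     # expect: haltpoll
    cat /sys/devices/system/cpu/cpuidle/current_governor   # expect: haltpoll
    # The busy-wait window is tunable via module parameters:
    cat /sys/module/haltpoll/parameters/guest_halt_poll_ns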
All right, so here is our first graph. These are results for random reads and random writes at queue depth 1 on an NVMe drive. The blue bar, the leftmost one, is the bare-metal result with the NVMe drive. The red bar in the middle is the VFIO result, a virtual machine with the PCI device assigned to it, and you can see it's typically lower, not in every case, but often, and sometimes by a significant amount. The final bar is VFIO with the cpuidle-haltpoll driver enabled, and that one performs very well. Why is it performing better than bare metal? How is that possible?

Well, it's because bare metal isn't polling: bare metal may halt the CPU, which saves power but also adds latency, while the VM stays active, the CPU keeps running, and so it takes those completion interrupts with lower latency. In this benchmark that means it achieves higher performance. We're going to take this a step further with another polling approach, which will also make the bare-metal versus virtual-machine comparison fairer and show us the full picture.

The Linux NVMe driver allows you to allocate queues for specific usage, so it's possible to reserve polling queues. What these poll-mode queues do is this: when an application sets the high-priority flag on a request, the kernel busy-waits for that request to finish by calling the poll function in the NVMe driver, and that function just checks memory to see whether the request has completed yet. So the kernel can do the polling for us; this is Linux block layer I/O polling. It improves completion latency, actually more than cpuidle-haltpoll, because here we're guaranteed to be spinning; it doesn't give up, it keeps polling, at least in its default mode. And it lets us make a fair comparison, because the polling is done by the driver in both the bare-metal and the virtual-machine case. At the bottom of the slide is the syntax in case you want to see how to enable it.
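The host-side setup looks roughly like this (a hedged example, not the exact slide syntax; the queue count and device path are placeholders):

    # Reserve dedicated poll-mode queues in the NVMe driver, either at module
    # load time or with nvme.poll_queues=4 on the kernel command line
    modprobe nvme poll_queues=4
    # Then run the QD1 benchmark with the high-priority (polled) flag set, for
    # example with fio's pvsync2 engine, which issues preadv2() with RWF_HIPRI
    fio --name=qd1-polled --filename=/dev/nvme0n1 --direct=1 \
        --rw=randread --bs=4k --iodepth=1 --ioengine=pvsync2 --hipri \
        --runtime=30 --time_based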
Let's look at the performance numbers. As you can see on this graph, when we enable that feature, first of all the absolute number of IOPS jumps: the maximum we achieved before was around 80K, and now that we're polling all the time we get over 120K IOPS for queue depth 1, 4-kilobyte requests, so enabling I/O polling is a significant performance improvement. The other thing we see is that the gap between bare metal and VMs has closed; at this point they're very similar, and you can say that with PCI device assignment you can get bare-metal performance.

That's great, but what about situations where you cannot use PCI device assignment? I gave that big disclaimer and covered all the cons and the reasons you sometimes can't use it. In that case you can use virtio-blk. virtio-blk is an emulated storage controller, a paravirtualized device designed specifically for virtualization, which has been optimized and has evolved over the years, so it's a good storage controller to choose if you want good performance with KVM. There are two settings I want to discuss here, because they're not enabled by default yet and they do boost performance, so they're worth considering.

The first is multi-queue. Although the feature has been there for years, it hasn't really been used, and in QEMU 5.2 that's going to change: the number of queues will default to the number of vCPUs, in other words multi-queue will be enabled by default on virtio-blk and also on virtio-scsi devices. The reason this improves performance is that giving every vCPU a dedicated queue means completion interrupts can be directed at the CPU that submitted the I/O, which is where the task is scheduled and where we want to process the completion. We don't want the interrupt to land on some other vCPU, which then has to say "okay, I'd better wake up that task that's ready to run on another CPU" and send it a message; we don't want inter-processor interrupts. By giving every vCPU its own queue we eliminate them and improve completion latency. That's why multi-queue helps. In addition, the Linux block layer also has multi-queue support, and there are code paths in the block layer that take advantage of it; when the driver only allocates one queue we don't take those code paths. The most obvious user-visible effect is that the default I/O scheduler is different for multi-queue block drivers that expose more than one queue, and that affects latency too. So it's best to enable multi-queue.

Next, there's the virtio 1.1 packed virtqueue layout. This is a new memory layout for the queues that virtio devices use, and it's more efficient; in the benchmarks I've run it improves virtio-blk performance, so it's also worth taking into account. It's not a huge win, but it is a small one, and when devices have such low latency, every small win is worth taking. Okay, now here's the syntax.
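For illustration, here is roughly what those two settings look like as QEMU device properties (a sketch assuming a 4-vCPU guest and a placeholder disk image; libvirt exposes equivalent knobs, e.g. the queues and packed attributes on the disk's <driver> element):

    qemu-system-x86_64 -machine q35,accel=kvm -cpu host -smp 4 -m 4G \
        -blockdev driver=raw,node-name=disk0,file.driver=file,file.filename=test.img,cache.direct=on \
        -device virtio-blk-pci,drive=disk0,num-queues=4,packed=on
    # num-queues=4: one queue per vCPU (the QEMU 5.2 default)
    # packed=on:    virtio 1.1 packed virtqueue layout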
We'll move on. What I'm going to do is walk through a few more configuration topics, then through some optimizations I've implemented, some prototypes and new work, and finally at the end we'll look at a graph that stacks them all up, so we can see how combining them incrementally increases IOPS significantly.

So here we are: IOThreads are the next feature, and if you're using virtio-blk they are critical. IOThreads are a way of defining threads and assigning devices to them, which gives users control over which physical CPU device emulation and I/O will run on. It's an N-to-1 mapping: you can assign multiple devices to a single IOThread, and you can define multiple IOThreads as well, so it gives you a lot of flexibility. This is great because it allows us to reflect the NUMA topology of the system. It's also good for scalability: when you have VMs with many devices doing heavy I/O, you may want to put them into separate IOThreads running on separate CPUs, so that each has enough resources and they don't interfere with each other. The final thing to mention is that when the IOThreads feature is enabled, the device can take advantage of an adaptive polling event loop in QEMU. It's a different code path from QEMU's main loop and it has lower latency, because it's able to poll instead of always yielding when waiting for file descriptors to become ready. That's another reason why it's faster, and we'll see the numbers later when I show the graphs. Here's the configuration: defining IOThreads, pinning them on the host, and assigning devices to IOThreads in libvirt XML syntax.

Okay, now we're going to move on to things that aren't as standard yet, that aren't as widely known or widely used. There has been a userspace NVMe driver in QEMU for some time now; it's been there for a long time but hasn't been used very widely. It's somewhat similar to PCI device assignment in that the NVMe drive is dedicated to a particular VM, but instead of passing the physical device through into the guest, the guest still sees an emulated virtual block device, and the NVMe driver lives in QEMU, in userspace on the host. What this means is that we get the performance benefits of a userspace driver that bypasses the kernel, so no system calls are necessary and there is a shorter code path completely under QEMU's control, while still offering QEMU's block layer features: live migration, snapshots, and so on; even image formats work on top of the userspace driver. So this addresses some of the limitations of PCI device passthrough. Right now improvements are being made upstream and activity around this driver has started up again: non-x86 architecture support is being added, multi-queue support is being added, and more. This is the syntax for configuring it.
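Putting the last two pieces together, a hedged QEMU command-line sketch (the PCI address 0000:3b:00.0 is a placeholder, and the drive must first be unbound from the host nvme driver and bound to vfio-pci; libvirt has equivalent iothread and <disk type='nvme'> syntax):

    qemu-system-x86_64 -machine q35,accel=kvm -cpu host -smp 4 -m 4G \
        -object iothread,id=iothread0 \
        -blockdev driver=nvme,node-name=nvme0,device=0000:3b:00.0,namespace=1 \
        -device virtio-blk-pci,drive=nvme0,num-queues=4,iothread=iothread0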
One thing that is missing from the NVMe userspace driver in QEMU, an optimization I wanted to try out, is polled queues. In NVMe, when you create a queue you assign an interrupt to it, the completion queue has an interrupt, but you can turn that off completely when creating the queue. With a queue that has no interrupt you can simply poll for completions: just look at memory and see when requests become ready. Doing so is an alternative to interrupts; effectively it's like switching from an eventfd-style mechanism to a polling mechanism, and the hope is that it reduces latency. One interesting thing about doing this was that it requires changes to QEMU's event loop itself, because QEMU's event loop is fundamentally designed around file descriptor monitoring. Adaptive polling has been added to it, but the whole premise is that we only poll for short amounts of time. With a poll-mode queue in the NVMe driver we need to poll all the time, and if we poll all the time we starve the file descriptors, because we're spinning in our busy loop and never looking at them. So I have some patches that I'm going to send upstream that extend the event loop; in fact some of the io_uring work has already gone upstream, and I found that using io_uring we can do this efficiently and integrate file descriptor monitoring into the busy-wait loop without system calls. We'll look at the numbers for that at the end.

The next thing I want to share is an idea that, in one form or another, has been around for a long time. In 2014 we introduced coroutines into the core block layer and started using them for request processing. That was very useful because we needed them for things like I/O throttling and other operations that were getting really complex and difficult to write in an asynchronous style; there's also request queuing and so on in QEMU's core block layer now. But even back then there were concerns that this overhead might become a problem, and there have been discussions in the past about whether we can optimize it away. The thing is, when you're not using certain QEMU features, like disk image formats, I/O throttling, or storage migration, you don't really need the full request processing while those features are inactive; all that machinery is only needed to support them. So wouldn't it be great if there was a way to bypass it when it's not needed? As a prototype I've tried implementing this: an AIO fast path. It introduces an AIO interface to the block drivers in QEMU, because currently they have a coroutine interface that more or less assumes you're in the full request processing mode. The fast path allows the virtio-blk emulation to call the NVMe userspace driver with relatively little overhead, skipping the full request processing step. We'll see those numbers.

The next thing I want to mention: when we looked at PCI device assignment we saw how beneficial Linux I/O polling is; polling for completions in the NVMe driver reduces latency and got us the highest IOPS we've achieved so far, the 120K IOPS on bare metal. The guest driver for virtio-blk does not implement this interface today, but it is a driver interface, in fact just one function that needs to be implemented, so I have written a prototype for that too. It only supports queue depth 1, because that's what I was benchmarking; it's not a full implementation, it's a prototype to check what kind of effect it has on performance, and the link to the git branches is on the slide. So let's look at that.
Here we are: this is the final graph, applying all of these optimizations incrementally on top of each other and seeing how far they get us. On the left-hand side is the starting position we want to compare against: bare metal without I/O polling, at 78K IOPS. When we configure QEMU in a standard, non-optimized way, with aio=native and no IOThread, we start at 21K IOPS. That's extremely low; there's clearly a lot of overhead. I wouldn't necessarily say this is what most QEMU users experience today, because IOThreads are recommended and more and more of the management tools built on top of KVM and QEMU enable them by default, so hopefully most users today are around the second blue bar, the IOThread bar. With an IOThread we're at around 46K IOPS; there's still significant overhead.

Next we can enable virtio-blk multi-queue. Doing it at this stage turned out not to be very instructive: it slightly improved performance, but not significantly in this graph, partly because I was already pinning everything both inside the guest and on the host, so the setup was already optimal, adding queues didn't help, and the I/O scheduler was already 'none'. But multi-queue is still an essential part of making things scale and making things work, so keep it. Next we introduce the userspace NVMe driver in QEMU, and that boosts performance: we jump almost 10K, from 46K to 55K IOPS, getting us closer to bare metal. Then, what happens when we try the virtio-blk guest driver's I/O polling prototype? Adding that brings us above the initial bare-metal number we collected without I/O polling on the host side. Now that we're polling and spending more CPU cycles, we're able to gain some ground and reduce latency. This looks good, but it's also unfair, because now we should really be comparing against bare metal that is also using I/O polling. So let's do that: on the right-hand side of the graph you see the gray bar at 120K IOPS, which is bare metal with NVMe I/O polling. We're still behind, we still have overhead, but our absolute IOPS has increased nicely, and we're not done yet. Next we can try the NVMe userspace driver's polled queues, where we poll in QEMU in the IOThread, and this brings us up to 94K IOPS, definitely a worthwhile improvement. And finally the AIO fast path I just mentioned, the last optimization and prototype I wanted to share, which bypasses the full request processing in QEMU, gains another 10K, closing the gap to bare metal further. So that's the status, and I'm working on upstreaming these optimizations so they can be used.

But this whole virtio-blk plus userspace NVMe driver approach still leaves us with a limitation similar to PCI device assignment: we still need one device per guest. Luckily, this year a new tool has been added to QEMU called qemu-storage-daemon, a separate program that contains QEMU's storage-related functionality. In addition, a vhost-user-blk server has also been added to QEMU, which is very convenient: it means we can host the userspace NVMe driver inside the storage daemon, and that one daemon can serve multiple guests. So we now have the ability to share a single PCI device while using the userspace driver, which solves that limitation. It's already available in qemu.git, but the code path is different from the virtio-blk results I presented, so those optimizations don't apply yet; some of them still need to be ported over. Over time we can expect this to equal the results I just showed, and then it will be an excellent option if you need to share drives. On top of this, qemu-storage-daemon offers a lot of other functionality: NBD exports, which would allow you to also attach those drives on the host or make them available to applications; FUSE exports are in development; and the block jobs features are available. So qemu-storage-daemon will be a nice utility, and I think we're going to see more use of it in the future.
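As a hedged sketch of that sharing setup (option syntax may differ between QEMU versions, and the PCI address and socket path are placeholders), one daemon owns the drive through the userspace NVMe driver and exports it over vhost-user-blk; in practice you would create one export per guest, for example for separate images or partitions stored on the drive:

    qemu-storage-daemon \
        --blockdev driver=nvme,node-name=nvme0,device=0000:3b:00.0,namespace=1 \
        --export type=vhost-user-blk,id=export0,node-name=nvme0,addr.type=unix,addr.path=/tmp/vhost-user-blk0.sock,writable=on
    # Each guest then connects with something like:
    #   -chardev socket,id=char0,path=/tmp/vhost-user-blk0.sock
    #   -device vhost-user-blk-pci,chardev=char0
    # (vhost-user also requires guest RAM to be shared, e.g. via memory-backend-memfd,share=on)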
Now, if you saw the qemu-storage-daemon slide you might have thought: wait a second, this is a familiar architecture, we know this approach. Yes, it's very similar to SPDK, the Storage Performance Development Kit, which also uses a polling architecture and has been around for years. In fact the vhost-user-blk interface was created in order to connect QEMU and SPDK, so we're very thankful that it already exists and we can reuse it. I wanted to mention SPDK because it has obviously been an influence, and it's a great project to check out. If you want to find out more about what's going on in improving the general, non-NVMe case, please check out Stefano Garzarella's talk this year at KVM Forum; he'll be going into what he's done with io_uring and some of the new things he's working on.

Finally, future directions. In the short term it's time to get these prototypes into a polished state and get them upstream, which will allow us to reach the performance I've shown you on these slides. In the longer term, I think what's clear is that because PCI device assignment gives us bare-metal performance, it's important to find more ways to pass through devices: when the hypervisor is not involved, when there's no software path, that's how we get the best performance.

So, to summarize what we've looked at: there's the basic configuration and tuning that is essential, the NUMA, cpuidle-haltpoll, and IOThread setup, which gives you a baseline performance starting point. Then you have the big choice: do you want to use PCI device assignment? That gives you minimal overhead and is the best way to go if performance is critical, but you need to keep its limitations in mind. If you decide you can't use PCI device assignment, you can use virtio-blk with the userspace NVMe driver, which will boost performance, and qemu-storage-daemon now allows sharing userspace NVMe drives among multiple guests. I've also published the Ansible playbooks that I used to collect the data; if you want to look at the specifics of the benchmarks, there's a URL on this slide. Thank you very much.