Hi everyone, and welcome to this talk on NVMe emulation performance optimizations. I am Jinhao Fan from the Institute of Computing Technology at the Chinese Academy of Sciences. Over the summer, I have been working on the emulated NVMe device as a Google Summer of Code intern with the QEMU project. At school, I am a second-year master's student doing research on storage systems. I use the QEMU NVMe device a lot during my research, so I am very interested in how such a system really works and how it is built by open source developers. I applied for GSoC and got accepted to work on it. I am mentored by Klaus and Keith, who are the maintainers of the NVMe device, but I also received valuable input from Stefan and the QEMU community in general. The project was originally proposed in 2018 by Huaicheng Li and Paolo Bonzini, but was not picked up until this year.

Currently, the emulated NVMe device is mainly used by developers and researchers to prototype new NVMe features and to experiment with drivers and systems interacting with NVMe devices. The device is PCIe based and supports all mandatory features of NVMe version 1.4. While the device model does a good job as a research and test platform, its performance is surprisingly low compared to the virtio-blk device. So, basically, the goal of our work is to make QEMU NVMe's performance at least comparable to virtio-blk.

The optimizations to implement were already pretty well defined by the project proposal. First, reduce the number of memory-mapped writes causing VM exits by adding support for so-called shadow doorbell buffers. Second, reduce the impact of the remaining MMIOs by making them cheaper using the ioeventfd mechanism. Third, run the emulation in a dedicated thread using an IOThread. Finally, introduce polling of the submission queues to reduce latency. Some work had already been done on the first two steps, but the third and final steps had, to our knowledge, not been experimented with before. Like all successful software projects, some surprises showed up, most notably that IOThread support requires a thread-safe interrupt delivery mechanism based on eventfd.

To understand why performance suffers in the current NVMe device, a little background on NVMe and its implementation in QEMU is required. Like virtio-blk, NVMe uses pairs of lock-free queues for command submissions and completions. Each queue is described by two pointers, the head and the tail. The producer updates the tail pointer, and the consumer updates the head pointer. While the device can notify the host about new entries in the completion queue through interrupts, the host relies on writing to a so-called doorbell to inform the device that it has produced entries in the submission queue or consumed entries in the completion queue. In the context of PCI devices, a doorbell register is the common name for a memory-mapped register located on the device; naturally, we call the action of writing the register "ringing the doorbell". In other words, to execute a single command, the host must ring the doorbell twice: once to inform the device that the command has been placed in the queue, and once to inform the device that the completion entry has been successfully processed. The NVMe protocol allows the number of doorbell writes to be reduced significantly by the host itself: the host can queue several commands before ringing the doorbell, and it can process several completion entries before writing the completion doorbell.
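To make that cost concrete, here is a minimal, self-contained sketch, not QEMU or driver code, with illustrative names, of why a single command involves two doorbell writes, each of which is an MMIO write and therefore a VM exit under trap-and-emulate:

```c
/* Illustrative sketch: one NVMe command costs two doorbell writes.
 * The doorbell "registers" are plain variables here; on real hardware they
 * are memory-mapped, and each store below would trap to the hypervisor. */
#include <stdint.h>
#include <stdio.h>

#define QUEUE_SIZE 64

struct queue {
    uint16_t head;       /* consumer index */
    uint16_t tail;       /* producer index */
    uint32_t doorbell;   /* stand-in for the MMIO doorbell register */
};

static void ring_doorbell(struct queue *q, uint32_t value)
{
    q->doorbell = value;                     /* on real hw: MMIO write -> VM exit */
    printf("doorbell write: %u\n", (unsigned)value);
}

int main(void)
{
    struct queue sq = {0}, cq = {0};

    /* 1. Host places a command in the submission queue ... */
    sq.tail = (uint16_t)((sq.tail + 1) % QUEUE_SIZE);
    /* ... and rings the SQ tail doorbell so the device notices it. */
    ring_doorbell(&sq, sq.tail);

    /* 2. Device posts a completion; host consumes it ... */
    cq.head = (uint16_t)((cq.head + 1) % QUEUE_SIZE);
    /* ... and rings the CQ head doorbell to release the entry. */
    ring_doorbell(&cq, cq.head);

    return 0;   /* two MMIO writes for a single command, unless batched */
}
```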
However, the device has no control over this, and it is entirely up to the device driver or application to do it. If the application is latency sensitive, it may not be interested in delaying the execution of commands and will eagerly ring the doorbell on each submitted command. In any case, each of these memory-mapped writes is expensive under virtualization, because the NVMe device uses the traditional trap-and-emulate mechanism to process commands. That is, the memory access causes a VM exit that suspends the guest while the MMIO handler is executing. While the NVMe device defers processing of queue entries to the main loop thread, so that it can leave the MMIO handler as quickly as possible, these VM exits are still the prime suspect for the low performance.

To reduce the number of VM exits, shadow doorbell buffers, a paravirtualization feature, were introduced in NVMe version 1.3. This feature is strikingly similar to the event index feature in virtio. Two additional buffers are introduced, namely the shadow doorbell buffer, which is managed by the host, and the event index buffer, which is updated by the emulated controller. When these buffers are configured by the host, the host is required to update the buffers instead of the memory-mapped doorbell registers. Luckily, writing these buffers does not cause VM exits. With shadow doorbells, the host only writes the doorbell registers if the shadow doorbell value is straddling or falling behind the event index buffer value (sketched below). The shadow doorbell mechanism results in a very decent performance gain on its own.

However, even with the shadow doorbell buffers, there are still some MMIOs. If updating the shadow doorbell value causes it to become greater than the event index, this indicates that the device is in an idle state and not currently consuming queue entries. The host is then required to kick, or wake up, the device by issuing a normal memory-mapped doorbell register write. For these MMIOs, we make use of QEMU's ioeventfd mechanism. With ioeventfd, the VM exit handler just writes to an eventfd and resumes running guest code without returning control to QEMU. This allows the VM exits caused by MMIOs to be handled in a lightweight fashion. Importantly, ioeventfd cannot be used as a general replacement for doorbell register writes, because it only signals that a memory location was written; it does not tell us what value was written. However, Paolo noticed that if shadow doorbells are used, the MMIO handler does not need the value written to the register anyway, since it is already written to the shadow doorbell buffer. So instead, to know the exact head and tail doorbell values, we make use of the shadow doorbell feature and check the values in the shadow doorbell buffer when the ioeventfd is signalled.

With ioeventfd, we are already close to virtio-blk, but there is potentially more to be gained. We want to run the IO emulation in a dedicated thread without interference from other devices. We also want to poll the submission queues for new entries to bring command submission latency to a minimum. To give you some background on why a dedicated thread is important, let's use an example with multiple devices. In the current architecture, the devices are all emulated in QEMU's main loop thread, which makes it a bottleneck. However, with QEMU's IOThread feature, we are able to use one thread per device, without interference between the devices. But to enable emulation in a separate IOThread, we first need to solve some thread safety issues.
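The "straddling" check mentioned above can be illustrated with a wrap-around-safe comparison, the same idea virtio's event index uses. The following is a sketch under that assumption, with illustrative names, not the exact code from the Linux driver or QEMU:

```c
/* Sketch of the check a shadow-doorbell-aware host performs before deciding
 * whether an MMIO doorbell write (a "kick") is still needed. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Returns true when the update from 'old_val' to 'new_val' passes over
 * (straddles) the event index the device last advertised, i.e. the device
 * may have gone idle and must be woken up with a real doorbell write. */
static bool need_mmio_kick(uint16_t event_idx, uint16_t new_val, uint16_t old_val)
{
    return (uint16_t)(new_val - event_idx - 1) < (uint16_t)(new_val - old_val);
}

int main(void)
{
    /* Device is still busy: event index is ahead of the new tail -> no kick. */
    printf("%d\n", need_mmio_kick(/*event_idx=*/5, /*new=*/4, /*old=*/3)); /* 0 */

    /* Device went idle at event index 3 and our update passes it -> kick. */
    printf("%d\n", need_mmio_kick(/*event_idx=*/3, /*new=*/4, /*old=*/3)); /* 1 */
    return 0;
}
```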
QEMU's default interrupt injection emulation is unfortunately not thread safe. This was not a problem when all devices were emulated in the main loop thread, but it does not work for an IOThread. Solving this required a deep dive into how virtio-blk deals with it, and we learned about two approaches to get around the problem; both of them are eventfd based. First, there is a feature in KVM called irqfd that supports MSI-X interrupt assertion via an eventfd. This approach bypasses QEMU's interrupt emulation, and we use it when irqfd is available. When irqfd is not available, we register a normal EventNotifier to handle the actual work of interrupt assertion and de-assertion in the main loop thread. irqfd does not bring a substantial change in performance; in fact, Stefan told us that although irqfd bypasses QEMU's interrupt emulation, it is not necessarily faster. However, it does solve the thread safety issues. The general eventfd pattern is sketched after this section.

With eventfd-based interrupts, we are finally able to add IOThread support. We do not emulate all commands in the IOThread. We found that admin commands such as queue creation and doorbell buffer configuration usually involve memory region transactions, which require holding the big QEMU lock. Therefore, we keep the admin queue emulation in the main loop thread. Also, since the admin commands are not in the fast path, there is nothing to be gained there anyway. Luckily, we can easily change the emulation thread by changing the AioContext during EventNotifier, timer, and bottom half setup. Just remember to hook up the correct AioContext when you use these functions. With the IOThread, we saw some performance improvements at low queue depth; we suspect this is because the IOThread has a more lightweight event loop. But the most important thing is that we are able to scale linearly with the number of devices.

The next and final step is polling support for low-latency NVMe emulation. With polling, we can start processing immediately when a command becomes available. We can also eliminate the need for MMIO, because we are proactively checking the doorbell. QEMU already has good support for AioContext polling; for NVMe, we only need to provide our own poller functions. The NVMe submission queue poller polls the submission queue tail shadow doorbells to check for new submission queue entries.

With all these optimizations, we achieved our goal of bringing QEMU NVMe's performance on par with virtio-blk. Under some settings, we are even better than virtio-blk. Since the mechanisms we used are similar, we suspect this could be because the kernel NVMe driver is better optimized than that of virtio-blk. 700 lines of code were changed to reach a peak gain of 11x in IOPS. We were surprised to find that polling results in lower IOPS at high queue depth; we are still investigating the cause, and we would appreciate it if anyone in the audience could help us figure out the reason.

Next are some hard lessons we learned. First, we found that some important APIs are lacking a bit in introductory documentation. For example, for thread-safe interrupts you need an eventfd-based approach, and for irqfd-based interrupts to work correctly you need to set up MSI-X vector masking and unmasking notifiers. We had to dig into the code and ask the maintainers why they originally did things this way. Luckily, the virtio-blk source code is extremely well written, readable, and relatively easy to follow. The virtio-blk source code used to be the only documentation for these mechanisms; now we also have NVMe, which, at least subjectively, might be a bit simpler.
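The eventfd pattern that underpins both ioeventfd and the thread-safe interrupt path can be shown with plain Linux primitives. This is a generic sketch, not QEMU code: one thread merely signals an eventfd, and a separate event-loop thread wakes up and does the actual work on a single, known thread:

```c
/* Generic Linux sketch of eventfd-based signalling between threads.
 * The signalling side does only a cheap write; the event loop owns the
 * heavyweight handling, which is the core of the thread-safety argument. */
#include <sys/eventfd.h>
#include <pthread.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static int efd;

static void *event_loop(void *arg)
{
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    uint64_t count = 0;

    /* Wait until someone signals the eventfd, then handle the work here. */
    poll(&pfd, 1, -1);
    if (read(efd, &count, sizeof(count)) == sizeof(count)) {
        printf("event loop: handling %llu queued notification(s)\n",
               (unsigned long long)count);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    uint64_t one = 1;

    efd = eventfd(0, 0);
    pthread_create(&tid, NULL, event_loop, NULL);

    /* Producer side (think: MMIO handler or interrupt requester):
     * signalling is just an 8-byte write to the eventfd. */
    if (write(efd, &one, sizeof(one)) != sizeof(one)) {
        perror("write");
    }

    pthread_join(tid, NULL);
    close(efd);
    return 0;
}
```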
But it would be nice to have these conventions documented, and we will attempt to contribute documentation or some introductory blog posts on this subject. By the way, if you want to incorporate an IOThread into your device, here are the things you might want to check. First, make sure your MMIO handlers run on the right thread; this means you need to attach the correct AioContext for each event notifier, timer, and bottom half. Second, make sure your interrupt delivery is thread safe; the de facto standard approach is to use an eventfd-based notifier to schedule all the interrupt assertions and de-assertions on the same thread.

We also found an interesting violation of the NVMe spec in the wild. The NVMe spec states that shadow doorbell buffers should be used on all queues, including the admin queue. However, none of the existing drivers, like the Linux kernel and SPDK, and none of the emulated devices, like the SPDK vfio-user target, use the shadow doorbell for the admin queue. If a device expects the driver to update the shadow doorbell value and it does not happen: chaos. This behavior simply cannot be fixed in drivers now; there is no way of telling whether the device is spec compliant or not. If the driver implements the spec-compliant behavior, it might not issue the memory-mapped write, causing non-compliant devices to not pick up the commands. And if we, as the device, insist on following the spec, then we break the non-compliant drivers. What to do? The final hack, brought up by Keith, who is a co-maintainer of both the kernel NVMe driver and QEMU NVMe, was to override the shadow doorbell buffer value with the doorbell register value in the MMIO handler (a sketch of this follows below). This is safe because we are in VM exit context. In the end, we managed to work with both compliant and non-compliant drivers. However, drivers are, unfortunately, still not spec compliant, and that is probably something we have to live with in the long term.

The performance optimizations presented in this talk potentially enable the NVMe device to become a viable alternative to virtio-blk in cloud deployments. But since the original aim of QEMU NVMe was development and testing, we still need more features to support the cloud use case. For instance, the device lacks live migration support, and a thorough security audit is definitely required to even consider deploying this at scale. Another point is more versatile IOThread support. Currently, we use one IOThread per emulated controller, but this is probably not enough for emulating high-end SSDs. Therefore, we may consider adding options for an IOThread per namespace, or even per submission queue. However, since submission queues are not tied to individual namespaces and their underlying block devices, this is not trivial.

The last page lists the patches for all of this work. Shadow doorbell and ioeventfd support is already included in QEMU 7.1, and irqfd, IOThread, and polling support is currently under review. Hopefully, the code itself and the discussions can help other system developers learn how to use these techniques. That's all for my talk. Thank you.
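For reference, here is a minimal sketch of the compatibility override described above; the data structure and function names are illustrative assumptions, not the actual QEMU implementation:

```c
/* Sketch: reconcile a non-compliant driver's doorbell register write with
 * the shadow doorbell buffer inside the MMIO write handler. */
#include <stdint.h>
#include <stdio.h>

struct sq_state {
    uint32_t *shadow_db;   /* host-managed shadow doorbell entry            */
    uint32_t  tail;        /* device-side view of the submission queue tail */
};

/* Called from the doorbell MMIO write handler, i.e. in VM-exit context,
 * so the guest vCPU is paused and this update cannot race with it. */
static void sq_doorbell_write(struct sq_state *sq, uint32_t mmio_value)
{
    if (sq->shadow_db && *sq->shadow_db != mmio_value) {
        /* Non-compliant driver: it rang the register but never updated the
         * shadow buffer.  Overwrite the shadow value with the authoritative
         * register value so the rest of the emulation stays consistent. */
        *sq->shadow_db = mmio_value;
    }
    sq->tail = mmio_value;
}

int main(void)
{
    uint32_t shadow = 0;                               /* stale shadow value */
    struct sq_state sq = { .shadow_db = &shadow, .tail = 0 };

    sq_doorbell_write(&sq, 3);          /* driver wrote only the register */
    printf("shadow=%u tail=%u\n", (unsigned)shadow, (unsigned)sq.tail);
    return 0;
}
```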