So, hello everyone, and welcome to my talk about the QEMU emulated NVMe device. My name is Klaus and I'm a software engineer with Samsung Electronics. This is my first time speaking at KVM Forum and I am very excited to be here.

This talk is in the context of NVMe. NVMe is Non-Volatile Memory Express, a storage interface designed to exploit the low latency and inherent parallelism of NAND flash memory. To understand some of the things we'll cover in this talk, there is some core terminology we should go through. In NVMe you have the concept of a controller, which is a PCI Express function that acts as the interface between the host that the device is connected to and an NVM subsystem. Namespaces are quantities of non-volatile memory that are accessed independently from other namespaces, typically via logical block addresses. An NVM subsystem combines these things into a set of one or more controllers, zero or more namespaces, and one or more ports, a port here being, say, a PCIe port.

A little bit of history on the NVMe device. It was first contributed by Keith Busch, at that time working for Intel, in 2013. The implementation he did held up for several years. It did the job, and people could develop and test their drivers in an emulated setting before, or when, hardware was of limited availability.

In 2018 and 2019, I became involved in the open-channel SSD ecosystem, and I was using QEMU extensively on a day-to-day basis. So I started to add a bunch of missing mandatory features to the device, and one of the things I worked on a lot was adding support for multiple namespaces, to support emulated open-channel SSDs. Because I started to contribute a lot from 2019 and up through the next couple of years, I became a co-maintainer in mid-2020.

One of the things I added was bumping the device to implement NVMe v1.4, and I added the multiple namespace support. In 2020 we also got support for zoned namespaces in the NVMe device. We added the subsystem and namespace sharing support, and we added metadata and end-to-end data protection, just to mention a few major features that we've gotten recently.

So as you can see, things actually moved pretty fast. And as we shall also see, sometimes they moved a bit too fast, so some mistakes were made, and we'll be talking about one of them specifically today. They stemmed from not really knowing about best practices, or how to effectively use the available APIs in QEMU. And there are a lot of them, so it might be difficult to get a broad overview of how to wire up devices and how to make new device models. It also stemmed from, especially for me, not fully grasping qdev and its relationship with the QEMU Object Model.

Speaking of these APIs, one of the first issues I had when I really started doing QEMU work was that I had trouble wrapping my head around qdev versus the QEMU Object Model (QOM). In the beginning I thought they were two different things, precisely because it's always phrased as a "versus". What I finally learned was that, no, qdev builds on QOM. There is documentation available, but it might just be a lot for a new device implementer, a developer coming right into QEMU and wanting to make a new device; it might be difficult to digest all the documentation available. The header files are extensively documented, and that alone is a lot of material.
So qdev builds on the QEMU Object Model, and it provides an API that is tuned for setting up user-created devices. The emphasis on devices matters here, because devices are the things that we consider to be instantiated on the command line with -device. While it provides this nice API and a lot of niceties for configuring these devices, it also imposes a very strict structure on how to wire up your device inside the QEMU machine that you're emulating. This is a tree where the levels alternate: a bus, which can have multiple devices attached, and a device, which can create multiple child buses. So as you can see, you always have a one-parent relationship in this tree.

When the NVMe device was first introduced by Keith back in 2013, it only had single namespace support. You would have the NVMe device, and it would have a drive property on it, along with the common block device properties like logical block size, discard support, things like that.

What I wanted to do was add support for multiple namespaces, so that we could have several block devices attached, with individual parameters. We wanted different block sizes, say 512 bytes for one namespace and 4K for another. When I first posted my patches, I got a lot of helpful comments from the community, so thank you for that. What we ended up with was adding a new device, the nvme-ns device, and adding a bus to the nvme device, such that the namespaces could attach to this bus implicitly or explicitly. Then you have this relationship between the bus, the controller device, and the namespaces, which is also how the SCSI subsystem does it today.

It was very nice, because it fits into the qdev tree, and introspection just worked: by typing info qtree in the monitor you'd get this nice tree, with the namespaces underneath the controllers as children of the controller, with all the parameters. Everything just worked and was really nice. When the device was removed, say hot-unplugged, the children of the device would be automatically unrealized, so everything cleaned up by itself. It all made sense when we introduced this, and this automatic cleanup should be a good thing. But when we added more advanced features, like subsystems and shared namespace functionality, it became a problem.

If we look at the plumbing and how it was done prior to QEMU 6.0, we had something like this: you would have the main system bus, with the PCI Express bus on it. Then you would have the NVMe PCI device attached to that PCI bus, and it had its own bus where the namespace was attached. Looks nice, and just the way it should be.

But then we wanted to introduce shared namespaces. A shared namespace is a namespace that can be accessed concurrently by two or more controllers within the same subsystem. It's very useful for testing advanced drivers and multipath I/O, so we really wanted to add this to the device model. It required adding the concept of an NVM subsystem to the device. And again, because I, and apparently anyone else interested in the subsystem at the time, didn't really know any better, we ended up merging this as a bus-less device. It was unrooted, not attached to the system bus; it was an anonymous, unattached device.
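Before going further, to make that bus-based wiring concrete: a two-namespace setup with different block sizes would look roughly like this on the command line (a sketch from memory; exact property names can vary between QEMU versions):

    qemu-system-x86_64 ... \
        -drive id=nvm0,file=nvm0.img,if=none,format=raw \
        -drive id=nvm1,file=nvm1.img,if=none,format=raw \
        -device nvme,id=nvme0,serial=deadbeef \
        -device nvme-ns,drive=nvm0,bus=nvme0,logical_block_size=512,physical_block_size=512 \
        -device nvme-ns,drive=nvm1,bus=nvme0,logical_block_size=4096,physical_block_size=4096

The bus=nvme0 part is the explicit attachment; leaving it out lets qdev pick a compatible controller bus implicitly, as mentioned above.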
Then we would add a subsystem link parameter on the controller, to plumb it up with this subsystem. It followed the design of how you create the nvme-ns namespace device, and it felt like that was the way you create devices in QEMU. So it looked something like this: you would have these controllers, and one of the controllers would have the namespaces attached. Through the NVMe controller device, the namespaces would know about the subsystem, via the link parameter on the controller device. Through that they would register with the subsystem, and through that, they would basically be attached to all the controllers in the subsystem.

Now, as we can see, the namespace devices can only be attached to one device, or one bus in qdev terms. So the problem is: what happens when we remove this device? Well, the namespaces would be automatically unrealized, and we would end up with the namespaces going away, while possibly still having references from the other controller, which still thought those namespaces were attached to it. That's a big problem, so we had to get a fix in for this.

And we did fix it: we basically added another NVMe bus on the subsystem, to which the namespaces were attached directly. This worked, but it was not really nice, because due to backward compatibility issues we had to keep the namespace devices attaching initially to the NVMe controller device, and then, if a subsystem was attached, re-parent those namespace devices to the subsystem bus instead.

If we could fix this properly, one of the things we could do is implement a custom hotplug handler, and maybe fail the namespaces over when one of the controllers is removed. Or, if we started from scratch, we could have made the subsystem device a system bus device that exposed this NVMe bus. Then we could keep the namespace devices as they were, and they would attach to that bus instead of the controller-created bus, which we would remove. It could have looked something like this: the subsystem sits on the system bus, the namespaces attach directly to its bus, and the NVMe controllers have references to that subsystem instead.

Now, there are some potential issues. Again, there are some backward compatibility issues that we would have to solve. It would probably require implementing new devices and deprecating the old ones. That's not a bad thing; it can be solved. But we would conceptually think of the NVMe controllers as children of the subsystem, and due to the way QBus works, we wouldn't be able to express that, because the subsystem device is not a PCI device. It shouldn't be on the PCI bus, and the controllers need to be on the PCI bus somewhere in the tree. It also retains this idea that subsystems and namespaces are actually devices, which is not so nice.

But this is how SCSI does it: SCSI separates the controller from the drives and uses a QBus to wire them up. So something here must be the right way to do it. It feels right, or it smells right at least. But this one-parent restriction has apparently also impeded the addition of multipath I/O functionality in SCSI. I talked to Hannes Reinecke about this problem in NVMe, and he had run into the same restriction of the qdev tree, which gave him a lot of issues with implementing this in SCSI.
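For reference, the subsystem-based wiring we ended up with looks roughly like this (a sketch; parameter availability and defaults vary between QEMU versions):

    qemu-system-x86_64 ... \
        -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0 \
        -device nvme,serial=deadbeef,subsys=nvme-subsys-0 \
        -device nvme,serial=deadc0de,subsys=nvme-subsys-0 \
        -drive id=nvm0,file=nvm0.img,if=none,format=raw \
        -device nvme-ns,drive=nvm0,nsid=1,shared=on

Here the nvme-ns device is re-parented to the subsystem bus behind the scenes, and the namespace becomes visible to both controllers.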
And it also seems like no other QEMU block device subsystem supports this notion of a shared block device the way NVMe does. So what if we were to rethink this model? What if we said that subsystems and namespaces shouldn't be modeled as devices? Because neither of them actually exposes virtual hardware: they don't expose memory, they don't expose IRQs or anything to the guest. Conceptually, a subsystem is the parent of controllers, as I said before, but it is not a PCI device, so how would it fit? And namespaces can be, in a sense, children of multiple controllers, but again, that runs against the qdev tree with its one-parent relationship, so we can't really express that. Fundamentally, namespaces and subsystems are just concepts of a device model, and they happen to benefit, design-wise, from being independent instances.

There is an alternative to this, and that's user-creatable objects, the ones you can instantiate on the command line with -object. Those actually might be more appropriate. Now, there are no existing devices that use objects quite like this, but there is the memory backend, a user-creatable object, and we actually use that in the NVMe device to back the persistent memory region. So, knowing that there was at least something doing stuff like this, I gave it a shot and tried to implement it this way.

So I posted this pretty big patch series that I called the "Dopepocalypse". Considering the size of the NVMe subsystem, it's a really huge patch series, but it is also a major refactoring, and I did a lot of work to try to make it as reviewable as possible. What it does is introduce subsystems and namespaces as user-creatable objects.

The goals of this series are, first, to introduce a new experimental controller device. This is mostly to get rid of some deprecated options, as well as changing how the subsystem link property works. It also introduces new experimental user-creatable objects for namespaces and subsystems; three of them: two namespace objects for the different namespace types, and one for the subsystem. It also exploits the QEMU Object Model extensively by implementing an abstract object that contains the base, common functionality of namespaces; the NVM namespace type and the zoned namespace type then derive from this base class. The zoned namespace actually even derives from the NVM object, because it just extends the NVM command set.

And we retain backwards compatibility by keeping the existing devices around. All the existing devices use the new object code internally, so there's no code duplication, and there aren't suddenly two trees to maintain. The goal, of course, is to deprecate the subsystem and namespace devices when these experimental objects stabilize.

There are some perks, of course, to introducing brand new models: we can clean up some of the confusing device parameters, such as msix_qsize, which is really just the maximum number of interrupt vectors for the device. There are some so-far-unofficially-deprecated, but should-have-been-officially-deprecated, parameters that we can also get rid of. And there are some fixes for how the namespace, and the subsystem as well, manage some parameters, which we can just fix up.

So instead, now, you create the subsystem as a user-creatable object, and you give it an identifier.
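As a rough command-line sketch of the flow I'm about to walk through; note that the object type names and the attachment parameter here are placeholders of mine, not necessarily the actual option names from the series:

    # NOTE: x-nvme-subsystem, x-nvme-ns-nvm and attached-ctrl are hypothetical
    # stand-ins for the experimental options, which may change during review.
    qemu-system-x86_64 ... \
        -object x-nvme-subsystem,id=subsys0,subnqn=subsys0 \
        -device nvme,id=nvme0,serial=deadbeef,subsys=subsys0 \
        -drive id=nvm0,file=nvm0.img,if=none,format=raw \
        -object x-nvme-ns-nvm,id=ns1,drive=nvm0,nsid=1,attached-ctrl=nvme0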
Then you add the controllers, like before, as devices, because they need to go on the PCI bus, and you attach each of them to the subsystem. Then you add namespaces; again, you use the NVM or the zoned version of them, and you attach them to the subsystem. And there's a new parameter, which can be used multiple times, to define which controllers a namespace should be attached to initially, at boot.

There are, of course, some trade-offs to using objects instead of devices. One of them is that properties are more verbose to define. That stems from the fact that in qdev, properties are sort of considered immutable: once you've set them, you shouldn't be able to change them, and you can't really hook into the getters and setters. You can do that with a raw user-creatable object, and that gives you a lot of flexibility, of course, at the cost of slightly more verbose property definitions.

There's also no realization phase like there is in qdev. What you can do is use a machine-init-done notifier to emulate it. I stole this from how the remote object works, which does something similar. And if you're not using object composition, which we are not really using in this implementation, then you are responsible for doing cleanup as devices that you depend on are removed from the device model.

So I think one of the lessons learned here is that you should consider all your options when you decide how to design your model. For instance, should it be split into multiple parts? And should those parts be devices or objects? Does it actually behave like a device? Does it expose virtual hardware or memory regions? Does it quack like a device? Because the flexibility of user-creatable objects might be what you're actually looking for, if you can get by and live without the luxury of the niceties of the qdev API.

Now, as my series goes through review, it's very possible that we end up going back to the bus-based approach, because that's how other devices do it right now. But I hope to at least push for this version, because I think it gives us a really nice model, and also some nice introspection features, because of these more flexible getters and setters on properties.

As for future work: one of the main things we have wanted to change for a long time is this QEMUSGList and QEMUIOVector duality. We use the QEMUSGList extensively, because most of the data that flows through the controller goes between the controller and the host, so it needs to use the DMA helpers, and we use those because they're nice. But there's also a bunch of commands that only move data internally on the controller, like Copy and Verify, and those use QEMUIOVectors, so we have to convert between the two. And whenever data goes into or out of the controller memory buffer or the persistent memory region, we also need to use QEMUIOVectors. So one of the ideas here is to open-code the DMA mapping and transfer logic from the DMA helpers into the device, and process the PRP or SGL data incrementally as DMA resources become available, which would completely remove the need for the temporary QEMUSGList data structure.

One of the other things we're working on is trying to reduce the latency of the device. This is especially relevant for profiling, for polling, and maybe for drivers that don't rely on interrupts.
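As a minimal sketch of getting the storage backend out of the latency picture for profiling, not truly in-memory, but with storage latency removed entirely, you can back a namespace with QEMU's null block driver:

    qemu-system-x86_64 ... \
        -drive id=nvm0,if=none,format=raw,file=null-co://,file.read-zeroes=on \
        -device nvme,id=nvme0,serial=deadbeef \
        -device nvme-ns,drive=nvm0,bus=nvme0

With that, whatever latency remains is down to the device emulation itself.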
And one of the few ways to reduce the latency of the device is exactly that: keep I/O in memory, with no file I/O, and use iothreads or dedicated cores for queue processing. There are also some paravirtualization features in NVMe that can reduce the number of MMIO accesses, the doorbell writes, and thereby the number of VM exits we need to do. This is not implemented yet, but there are some existing patches floating around that basically just need to be picked up. I think the FEMU SSD emulator, which is a fork of QEMU, actually implements this as well.

So the big question here is: can the emulated device possibly be made faster, or at least better latency-wise, than available hardware? Or at least, how low can we actually go in terms of latency, and can it get good enough for profiling? What I was hoping for here, especially for the Q&A, is that if anyone has any knowledge of or experience with doing stuff like this, and they know that this is a hopeless endeavor, then I'd be happy to know about that, so I don't go wasting my time too much on things that I just think are fun.

So, that was my talk. I think we're at the top of the hour, so thank you for attending, and I hope to discuss things with you afterwards. Thank you.