Hello, I'm Michael Tsirkin. I work at Red Hat as a distinguished engineer and I'm the chair of the Virtio Technical Committee. Today I'm going to talk about the work we are doing in virtio for the benefit of hardware virtio implementations. I think virtio is kind of unusual in that it was defined as a software interface first of all, and hardware came afterwards. So we have hardware emulating software, in a sense. I plan to describe some challenges that surface when you try to do it like this, and hopefully this will be interesting for people interested in both software and hardware.

I like starting with this slide because it just keeps growing year to year. It shows all the types of devices for which we have a detailed description included in the virtio specification. By now this is a huge specification. Just since last year we have included the memory device, the IOMMU device and the sound device, and lots more are works in progress. The point I'd like to make here is that it's a very popular standard, and there exists a strong ecosystem around it with a large number of existing devices. Most of these devices started out with a pure software implementation, and then hardware implementations of virtio began to enter the scene.

It all began with a quest for performance, where a cloud vendor, for example, would say: right now I'm burning host CPU cycles moving packets between a virtio device and a host NIC. How about we teach the NIC to talk virtio directly instead? Sounds reasonable, doesn't it? Naturally, the NIC hardware would then be built to provide the features that existing guests use, since that's the software people want to speed up. In response, we on the virtio side of things noticed this trend and thought: that's great, but isn't there anything we can do to make hardware work better? We'll put it in the spec, we'll build it into software, and this time, after the software is widely deployed, it will be worth it for hardware to target the software extensions. Of course, we don't do this in a vacuum. We have discussed these extensions with hardware vendors, so they make sense for them. These extensions are what this presentation is about.

The reason hardware virtio offload works at all is that virtio devices actually pretend to be PCI devices for the virtual machine. Here I ran the lspci command within a VM, and its output lists a bunch of PCI devices, among them a virtio device. So if you implement a device that looks just like this, then existing drivers within VMs will bind to and work with your device, and that's a lot of software that can suddenly use your hardware without modification.

Now, if you look at the specification of any virtio device, there are usually two parts to it. One part is device configuration. This is a control path interface used for initialization and things like that. But during data path operation, the driver manipulates data in virtqueues. These reside in guest RAM and are accessed by the device using direct memory access. Since hardware virtio is primarily motivated by performance, let's first of all talk about the data path, that is, the virtqueues.

This is the original virtqueue format. It's called a split virtqueue, and it was designed with software in mind. It is kind of hard to use efficiently from hardware, and this slide tries to show why.
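For reference, the split virtqueue consists of three separate structures in guest memory, roughly as laid out in the virtio specification (all fields are little-endian). The comments note the dependent reads a device has to perform to use a buffer, which is what makes the format awkward for hardware:

```c
#include <stdint.h>

/* Split virtqueue, roughly as in the virtio spec.  To consume a buffer the
 * device must read avail->idx, then the entry in avail->ring[], then the
 * descriptor it points to (possibly chasing a chain via 'next') -- each
 * read depends on the result of the previous one. */

struct virtq_desc {
    uint64_t addr;    /* guest-physical address of the buffer            */
    uint32_t len;     /* buffer length                                   */
    uint16_t flags;   /* NEXT / WRITE / INDIRECT                         */
    uint16_t next;    /* index of the chained descriptor, if any         */
};

struct virtq_avail {
    uint16_t flags;
    uint16_t idx;     /* driver advances this as it adds entries         */
    uint16_t ring[];  /* descriptor indices made available by the driver */
};

struct virtq_used_elem {
    uint32_t id;      /* head of the descriptor chain that was used      */
    uint32_t len;     /* bytes written by the device                     */
};

struct virtq_used {
    uint16_t flags;
    uint16_t idx;     /* device advances this as it returns buffers      */
    struct virtq_used_elem ring[];
};
```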
Without going into too much detail, there's a virtio card at the bottom, and the red arrows are PCI Express transactions that need to happen for the device to do something for the driver. Even without knowing too much about it, you can see that there are lots of round trips from the device to main memory. All of these need to happen in order: the result of one read is necessary to initiate the next one, and each of them increases the latency of the device as observed by the driver.

The latest virtio specification already made a first step towards improving this latency by introducing the packed ring virtqueue. That one simply consists of a ring of descriptors. Each descriptor includes a physical address and a length. They also include an identifier, which allows supporting out-of-order completions; I'm going to talk about this later. There's a flags field which marks a descriptor as available or used, and that's pretty much it. To process some descriptors, the device first reads the flags field to verify that the descriptor is valid. If it is, the descriptor itself is read and processed, and afterwards the device writes into the flags field to mark the descriptor as used. I think it's kind of obvious that fewer transactions are necessary here than with a split ring. In fact, on some platforms the read granularity is bigger than the descriptor size, which removes the need for a separate read of the flags field: it can be read with the rest of the descriptor atomically. However, this is not guaranteed by the PCI Express spec, and these platforms are generally not so easy to detect. For this reason, the virtio spec now also includes a way to remove this extra read in a portable way. Specifically, devices can request that the offset of the last available descriptor is sent to them by the driver whenever the driver notifies the device that new buffers are available. Using that, we end up with a single read and two writes being sufficient to process a descriptor. That's way better than the split ring, both from the latency and from the transaction count point of view.

But this is not the only optimization included in the virtio spec. Here's another one. It turns out it's pretty common for devices to use descriptors in the same order in which they were made available. For example, this is usually the case for the transmit ring of the virtio network device, because reordering packets by a network device is generally frowned upon. In this case, device and driver can negotiate a feature which allows the device to signal the use of a batch of multiple buffers by only writing out the last descriptor in the batch. This reduces the transaction count and the overhead on PCI Express even further. The simple, basic variant of this feature that I just described is already included in the released version of the virtio specification. But that turned out to be somewhat limited. To understand the limitation, let me take a step back and talk about page faults first.

Page faults: good or bad? There's an argument that goes something like this: page faults slow things down; hardware virtio is all about performance; therefore hardware virtio devices do not need to support page faults. But I'm not sure I agree, and let me start with an example. A virtio hardware device might support thousands of virtual functions, each of which can be assigned to a virtual machine. So if you have thousands of VMs, it might be practical to give each virtual machine a virtual function.
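As a quick aside before we dig into that example: the packed descriptor described earlier has roughly the following layout, with field names following the virtio specification. A single 16-byte entry carries everything the device needs, which is why a single read is often enough:

```c
#include <stdint.h>

/* Packed ring descriptor, roughly as defined in the virtio spec
 * (all fields little-endian). */
struct pvirtq_desc {
    uint64_t addr;   /* guest-physical address of the buffer              */
    uint32_t len;    /* buffer length                                     */
    uint16_t id;     /* buffer ID, reported back on completion; this is
                        what makes out-of-order use possible              */
    uint16_t flags;  /* includes the "available" and "used" marker bits
                        that the device reads and writes                  */
};
```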
However, it is not really practical to put the memory of thousands of virtual machines in the memory closest to the device. What we do want is to put the most used memory of the most active virtual machines in that memory. And one way to detect the most used memory of the most active virtual machines would be exactly the same way it's done for process memory: device access to a page is blocked, and the next access will cause a fault that we can detect and handle by moving the page closer to the device. Given enough time, the result will adjust to the workload and perform better than static partitioning. I hope that's a good example that shows how page faults can actually be good for performance.

Another argument revolves around the practicality of implementing page faults efficiently. There are two main issues that I'm aware of. The first has to do with the fact that PRI, the Page Request Interface, which includes support for page faults from devices, is still not widely supported on current systems. To address this problem for now, devices can detect and report page faults to the host in a device-specific way, for example through the physical function of the device. If the device has an on-device IOMMU, detecting page faults is probably easier. Another problem that is often mentioned in this context is device stalls. Stalls are only a problem for some devices, such as the receive queue of the virtio network device, which gets lots of data from the outside world. Let's see, for example, why it isn't a big problem for the transmit ring of the virtio network device. At the bottom of the slide, you see a ring with a bunch of packets. As long as the ring is stalled, the packets just stay in guest memory. Meanwhile, more packets can be queued in guest memory. If the queues are deep enough, no packets need to be lost.

By comparison, let's see what the problem is for the receive ring. To the right here, you see packet number one coming in to the device. The driver has prepared multiple buffers in the receive ring. However, translating the first buffer immediately causes a page fault at the IOMMU. Processing now stalls until the first buffer becomes available. Meanwhile, more packets arrive at the device. Since the ring is stalled, they are not stored in the buffers prepared for them by the driver. Eventually, the packets will overflow the device memory and cause packet drops.

It turns out, however, that virtio ring rules actually allow devices to fix this problem. The rule that I'm talking about is that devices can use buffers in an order that's different from the one in which they were made available to them by the driver. This capability is normally not used by networking devices; it's used by storage devices, such as virtio-blk, for example. And we can utilize this capability to prevent stalls. After detecting a fault on a buffer, the device can try storing the first packet using the next available buffer, and then the next packet using the buffer after that. In this picture, the fault on the first buffer got resolved after processing four packets. The fifth packet is now stored in the first buffer, and then we can proceed as usual. Note that buffers are reported to the driver out of the order in which they were made available, but in order with respect to incoming packets, so packets are not reordered from the driver's point of view. That's great.
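To make that handling concrete, here is a hypothetical device-side sketch of the approach. None of the helper functions are spec interfaces; they just stand in for device firmware primitives, and the whole thing is an illustration of the idea rather than an implementation:

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal packed descriptor layout, as sketched earlier. */
struct pvirtq_desc { uint64_t addr; uint32_t len; uint16_t id, flags; };

/* Assumed device primitives (hypothetical, for illustration only). */
extern struct pvirtq_desc ring[];              /* packed RX ring in guest RAM  */
bool desc_is_avail(const struct pvirtq_desc *d);
bool translate_ok(uint64_t gpa, uint32_t len); /* false if the IOMMU faults    */
void dma_write(uint64_t gpa, const void *src, uint32_t len);
void mark_used(struct pvirtq_desc *d, uint32_t written);
void request_page(uint64_t gpa);               /* start async fault handling   */
uint16_t ring_next(uint16_t idx);

#define MAX_PENDING 64
static uint16_t pending[MAX_PENDING];          /* buffers whose pages faulted  */
static unsigned int npending;
static uint16_t next_avail;                    /* next untried available entry */

/* Called once per incoming packet; returns 0 if the packet was stored.
 * Packets complete in arrival order, buffers may complete out of order. */
int store_packet(const void *pkt, uint32_t len)
{
    /* Retry buffers that faulted earlier: their pages may be resident now. */
    for (unsigned int i = 0; i < npending; i++) {
        uint16_t d = pending[i];
        if (translate_ok(ring[d].addr, ring[d].len)) {
            dma_write(ring[d].addr, pkt, len);
            mark_used(&ring[d], len);          /* completed out of order       */
            pending[i] = pending[--npending];
            return 0;
        }
    }
    /* Otherwise walk forward through newly available buffers. */
    while (npending < MAX_PENDING && desc_is_avail(&ring[next_avail])) {
        uint16_t d = next_avail;
        next_avail = ring_next(next_avail);
        if (translate_ok(ring[d].addr, ring[d].len)) {
            dma_write(ring[d].addr, pkt, len);
            mark_used(&ring[d], len);
            return 0;
        }
        pending[npending++] = d;               /* remember it, keep going      */
        request_page(ring[d].addr);
    }
    return -1;  /* nothing usable yet: packet waits in device buffers or drops */
}
```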
However, handling a page fault like this doesn't coexist with the in-order optimization that I described previously, because that optimization disallows used-buffer reordering. For this reason, the Virtio Technical Committee is currently discussing an extension that allows both page fault handling and the in-order optimization to coexist. Specifically, a new per-descriptor flag is defined. It is only set when all previous outstanding buffers have been processed, and the device is only allowed to skip writing out descriptors if the flag is set and all previous outstanding buffers have been processed in order. As long as there are page faults and buffers are used out of order, the flag is clear, and therefore all descriptors are written out by the device. When things stabilize and there are no more page faults, the flag is finally set and the device can skip writing out some descriptors, improving performance.

Let me try to advocate for page fault support in hardware some more. Page faults enable memory overcommit, which is very popular among cloud vendors. For example, the Firecracker hypervisor seems to focus on overcommit as a very important feature. Now, a device with a large number of virtual functions should be able to support a virtual function per virtual machine, and if this can be made not to conflict with memory overcommit, these hypervisors are likely to use it. The same can be said for transparent huge pages. These work by moving pages around, which in turn can only work if we can make some pages non-present, move them around, and make the pages present again. During this time, hardware can trigger faults. Also, the host might be overcommitted only temporarily; these users actually expect performance to pick up once some memory becomes free.

Post-copy live migration is another useful feature that needs page faults. Let's talk shortly about how it generally works. A virtual machine is simply started on the destination, even though at least some of its memory has not been migrated there yet. When the guest accesses the missing memory, a page fault triggers. This is reported to QEMU using the userfaultfd mechanism, and QEMU then fetches the page from the source and restarts the guest. This has multiple advantages. First, the virtual machine can start on the destination sooner, freeing up source CPU. Even if the guest accesses a lot of memory, it will still make progress and eventually migrate; that's unlike pre-copy, which sometimes suffers from livelocks with busy guests. Virtual machine memory on the destination is not write-protected. Once all memory has been migrated, no more faults need to trigger.

So that's nice, but it doesn't work with a hardware device. Why? Simply put, at the moment, hardware will try to pin all guest memory. At that point, it will just block on the destination until migration is complete, and that defeats the purpose of live migration. Here's how hardware page faults would fix this: when hardware accesses a missing page, a fault triggers, the page is then fetched from the source, and the device is notified to retry the access. Note that, at least at the moment, the page request interface support in Linux doesn't work together with userfaultfd. One way to address that would be, again, to use a device-specific way to trigger userfaultfd on a hardware page fault. Note also that, again, if there is locality to device accesses, the first access will cause a fault, but following accesses cause no page faults, so performance is good.
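Since userfaultfd does the heavy lifting for post-copy, here is a minimal sketch of that flow, in the spirit of what a hypervisor like QEMU does. It assumes a hypothetical page_from_source() helper that fetches page contents from the migration source, and error handling is omitted:

```c
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns a local copy of the page at 'addr',
 * fetched from the migration source. */
void *page_from_source(unsigned long addr);

/* Serve missing-page faults for not-yet-migrated guest memory. */
void serve_postcopy(void *guest_mem, size_t guest_len, long page_size)
{
    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);

    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    /* Ask the kernel to report faults on the guest memory range. */
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)guest_mem, .len = guest_len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    for (;;) {
        struct uffd_msg msg;
        read(uffd, &msg, sizeof(msg));           /* blocks until a fault */
        if (msg.event != UFFD_EVENT_PAGEFAULT)
            continue;

        /* Fetch the page from the source and place it atomically,
         * waking up the faulting vCPU. */
        unsigned long addr =
            msg.arg.pagefault.address & ~((unsigned long)page_size - 1);
        struct uffdio_copy copy = {
            .dst = addr,
            .src = (unsigned long)page_from_source(addr),
            .len = page_size,
        };
        ioctl(uffd, UFFDIO_COPY, &copy);
    }
}
```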
Let's talk about possible live migration solutions generally. There are two types of migration. Pre-copy migration is when the virtual machine memory is migrated while the virtual machine and the device are running on the source host. Since the device can modify the virtual machine memory, we need to detect that and update the copy on the destination. This can be done in a variety of ways: page faults will work, of course, or the device can report the changed pages to the host through some asynchronous interface, or the hypervisor can process the device rings for the duration of the migration. Post-copy I just described: that's when an access causes the destination to fetch data from the source hypervisor. Again, page faults do fix it, or it can be supported by running a software emulation until migration completes, which can potentially take a long time.

There's an interesting issue when you deal with live migration of hardware devices, and it actually has to do with the control plane of the device. To move a virtual machine between hosts, the devices that it uses on both hosts need to be compatible with each other. But how can we ensure that when the devices are potentially from different vendors? Devices sold by different vendors differ significantly; they even differ for a single vendor. They can expose different optional capabilities, such as offloads, and different amounts of resources, such as queues, interrupts and so on. So the only hope seems to be to find the least common denominator supported by all the hardware and use that to start a virtual machine. This looks like something that should be solvable with some tooling. For example, a tool could exist alongside QEMU which can query and report the capabilities of host hardware devices. This tool is run on each node, and it collects information about all the supported features and the number of resources such as queues, interrupts and so on. After getting the information from all cluster nodes and collecting it together, another tool can then merge the reported information and come up with the least common denominator that can migrate across the cluster. In the example on this slide, some features are only present in part of the cluster, and therefore after merging they are disabled. The number of virtqueues also differs between nodes, so the smallest number of virtqueues across all nodes is used. Starting a virtual machine with the resulting parameters for the device ensures that the device can be instantiated identically on all cluster nodes, so the VM can be migrated between them at will.

The Virtio TC is looking into even more optimizations helpful for hardware. One idea is to try and improve the batching of used descriptor writes by the device. The issue we are trying to solve is that sometimes a single used descriptor refers to multiple available descriptors. Let's look at this example. Here descriptor number nine implied the use of descriptors eight and seven, for example because of the in-order feature. In that case, the device writes out a single descriptor, number nine, and then skips forward in the ring over two descriptors, since the use of descriptors eight and seven was implied. That's not a problem as such, it works, but it does mean that multiple write transactions are now necessary to write out the used descriptors, because there's a gap between descriptor nine and the following used descriptor. Before we discuss the solution, let's just shortly talk about the reason for skipping descriptors like this, which is what creates the problem.
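To make the gap concrete, here is a hypothetical sketch of what a device does today when completing an in-order batch; mark_used() and the descriptor layout are illustrative assumptions, not spec interfaces:

```c
#include <stdint.h>

/* Minimal packed descriptor layout, as sketched earlier. */
struct pvirtq_desc { uint64_t addr; uint32_t len; uint16_t id, flags; };
void mark_used(struct pvirtq_desc *d, uint32_t written);   /* assumed primitive */

static uint16_t used_pos;                       /* where the next used write goes */

/* In-order batch completion: only the last descriptor of the batch is
 * written back, and the write position then jumps past the implied
 * entries, so consecutive used writes land in non-adjacent ring slots. */
void complete_in_order_batch(struct pvirtq_desc *ring, uint16_t ring_size,
                             uint16_t batch_len, uint32_t bytes_written)
{
    /* e.g. batch 7, 8, 9: only descriptor 9 is written out. */
    uint16_t last = (uint16_t)((used_pos + batch_len - 1) % ring_size);

    mark_used(&ring[last], bytes_written);      /* one PCIe write for the batch */

    /* Skip over the implied descriptors: the next used write starts after
     * 'last', leaving a gap between it and the previous used entry. */
    used_pos = (uint16_t)((last + 1) % ring_size);
}
```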
The reason has to do with a rule that the device must only use descriptors that the driver has made available, and therefore it must always stay behind the driver in the ring. This way, each descriptor is first written by the driver and made available to the device, and then it's written by the device and marked as used. However, if the driver is allowed to get ahead of the device by more than a full ring's length, then it will start wrapping around and overwriting used descriptors that the device just wrote. Skipping descriptors keeps the device and driver more or less in sync, not allowing the device to fall too far behind the driver.

We are considering an alternative way for the device to write out used descriptors one after another, without skipping space. To maintain the correctness invariant of the device staying behind the driver, the device can detect driver wrap-around, that is, detect that it has processed a full ring's worth of descriptors, and skip to the beginning of the ring at that point. Take a look at an example. Here the device skipped three descriptors: two, five and eight. After writing out descriptor nine, it detected that the end of the ring is three entries ahead, and it then skipped straight to the beginning of the ring, making sure the driver does not get too far ahead of the device. The driver can also keep track of skipped descriptors. Alternatively, the device can set a descriptor flag to signal the driver to skip to the beginning. In this way, a single PCI Express write can signal the use of multiple descriptors, potentially a full ring's worth.

So, to summarize: a large existing ecosystem makes virtio a compelling option for new devices. We are always looking to enhance the way we handle hardware virtio devices, with work on both performance and new features, and contribution is open to everyone. So please do join us. This ends the talk. If you have any questions, please feel free to reach out. The best way to do this is to post them on the virtio mailing list, since it is preferable to have all discussion happening in the open. The address is on the slide. Thank you very much for your patience and have a good day.