So, two years ago, at KVM Forum 2019, my colleagues Thanos and Swapnil presented muser. The idea of this was the ability to implement a device inside a separate userspace process, so separate from QEMU. It was based upon the VFIO mdev framework at the time, and as a result had some disadvantages: it worked well as a proof of concept, but it did require a kernel module as well as a kernel patch. Much has changed since then. In particular, we at Nutanix have worked with the community to define an alternative approach that is much better in many ways. It doesn't require any kernel changes, and it has significant advantages in terms of the simplicity of the implementation. One thing to note, since people may have seen muser previously: the APIs of the library have completely changed.

So, what is the vfio-user protocol? Essentially, it's a protocol for managing external device servers, that is, device implementations running in a separate process. The motivations are the same as muser's, around performance, security, and resilience; you may have attended talks on this topic earlier. It's essentially a message protocol over a communication channel. Most typically, that's going to be a Unix socket to another process on the same host, but other configurations are certainly possible. That could include TCP sockets, with an implementation in another VM, for example. It's similar to the VFIO ioctl interface; in fact, it's based upon that in many ways. And you can think of it as analogous to vhost-user, except that instead of being specific to virtio, it can implement generic devices, although typically this would be a PCI endpoint. It's also worth pointing out that this is VMM-agnostic: there's nothing specific to the QEMU implementation in the protocol.

These are the main message types that you would see in the protocol, based around the basics of lifecycle handling of the device, configuring interrupts for the device, and providing access to both device memory (device regions) and guest memory.

As I mentioned, though, we're going to talk mainly about the library interface and implementation. libvfio-user is a C library. It has two main roles. The first one is the socket server, implementing the server side of the protocol I just mentioned. The other part is that it provides easy support for implementing a PCI device. It has an API for synchronous and asynchronous implementations. The diagram here gives just one example of a possible usage, where we have cloud-hypervisor running a VM, talking the vfio-user protocol over a socket to SPDK, which has implemented a virtual NVMe PCI controller.

So here's a quick hello-world example of how you might use the library. You can see that in the beginning we create a context. We specify that we would like to use a Unix socket and that we would like to create a PCI device. We configure it with one device region here, which is BAR2. We set up an IRQ for the device. We set up some callbacks for guest memory, which we'll talk about shortly, and then we essentially sit in a loop handling the protocol messages coming through.
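In code, that hello-world looks roughly like the following. This is a minimal sketch against the libvfio-user API (exact signatures have changed between library versions; the socket path, BAR2 size, and empty callback bodies are placeholders, and error handling is omitted):

    #include <libvfio-user.h>

    /* BAR2 access callback: decode 'offset' into a device register and
       service the read or write */
    static ssize_t
    bar2_access(vfu_ctx_t *ctx, char *buf, size_t count, loff_t offset,
                bool is_write)
    {
        return count;
    }

    /* called when the client maps (or unmaps) a range of guest memory */
    static void
    dma_register(vfu_ctx_t *ctx, vfu_dma_info_t *info) { }

    static void
    dma_unregister(vfu_ctx_t *ctx, vfu_dma_info_t *info) { }

    int
    main(void)
    {
        /* serve the vfio-user protocol on a Unix socket, as a PCI device */
        vfu_ctx_t *ctx = vfu_create_ctx(VFU_TRANS_SOCK, "/tmp/mydev.sock", 0,
                                        NULL, VFU_DEV_TYPE_PCI);
        vfu_pci_init(ctx, VFU_PCI_TYPE_CONVENTIONAL, PCI_HEADER_TYPE_NORMAL, 0);

        /* one device region: BAR2, handled by the callback above */
        vfu_setup_region(ctx, VFU_PCI_DEV_BAR2_REGION_IDX, 0x100, &bar2_access,
                         VFU_REGION_FLAG_RW, NULL, 0, -1, 0);

        /* one INTx interrupt */
        vfu_setup_device_nr_irqs(ctx, VFU_DEV_INTX_IRQ, 1);

        /* callbacks for guest memory being mapped/unmapped */
        vfu_setup_device_dma(ctx, dma_register, dma_unregister);

        vfu_realize_ctx(ctx);
        vfu_attach_ctx(ctx);        /* wait for a client, e.g. QEMU */

        /* sit in a loop handling incoming protocol messages */
        while (vfu_run_ctx(ctx) >= 0) {
        }

        vfu_destroy_ctx(ctx);
        return 0;
    }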
On the right-hand side, you can see some sketches of what the callbacks might look like. In the BAR2 callback, for example, you might take the offset that's passed through, map it to a particular register in your device implementation, and handle that register access, whether it's a read or a write. IRQs we'll discuss briefly. You can also have callbacks for when guest memory is mapped, and this is an example of the information you can get in that callback.

OK, so as we mentioned, the library simplifies implementing a PCI device, in the sense that we have an implementation of most of PCI config space, with handling of things like capabilities and various well-known standard endpoint registers. It's not a fully complete implementation; we're essentially demand-faulting things in as we discover they need to be implemented. But it handles the majority of cases without consumers of the library needing to worry about PCI specification details. It's also possible to fully delegate the config space handling, and that's useful for the multi-process QEMU work on the server side. You can, as we saw, add callbacks for the device-specific implementation of BARs, and you can also have vendor-specific capabilities, which are likewise handled via callbacks.

Next, IRQ handling. This is a walkthrough showing how IRQ handling might work. During initial setup, QEMU, or another vfio-user client, may use the SET_IRQS message to essentially pass us an eventfd that corresponds to a particular IRQ vector. Then let's say we have an MMIO write from the VM that corresponds to, for example, a doorbell write for a storage device. That's handled on the VMM side by QEMU, and becomes, for example, a region-write message across to libvfio-user. The device implementation callback may process it as an I/O request, and when the I/O completes, it will raise a guest IRQ via vfu_irq_trigger(). That corresponds to an eventfd write, and QEMU, for example, may have that routed through to an irqfd to eventually raise the IRQ on the guest side.

Device region access. This allows parts of a device, for example BARs, to be accessed by the client or by the VM. In the diagram here, we have BAR0, which is actually backed by a temporary file managed by the vfio-user server process. In step one, we share that with the client by passing a file descriptor along with the socket message. That allows the client to directly map the same file, which can then also be plumbed through to the VM, so that the VM can directly access all or parts of a device without needing any VM exits or any socket messages at all. Which parts should be shared directly with the VM and which should be handled via MMIO is obviously device-specific. Steps three and four show how a non-mapped access might happen: we get an exit into QEMU, and QEMU sends a read or write request for the region, which is typically handled by the device implementation. The sketch below puts these pieces together: the shared mapping, the trapped doorbell register, and the IRQ trigger.
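What follows is a rough sketch of that combination, not the exact code from the talk: a BAR0 whose first page is backed by a server-managed file and mapped directly into the client, with a doorbell register in the second page trapped via the region callback, which triggers a guest IRQ on completion. The register layout, file path, and sizes are hypothetical; the DMA helpers vfu_addr_to_sg() and vfu_dma_read() are previewed from the next section, and note that in newer library versions dma_sg_t is opaque and must be allocated with dma_sg_size() rather than on the stack:

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/uio.h>
    #include <unistd.h>
    #include <libvfio-user.h>

    #define BAR0_SIZE       0x2000    /* hypothetical */
    #define DOORBELL_OFFSET 0x1000    /* hypothetical register layout */

    static uint64_t ring_addr;        /* guest DMA address of the request ring,
                                         programmed by the driver beforehand */

    /* steps 3 and 4: a non-mapped access exits to QEMU and arrives here
       as a region read/write message */
    static ssize_t
    bar0_access(vfu_ctx_t *ctx, char *buf, size_t count, loff_t offset,
                bool is_write)
    {
        if (is_write && offset == DOORBELL_OFFSET) {
            dma_sg_t sg;
            char desc[64];

            /* fetch a request descriptor from guest memory */
            if (vfu_addr_to_sg(ctx, (vfu_dma_addr_t)(uintptr_t)ring_addr,
                               sizeof(desc), &sg, 1, PROT_READ) == 1) {
                vfu_dma_read(ctx, &sg, desc);
                /* ... process the I/O request ... */
            }

            /* completion: an eventfd write, routed via an irqfd to raise
               the guest IRQ */
            vfu_irq_trigger(ctx, 0);
        }
        return count;
    }

    static void
    setup_bar0(vfu_ctx_t *ctx)
    {
        /* step 1: the first page is shared with the client by passing this
           file descriptor across the socket; the client (and VM) can then
           map it directly, with no VM exits or socket messages */
        struct iovec mmap_areas[] = {
            { .iov_base = (void *)0x0, .iov_len = 0x1000 },
        };
        int fd = open("/tmp/bar0", O_CREAT | O_RDWR, 0600);

        ftruncate(fd, BAR0_SIZE);
        vfu_setup_region(ctx, VFU_PCI_DEV_BAR0_REGION_IDX, BAR0_SIZE,
                         &bar0_access, VFU_REGION_FLAG_RW | VFU_REGION_FLAG_MEM,
                         mmap_areas, 1, fd, 0);
    }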
The other side of the coin is that it's also possible to share VM guest memory with the device implementation. In this case, QEMU has created some backing storage for VM memory, a hugepage-backed file for example, which is then mapped into the QEMU process and set up such that the VM can access that same memory. QEMU will typically send a message across the socket to the device implementation, essentially announcing the range of that memory that has just been mapped for the guest. Typically the information there includes the DMA address and size of that region. Often this will correspond to the guest physical address, but it could be different with a virtual IOMMU implementation, for example. Again, optionally, this memory can be directly shared by passing a file descriptor across. In that case, the device implementation server can directly map the guest memory, and consequently access guest memory without needing any socket messages. The API you see on the right shows some of the ways a device implementation can actually access this memory; if you think of I/O rings or I/O buffers, that would be one obvious example of when a device implementation would need to do so. We also handle dirty page tracking via these APIs for live migration, which we'll talk about very shortly. Now I'm going to hand over to Thanos, who's going to talk about live migration, give us a quick demo, and talk about some future work.

Thanks, John. Support for live migration is a recently added feature in libvfio-user. In VFIO, live migration uses a special device region, the migration region, which must contain a set of specific registers at its beginning. libvfio-user uses this mechanism and provides an optional API for simplifying live migration for the device. To enable live migration, the user must set up the migration region and provide certain callbacks that are executed by the library while the device is live-migrating.

These callbacks are, first, the device state transition callback, which is executed when the guest changes the device's migration state. The main states are the running state, which is the default state. At the source, we have the pre-copy state, where the device is still used by the guest but QEMU copies migration data in the background, and the stop-and-copy state, where the device is not used by the guest and QEMU copies the remaining migration data. In the meantime, at the destination, we have the resuming state, where QEMU receives migration data from the source QEMU and writes it to the device. When it's done, it switches to the running state.

The next callback is the get pending bytes callback. During the pre-copy and stop-and-copy states, QEMU executes the get pending bytes callback, where the device gets to tell QEMU how much migration data is left to be copied. QEMU then proceeds with reading this data, either directly from memory, if the device chooses to make the migration region memory-mappable, or by executing the read data callback, where the device returns a buffer with the data. QEMU keeps doing this until the device returns zero from this callback, to indicate that there's no more data to migrate. In the meantime, at the destination, QEMU calls the prepare data callback, where the destination device tells it where to write the migration data. Then QEMU proceeds with either writing this migration data directly to the migration region, if it's memory-mappable, or calling the write data and data written callbacks to explicitly provide the migration data to the device. It repeats these steps until all the migration data has been written. Registering these callbacks with the library looks roughly like the sketch below.
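This is a sketch of that registration against the migration API as it existed around this release; the struct fields mirror the callbacks just described, the bodies are placeholders, and exact names and signatures may differ between library versions:

    #include <libvfio-user.h>

    static int
    migr_transition(vfu_ctx_t *ctx, vfu_migr_state_t state)
    {
        /* e.g. pre-copy, stop-and-copy, resuming, running */
        return 0;
    }

    static uint64_t
    migr_get_pending_bytes(vfu_ctx_t *ctx)
    {
        return 0;   /* how much migration data is left to copy */
    }

    static int
    migr_prepare_data(vfu_ctx_t *ctx, uint64_t *offset, uint64_t *size)
    {
        return 0;   /* tell QEMU where in the region to read/write */
    }

    static ssize_t
    migr_read_data(vfu_ctx_t *ctx, void *buf, uint64_t count, uint64_t offset)
    {
        return 0;   /* source side: return a buffer of migration data */
    }

    static ssize_t
    migr_write_data(vfu_ctx_t *ctx, void *buf, uint64_t count, uint64_t offset)
    {
        return count;   /* destination side: receive migration data */
    }

    static int
    migr_data_written(vfu_ctx_t *ctx, uint64_t count)
    {
        return 0;   /* destination side: this chunk has been delivered */
    }

    static const vfu_migration_callbacks_t migr_callbacks = {
        .version           = VFU_MIGR_CALLBACKS_VERS,
        .transition        = migr_transition,
        .get_pending_bytes = migr_get_pending_bytes,
        .prepare_data      = migr_prepare_data,
        .read_data         = migr_read_data,
        .write_data        = migr_write_data,
        .data_written      = migr_data_written,
    };

    static void
    setup_migration(vfu_ctx_t *ctx)
    {
        /* the migration region itself must be set up first, e.g. with
           vfu_setup_region(ctx, VFU_PCI_DEV_MIGR_REGION_IDX, ...) */
        vfu_setup_device_migration_callbacks(ctx, &migr_callbacks,
                                             vfu_get_migr_register_area_size());
    }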
In our original muser presentation, we demonstrated a GPIO device, a simple device with an external pin that is either 0 or 1 and that can be read by the host driver. We'll now demonstrate the same device implemented using the latest libvfio-user. We'll then live-migrate this virtual device to another host, except that the destination device is implemented in Rust, using libvfio-user rather than C.

Let's quickly look at the GPIO implementation. It's essentially what John showed earlier. We create a libvfio-user context, we initialize the PCI configuration space, we set up the BAR2 region, then we also set up the migration region and configure it to use the migration callbacks. We then configure IRQs, and finally realize the device implementation and run the device. Let's build and run the GPIO device. Let's run it. Now let's run the source QEMU and initialize the device in the guest. Now, whenever we read the pin, it flips from 0 to 1 every three times it's read. So let's read it: 1, 0, 0, then 1, then 0 again.

Now let's go to the destination. Let's quickly look at the Rust implementation. It's very similar to the C implementation. We create the libvfio-user context, we initialize the configuration space, we set up BAR2, then the migration region, and then the migration callbacks right here. We then finalize device emulation and run it. Let's build and run the Rust GPIO device, and now let's start QEMU at the destination. Now let's migrate it. Let's keep reading. We read it at the source after it flipped, and it was 0, so it should be 0 two more times and then flip to 1. 0, 0, and then it correctly flips to 1.

We're still at the beginning of making device emulation easier with vfio-user. First, we want to make the library more stable and add more automated testing. We also want to make use of the ongoing ioregionfd work in KVM, which allows for faster MMIO handling when a device region isn't memory-mappable. Another area to improve on is the client using a virtual IOMMU; although this is a vfio-user client matter, we can certainly do more to make it easier for devices to cope with more complex DMA requirements. Better PCI support would be nice to have, for instance handling more PCI capabilities or supporting PCI bridges. Multithreading is also on our list, since some libvfio-user functions are not thread-safe. Making device emulation restartable would be great, since that would give us better resilience and would make upgrades more lightweight. Supporting transports other than a Unix domain socket, as well as other device types, is also on our list. Finally, we'd like to explore how libvfio-user can be used to perform hardware device mediation and SR-IOV. libvfio-user is on GitHub and it's BSD-licensed. We have a mailing list, but the best way to reach us is via Slack. Thank you.