Good afternoon, and welcome to our multi-process QEMU update. I'm John Johnson, and co-presenting with me today are Jag Raman and Elena Ufimtseva. We have been working on splitting QEMU into multiple processes, where a guest device emulation runs in a remote process separate from the VM manager process. The original motivation behind this idea was security: it reduces the surface for malicious guests to attack, and it also offers defense in depth if the device emulation is compromised. But along the way, we've discovered some fellow travelers who want to use the same concept for performance and scalability.

The original idea, which we presented at KVM Forum 2019, is shown here. It was to use two QEMU instances in a client-server model. In this scheme, the server QEMU would do the remote device emulation, and the client QEMU would be the traditional VM manager. The server ran the existing QEMU device emulation code unmodified, while the client was modified to use a proxy for each device being emulated that forwards the device operations to the server over a Unix socket. To accomplish this, we invented a custom protocol to describe the requests that needed to be communicated between the client and server. You can see all this on the slide: the client is on the right, the server is on the left, and the custom protocol is the line in between.

Today's presentation covers how we changed our original idea to one based on VFIO. VFIO is an existing QEMU facility that allows guests to directly access host hardware, mediated by the host kernel's VFIO driver. The existing QEMU VFIO client uses a set of ioctl commands to communicate with the kernel driver. As we can see in this slide, the QEMU side looks almost exactly the same, except that the VFIO client uses ioctls to talk to a kernel VFIO driver, which then talks to the device on the host. We changed the existing VFIO client to encapsulate these ioctl commands and send them to the server over a socket. On the server side, we use a library developed by Nutanix that can process the encapsulated VFIO commands and provide a standardized API for use by device emulation programs. We then changed our QEMU server to use that libvfio-user API to communicate with the client QEMU. As with our original concept, the QEMU device emulation code remains completely unchanged.
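To make the encapsulation idea concrete, here is a minimal sketch of what forwarding a single VFIO request over the socket could look like. The message header layout and command number are hypothetical illustrations invented for this sketch, not the actual vfio-user wire format defined in the protocol specification; only struct vfio_region_info and VFIO_DEVICE_GET_REGION_INFO come from the kernel's VFIO UAPI.

```c
/*
 * Minimal sketch of encapsulating a VFIO ioctl as a socket message.
 * The header layout and command value are hypothetical, not the actual
 * vfio-user wire format.
 */
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <linux/vfio.h>

struct msg_hdr {            /* hypothetical framing */
    uint16_t msg_id;        /* matches replies to requests */
    uint16_t command;       /* which VFIO operation is encapsulated */
    uint32_t size;          /* header + payload size in bytes */
    uint32_t flags;         /* request vs. reply, error, etc. */
};

#define CMD_DEVICE_GET_REGION_INFO 5   /* hypothetical command number */

/* Instead of ioctl(fd, VFIO_DEVICE_GET_REGION_INFO, &info), send the same
 * struct over the UNIX socket and let the remote process answer it. */
static int send_get_region_info(int sock, uint16_t msg_id, uint32_t index)
{
    struct vfio_region_info info = {
        .argsz = sizeof(info),
        .index = index,
    };
    struct msg_hdr hdr = {
        .msg_id  = msg_id,
        .command = CMD_DEVICE_GET_REGION_INFO,
        .size    = sizeof(hdr) + sizeof(info),
        .flags   = 0,
    };
    char buf[sizeof(hdr) + sizeof(info)];

    memcpy(buf, &hdr, sizeof(hdr));
    memcpy(buf + sizeof(hdr), &info, sizeof(info));
    return write(sock, buf, sizeof(buf)) == (ssize_t)sizeof(buf) ? 0 : -1;
}
```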
This project could not have been made a reality without the collaboration of the multiple project teams listed here. Chief among them is the Nutanix libvfio-user library team, who are presenting later today. As I mentioned on the last slide, their library provides the C binding for our new device emulation server. libvfio-user is not limited to use by only QEMU; it can also be used by other device emulation programs, such as the Intel SPDK project that has already been presented here at KVM Forum. Also noted on the slide are the GitHub repos where you can look at the actual code.

Here we go into detail on the changes we made to use VFIO. The biggest change is to use the existing VFIO model instead of developing our own proxy model. This allows us to reuse almost all of the kernel ioctl handling code that QEMU has used for years. This provides several benefits. One is that the existing VFIO code has features we haven't yet added to our custom protocol, such as IOMMU support. This is an ongoing process: while we were developing VFIO-user, VFIO added migration support, which we then added to VFIO-user with only minor modifications. Another benefit is maintenance. Bug fixes and performance enhancements in VFIO will also apply to VFIO-user; many will require no additional VFIO-user-specific changes. Even features that do require new ioctl commands will be relatively simple to add to VFIO-user. The protocol is based on these ioctl commands, so new ones will just need to be encapsulated into the protocol, though I'd also note that protocol additions will also need to be implemented in VFIO-user.

The changes from kernel VFIO are very minor. The biggest one is that VFIO-user does not require any kernel modifications: it does not need an updated VFIO driver, and it should even work if no VFIO driver is installed at all. It also does not require any elevated privileges in order to be configured. No /sys or /dev VFIO files are used, as is the case with kernel VFIO. Everything runs in userland; it is VFIO-user, after all.

We're now about to describe the actual VFIO-user client implementation. For reference, it's the shaded area on the right side of the slide. As I've mentioned before, we want to reuse as much VFIO code as possible. To do this, we defined an abstract superclass for both the kernel and user implementations to use. This class contains most of the current VFIO code. The only subclass-specific code is in the option parsing and the setup and teardown of the device object. Most of the changes to the superclass are about whether to issue the ioctl to the kernel in the kernel implementation or send a message over the socket in the VFIO-user implementation.

VFIO-user uses an IO thread to receive messages from the server. Incoming messages are classified as either replies to client requests or requests from the server. The former signal the thread that sent the request, while the latter invoke a callback into VFIO to process the request. All devices currently use the same IO thread, but if scalability becomes an issue when multiple VFIO-user devices are in use, this can be changed to one IO thread per device.

A CPU thread that sends a request to the server will often wait for the server to reply. Since the requesting thread calls into VFIO holding the big QEMU lock (BQL) that serializes much of the QEMU device emulation code, we could cause performance issues if we blocked waiting for the server process to reply while holding the BQL. We solve this by dropping the BQL whenever possible and using our own per-socket mutex for internal locking. One situation where we cannot drop the BQL is when map or unmap messages are being sent during an address space change transaction. These transactions are also serialized by the BQL, so VFIO-user must keep holding the BQL and send the requests asynchronously. When the address space transaction commits, we then wait for the youngest request to complete. This indicates all requests have completed, since requests are replied to in the order in which they are sent.
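A minimal sketch of the "wait for the youngest request" bookkeeping described above, assuming replies arrive in the order requests were sent. The type and function names are hypothetical, not the actual QEMU code.

```c
/*
 * Map/unmap requests are sent asynchronously while the BQL is held; the
 * address-space commit then waits only for the last request sent.  Because
 * replies come back in order, completion of the last request implies
 * completion of everything sent before it.  Names are hypothetical.
 */
#include <pthread.h>
#include <stdint.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    uint64_t last_sent;       /* id of the most recent async request */
    uint64_t last_completed;  /* id of the most recent reply seen */
} AsyncTracker;

static AsyncTracker tracker = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .cond = PTHREAD_COND_INITIALIZER,
};

/* Called by the requesting thread for each DMA map/unmap in the transaction. */
static uint64_t send_async_request(AsyncTracker *t /*, message ... */)
{
    pthread_mutex_lock(&t->lock);
    uint64_t id = ++t->last_sent;
    pthread_mutex_unlock(&t->lock);
    /* ... queue the message on the socket without waiting for a reply ... */
    return id;
}

/* Called by the IO thread when a reply arrives (replies arrive in order). */
static void reply_received(AsyncTracker *t, uint64_t id)
{
    pthread_mutex_lock(&t->lock);
    t->last_completed = id;
    pthread_cond_broadcast(&t->cond);
    pthread_mutex_unlock(&t->lock);
}

/* Called at address-space commit: wait only for the youngest request. */
static void wait_for_youngest(AsyncTracker *t)
{
    pthread_mutex_lock(&t->lock);
    uint64_t youngest = t->last_sent;
    while (t->last_completed < youngest) {
        pthread_cond_wait(&t->cond, &t->lock);
    }
    pthread_mutex_unlock(&t->lock);
}
```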
I'll now hand things over to Jag, who will cover the QEMU server changes.

Hi, I'm Jag. I'm going to give you an overview of the QEMU server. As you can see in the block diagram, the server runs as a standalone process at the other end of the channel. Presently Intel SPDK and QEMU each implement a server using the libvfio-user library. Theoretically, the server could be any process that can handle the incoming commands from the client, but for this discussion we are focusing on the server implemented by QEMU.

The QEMU server consists of the following major components. First, a new machine type called x-remote; this machine runs QEMU without any CPUs. Second, a new PCI host bridge. Unlike traditional host bridges, which link the PCI devices with the vCPUs, this bridge connects the PCI devices to the QEMU client. The third component is an IO hub that handles the interrupts generated by the PCI devices. The next component is the libvfio-user library from Nutanix, which converts the VFIO protocol into its own set of callbacks and data structures. Last, but not least, it includes a VFIO-user object which glues the library and the PCI devices on the server; more on this later.

The libvfio-user library is presently a submodule of QEMU's Git repo. QEMU compiles and links it as part of its build process. In the future, we hope this library will be available as part of various Linux distributions.

The server plays a vital role in the lifecycle of a PCI device, which includes initialization, steady-state operation, migration, and termination. Let's take a brief look at what happens during the initialization phase. To take a step back, VFIO-user requires an orchestrator to launch the processes. The orchestrator would launch the server first and then the client, as the client needs specific device information to initialize itself. During this initialization phase, the VFIO-user object that glues the library and PCI devices performs the following critical steps to operationalize the server. The first step is to acquire a handle for the device from the library; future interactions with the library use this handle to uniquely identify the device. As part of its specification, libvfio-user defines function callbacks for various PCI operations, such as config space accesses, BAR accesses, and interrupt delivery. Additionally, it allows the migration of this device to another QEMU server. The VFIO-user object defines handlers for each of these operations and registers them with the library during this phase.

Now let's turn our focus to the steady-state operations of the server. One of the essential functions of a PCI device is to perform DMA, that is, to transfer data between the RAM and the PCI device. The client maps the VM's RAM into the server's address space to facilitate DMA when the device runs in a separate process, as explained below. The client registers a memory listener with QEMU, which notifies the server whenever it detects changes to the VM's RAM. This notification describes the change and includes a file descriptor. libvfio-user uses the file descriptor sent by the client to map the VM's RAM regions into the server. The library then notifies the DMA handlers in the QEMU server, which modify the device's address space to reflect the change. The other significant functions that the server performs during steady-state operation are to handle BAR accesses, config space accesses, and process interrupts. The BAR access path is the most frequently executed operation and therefore forms one of the performance-sensitive parts, along with interrupt delivery. Elena will give an overview of how interrupts are handled later in the presentation.

Last but not least is migration. VFIO-user leverages the recently implemented migration support for VFIO. The client kicks off the process by notifying the server, one device at a time, that it's time to pack up its bags and gather all its state. The state is then transported to the destination QEMU server. Ideally, we would be able to migrate individual device instances, but there are some practical challenges we face presently, which we will outline towards the end of this presentation.
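To make the DMA mapping step concrete, here is a conceptual sketch of the server side: once a file descriptor describing a guest RAM region has arrived from the client, the server maps it into its own address space so the device emulation can access guest memory directly. The structure and helper names are hypothetical and use plain POSIX calls; this is not the libvfio-user API.

```c
/*
 * Conceptual sketch of server-side DMA mapping: the client has sent a file
 * descriptor plus (offset, size, iova) describing a guest RAM region; the
 * server mmaps it so the emulated device can access guest memory directly.
 */
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>

typedef struct {
    uint64_t iova;     /* guest-physical address the region starts at */
    size_t   size;
    void    *vaddr;    /* where the region lives in the server's address space */
} RamRegion;

static int map_guest_ram(RamRegion *r, int fd, off_t offset,
                         uint64_t iova, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    if (p == MAP_FAILED) {
        return -1;
    }
    r->iova  = iova;
    r->size  = size;
    r->vaddr = p;
    return 0;
}

/* "Direct DMA": the device emulation copies straight out of guest memory
 * instead of sending DMA read/write messages back over the socket. */
static int dma_read(const RamRegion *r, uint64_t iova, void *buf, size_t len)
{
    if (iova < r->iova || iova + len > r->iova + r->size) {
        return -1;                       /* not inside this mapped region */
    }
    memcpy(buf, (char *)r->vaddr + (iova - r->iova), len);
    return 0;
}
```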
With that, I will turn it over to Elena.

Hi, this is Elena, and I will talk about how the client initiates the connection with the server, requests device information, sets up the interrupts, et cetera. The VFIO-user protocol specification defines the series of messages that are exchanged between the client and the server and that encapsulate the VFIO ioctls. When the client sends a command, most of the time it expects the server to reply with a specific message. In many cases, the messages are similar to the data structures used in the ioctl implementation of VFIO.

The VFIO-user client initialization starts with a version message exchange with the server. This serves the purpose of identifying the supported protocol version and allows for backward compatibility through the use of major and minor versions. A JSON string is used to request the capabilities that the server supports. Next, the get-info command is used by the client to request the device description from the server, such as the number of regions and IRQs supported and whether the device supports reset. To get a description of each region, the client queries the server by sending the get-region-info command. The server replies with the read and write permissions, whether the region can be memory mapped, and the region index, size, and offset. When the guest executes load or store operations to an unmapped device region, the client forwards these operations to the server with the region read and write messages. The server replies with data from the device on read operations, or an acknowledgement on write operations. If the device has a region of the VFIO migration region type, it can be accessed with the same region read and write operations, and it can be used to set the device to a specific migration state and to request the current device state.

Next, setting up the interrupts. Two commands are used to accomplish this: get-IRQ-info and set-IRQs. To set up the IRQs, the client asks the server for the number and types of IRQs the device wishes to set up. The server sends a reply with flags set; the flags indicate how the server is able to process the interrupts, whether eventfd signaling is used, whether the mask and unmask operations are supported, and some details about the setup itself. Each message can specify the action that is requested on the set of IRQs in the payload. The action field can be used to mask, unmask, or trigger the interrupt, and it indicates the type of data being transferred. For each interrupt, the eventfd obtained from the kernel is sent over to the server with the set-IRQs command. The server can then signal an interrupt by directly injecting it into the guest via the eventfd provided by the client.

DMA memory may be accessed by the server by sending DMA read and write commands over the socket. The actual direct memory access of client memory from the server is possible if the client provides file descriptors that the server can mmap. The mmap privileges cannot be revoked by the client, so file descriptors should only be exported in environments where the client trusts the server not to corrupt the guest memory. If the server is not trusted, the secure DMA option can be used and the file descriptors will not be exposed to the server.
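Both the interrupt setup and the direct DMA path depend on passing file descriptors across the UNIX socket, which is done with the standard SCM_RIGHTS ancillary-data mechanism. The sketch below sends one eventfd along with a placeholder payload; the real set-IRQs message layout is defined by the vfio-user specification, not by this example.

```c
/*
 * Passing a file descriptor (here, an interrupt eventfd) over a UNIX
 * socket with SCM_RIGHTS.  The payload is a placeholder; the actual
 * vfio-user message layout comes from the protocol spec.
 */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/eventfd.h>

static int send_fd_with_msg(int sock, const void *body, size_t body_len, int fd)
{
    struct iovec iov = { .iov_base = (void *)body, .iov_len = body_len };
    char cmsg_buf[CMSG_SPACE(sizeof(int))] = { 0 };
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = cmsg_buf,
        .msg_controllen = sizeof(cmsg_buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

/* Usage: create an eventfd for an interrupt vector and hand it to the
 * server, which can then trigger the interrupt by writing to it. */
static int share_irq_eventfd(int sock)
{
    int efd = eventfd(0, 0);
    if (efd < 0) {
        return -1;
    }
    const char body[] = "set-IRQs request goes here";  /* placeholder payload */
    return send_fd_with_msg(sock, body, sizeof(body), efd);
}
```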
The dirty-pages command is similar to the VFIO IOMMU one. It is sent by the client to the server in order to control the logging of dirty pages, usually during a live migration. The message flags instruct the server to start or stop the logging for a specific range, or indicate that the client expects the server to return the dirty pages bitmap.
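As a rough illustration of what the server has to maintain for dirty-page logging, here is a conceptual per-range bitmap, one bit per page, that is set on device writes and returned and cleared when the client asks for it. The structure and functions are hypothetical; the real command format and bitmap layout are defined by the vfio-user specification.

```c
/*
 * Conceptual dirty-page log on the server side: one bit per guest page.
 * The server sets a bit whenever the device writes guest memory while
 * logging is enabled; the client fetches and clears the bitmap during
 * live migration.  Illustrative only.
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SHIFT 12               /* assume 4 KiB pages */

typedef struct {
    uint64_t  base_iova;            /* start of the tracked range */
    size_t    npages;
    uint8_t  *bits;                 /* npages bits, rounded up to bytes */
    int       logging;              /* start/stop controlled by the client */
} DirtyLog;

static void dirty_log_start(DirtyLog *log, uint64_t iova, size_t len)
{
    log->base_iova = iova;
    log->npages    = (len + (1ULL << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
    log->bits      = calloc((log->npages + 7) / 8, 1);
    log->logging   = 1;
}

/* Called on every DMA write the device performs while logging is on. */
static void dirty_log_mark(DirtyLog *log, uint64_t iova, size_t len)
{
    if (!log->logging) {
        return;
    }
    uint64_t first = (iova - log->base_iova) >> PAGE_SHIFT;
    uint64_t last  = (iova + len - 1 - log->base_iova) >> PAGE_SHIFT;
    for (uint64_t p = first; p <= last && p < log->npages; p++) {
        log->bits[p / 8] |= 1u << (p % 8);
    }
}

/* Called when the client requests the bitmap: copy it out and clear it. */
static void dirty_log_fetch(DirtyLog *log, uint8_t *out, size_t out_len)
{
    size_t n = (log->npages + 7) / 8;
    memcpy(out, log->bits, n < out_len ? n : out_len);
    memset(log->bits, 0, n);
}
```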
We also did a quick performance test with the fio tool to compare the experimental QEMU multi-process implementation that is currently in upstream QEMU with VFIO-user. The random read/write tests on the SCSI drive emulated by the client and server using the VFIO-user protocol did not show significant performance degradation compared to the one implemented with QEMU multi-process.

For future work, ioregionfd can be used for accelerated handling of PIO and MMIO accesses. The idea was proposed by Stefan Hajnoczi, and the project was started by the intern Elena Afanasova within the Outreachy internship program in 2020. Three RFC versions of the patches were posted, and we are working on addressing the comments from the last review and incorporating that work with ours. VFIO-user will work over any socket type, but the reason we use Unix sockets is that they allow passing file descriptors between the client and the server. Supporting other types of sockets would be less performant, since direct DMA could not be done by the server, and it would require new commands to inject interrupts into the VM instead of using eventfds. At the moment, we only support PCI devices, but other buses can be added in the future. One issue with the PCI bus going forward is that its configuration is hard-coded into the platform model on x86 systems; this is the subject of future discussions. bdrv_inactivate_all currently deactivates all block devices, and we look to enhance this to provide a selective capability for live migration of individual devices.

Now let's see VFIO-user in action, where Jag will demonstrate it in a demo.

Welcome to the demonstration of VFIO-user in QEMU. We want to explain how to launch a VM with PCI devices running in a separate process, explain the various command line options we added, and show the migration of devices in this setup. It is much more convenient to launch the VM using an orchestrator or a launcher script, but we do it manually to understand the QEMU command line better. We will first launch the server so it can prepare itself to process commands from the client, which we will launch subsequently.

Here is the command line to start the server. As you can see, it uses the x-remote machine type to launch the server. If we try to launch it with any other machine type, the operation will fail. For example, if I specify a PC machine type, the server will say its VFIO-user object is only compatible with the x-remote machine type, and it will error out. The /tmp/remote-sock1 path is the named socket that VFIO-user will use as the communication channel. The server will create and open the socket and listen on it once it has finished initializing. One of the steps involved in the initialization phase is to verify that the device coupled with the socket is a PCI device. The VFIO-user object rejects any other type of device, as I will show right now. For example, if I specify drive two, which is a SCSI HD device, as one of the devices to be bound to the channel, let's see what happens. We get an error saying drive two is not a PCI device and as such is not supported. I hope you won't mind me deleting the sockets here between runs; ideally, the orchestrator would take care of it.

We will present two LSI controllers to the guest, each of them having its own drive. The first controller has a 10-gigabyte drive and the second one has a 5-gigabyte drive, as you can see here. The first one is the ol7-test qcow2 image, the 10-gigabyte drive; it is connected to LSI 1. The second one is ol7-test-2, and it is connected to LSI 2. Let's finish launching the server.

Once the server is up and running, we can launch the client using its command line. As you can see, we provide the same sockets to the vfio-user-pci device on the client side. Let's wait for the guest to boot up. The first command we can try is lspci, to see if both the LSI devices show up. These are the two LSI devices. And let us see if the block devices show the 10-gigabyte and 5-gigabyte drives.

Now let's move on to the next phase of the demo, where we will show how we perform migration. The process for launching the destination for migration is very similar to the source, except that we specify that the server should wait for an incoming connection. The socket provided on the server end of the destination is presently a dummy socket; it serves no function and does not receive any data, so the 445 port here serves no purpose. Once we have launched the server on the destination side, we need to do the same for the client. The client has a very similar command line to the source, except that it uses new sockets, as remote-sock1 and remote-sock2 are already used by the source and cannot be reused. We specify that the destination listen for data on port 4444. So as you can see, the source is still operational, whereas the destination is still waiting. And we can take a quick look at the processes that are running on the host and see four separate QEMU processes.

Now let's initiate migration by logging on to the source's monitor. We can instruct the source to start migration using the migrate HMP command. Let's go back to the source console to see if there's any activity. As expected, it is hung; it has essentially stopped. And now if we go back to the destination, we can see that it is working. Let's see if the devices show up: we can still see the 10-gigabyte drive and the 5-gigabyte drive, and both the LSI controllers show up on the PCI bus. And if we quit the HMP socket, the source should exit as expected, and then the processes on the host should just be the destination client and the destination server. I hope you enjoyed the demonstration. Please feel free to ask any questions.