Hi folks, I'm Steve Sistare, and I work on Linux kernel and virtualization features at Oracle with my colleagues Anthony Yznaga and Mark Kanda. Welcome to our presentation of QEMU Live Update.

Live update is a method to update QEMU to a new version while keeping the guest alive. It has minimal impact on the guest: the guest pauses briefly, for about 100 milliseconds. The update is transparent to guest clients; they suffer no loss of connectivity to the guest and only experience the brief pause. The method supports SR-IOV without guest cooperation, so there is no restriction on the guest operating system. We do this to enable critical bug fixes and security mitigations in a timely manner, keeping our guests safe without requiring them to reboot. And because we update to an entirely new version of QEMU, we enable new features as well.

Live migration can be used to achieve the same results but is more resource intensive. It ties up the source and target hosts, decreasing fleet utilization. It consumes memory and network bandwidth, impacting the performance of other processes on the guest and the host. The duration of the impact is indeterminate, as it depends on when the copy phase converges. Lastly, live migration is prohibitively expensive if large local storage must be copied across the network to the target.

Live update is based on the following design elements. The old QEMU process execs the new QEMU binary, allowing various aspects of the execution environment to be carried forward into the new process. Guest memory is preserved in place in RAM, so DMA operations may safely continue. External descriptors are kept open across the exec, preserving connectivity; this includes, for example, serial consoles, the QEMU monitor, VNC sessions, pseudo-terminals, and vhost devices. VFIO device descriptors are preserved, which keeps them alive. However, the KVM descriptor is closed, which destroys the KVM instance, cutting the cord between KVM and the VFIO kernel state. Lastly, the QEMU backend device state is serialized and saved to a file. These elements are orchestrated by two new QEMU monitor interfaces, cprsave and cprload, where CPR stands for checkpoint and restart. QMP and HMP versions of each are provided.

Live update has been a hot topic this year, and you may notice some overlap between our work and others. However, we've been working independently in this area for quite some time, and I believe we're the first to submit our patches to the community.

To preserve guest memory in place, we propose an extension to the madvise system call, MADV_DOEXEC. This preserves the mappings in an address range across exec, at the same virtual address. It works for memory created with mmap MAP_ANONYMOUS, which would otherwise disappear after exec. The exec'ed binary must explicitly allow incoming mappings via an ELF note; this prevents unexpected sharing of content across the exec. The implementation is straightforward, about 300 lines of kernel code, and most of that is for reading and checking the ELF note. The madvise call sets a new "keep" flag on the VMAs that span the range, splitting them if necessary. exec copies the marked VMAs from the old mm to the new, almost exactly like VMA duplication in fork. For details, see the MADV_DOEXEC kernel patches that Anthony and I submitted. When MADV_DOEXEC is used for QEMU, the DMA mappings remain valid at all times; DMA activity from posted requests continues even while the guest is paused, and it is safe to translate IOVA to virtual address and page throughout the transition, so asynchronous kernel threads may safely create and access the DMA regions. Sketches of the madvise calling sequence and of the opt-in ELF note follow.
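To make this concrete, here is a minimal sketch of the calling sequence, assuming a helper named preserve_and_exec that is not part of the actual patches. MADV_DOEXEC is our proposed advice value and is not in mainline kernels, so the numeric fallback below is purely a placeholder, and the QEMU binary path is illustrative.

```c
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_DOEXEC
#define MADV_DOEXEC 22  /* placeholder; the real value comes from the patch series */
#endif

/* Mark guest RAM so exec carries its mappings into the new image at the
 * same virtual address, then exec the new binary. */
static int preserve_and_exec(void *ram, size_t len,
                             char *const argv[], char *const envp[])
{
    if (madvise(ram, len, MADV_DOEXEC) < 0) {
        perror("madvise(MADV_DOEXEC)");
        return -1;
    }
    /* The new binary must opt in to incoming mappings via an ELF note,
     * otherwise the kernel discards the preserved VMAs. */
    execve("/usr/bin/qemu-system-x86_64", argv, envp);
    perror("execve");
    return -1;
}
```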
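And here is one way the new binary could embed the opt-in ELF note at build time. The note section name, note name, type, and descriptor below are assumptions for illustration; the real values are defined by the MADV_DOEXEC patch series.

```c
/* Emit an ELF note marking this binary as willing to accept preserved
 * mappings across exec.  Standard ELF note layout: namesz, descsz, type,
 * NUL-terminated name padded to 4 bytes, then the descriptor. */
__asm__(
    ".pushsection .note.doexec, \"a\", @note\n"
    "  .balign 4\n"
    "  .long 2f - 1f\n"      /* namesz */
    "  .long 4f - 3f\n"      /* descsz */
    "  .long 1\n"            /* note type (hypothetical) */
    "1: .asciz \"DOEXEC\"\n" /* note name (hypothetical) */
    "2: .balign 4\n"
    "3: .long 1\n"           /* descriptor: opt in */
    "4: .balign 4\n"
    ".popsection\n");
```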
QEMU saves the address and length of the preserved memory regions in environment variables tagged with the name of the region. In the example at right, the pc.ram region is remembered in the environment, with values for both the address and the length. After exec, QEMU looks for variables of this form and retrieves the address. The address is attached to the new KVM instance via the KVM_SET_USER_MEMORY_REGION ioctl. The first time we did this, we were surprised to find that the ioctl time rose linearly with page count, adding hundreds of milliseconds for larger memories. Anthony investigated and found it to be an accident of the implementation, easily fixed. He eliminated the linear cost with the following kernel patch, which is available in kernel 5.8.

To support VFIO devices, we preserve their descriptors across exec, which preserves the kernel state of the devices. After exec, QEMU finds the descriptors and rebuilds the data structures that represent each device. The PCI BAR and config memory regions are accessible via the VFIO device fd; after exec, QEMU maps the BARs and re-reads the config. The DMA mappings are kept alive by preserving the IOMMU group fd and the container fd. The interrupt state is captured by the eventfds and the MSI-X data. Eventfds are created and preserved for the err and req IRQs, and for an MSI-X IRQ per vector. The MSI-X table and pending bit array are saved to and restored from the vmstate file.

The values of the descriptors are saved in the environment. The box on the right shows all the fds saved for one VFIO device. The names are not pretty, but they completely describe and identify each descriptor. For example, the highlighted entry is the KVM irqchip notifier for vector 0 of device 3a:10, descriptor number 163. After exec, QEMU attaches the VFIO descriptors to the new KVM instance using the appropriate ioctls shown here. The required code changes to achieve all this are surprisingly small. Wherever a VFIO descriptor is created, we check the environment and use that value instead, then execute the existing code paths. We remember that the fd is reused and skip any ioctls that would reconfigure the device. We have tested this with interrupts delivered to QEMU, to the kernel KVM irqchip, and posted directly to the guest; all work robustly across the update operation.

To handle other QEMU device state, we leverage the vmstate framework that live migration uses. We modified the code so that the save and restore handlers can be selected based on the operation, such as CPR versus snapshot versus migration. Objects are serialized to an ordinary file, not a socket as in live migration, and not to a qcow2 snapshot. This allows us to support a variety of image formats for guest boot devices. However, because the block devices are not snapshotted, one must not modify the blocks between the save and restore. The save file is small, less than one megabyte, and writing it is very fast, adding little to the guest pause time. The sketches below illustrate how RAM regions and descriptors are carried through the environment and re-attached to the new instance.
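As a sketch of the environment-variable handoff described above: before exec we record each region's address and length under its name, and after exec we retrieve them and re-register the memory with the new KVM instance. The variable name format is illustrative and error handling is abbreviated; the exact names in our patches may differ.

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Before exec: remember a RAM region, e.g. name = "pc.ram". */
static void save_ram_region(const char *name, void *addr, size_t len)
{
    char key[128], val[32];

    snprintf(key, sizeof(key), "QEMU_RAM_ADDR_%s", name);
    snprintf(val, sizeof(val), "%p", addr);
    setenv(key, val, 1);

    snprintf(key, sizeof(key), "QEMU_RAM_LEN_%s", name);
    snprintf(val, sizeof(val), "0x%zx", len);
    setenv(key, val, 1);
}

/* After exec: retrieve the region and attach it to the new KVM instance
 * via the KVM_SET_USER_MEMORY_REGION ioctl. */
static int restore_ram_region(int vm_fd, const char *name, uint32_t slot,
                              uint64_t guest_phys)
{
    char key[128];

    snprintf(key, sizeof(key), "QEMU_RAM_ADDR_%s", name);
    uint64_t addr = strtoull(getenv(key), NULL, 16);

    snprintf(key, sizeof(key), "QEMU_RAM_LEN_%s", name);
    uint64_t len = strtoull(getenv(key), NULL, 16);

    struct kvm_userspace_memory_region mr = {
        .slot = slot,
        .guest_phys_addr = guest_phys,
        .memory_size = len,
        .userspace_addr = addr,  /* same VA, preserved by MADV_DOEXEC */
    };
    return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &mr);
}
```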
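Descriptor preservation works the same way. This sketch, again with an illustrative variable name format, clears the close-on-exec flag so the fd survives exec, records its value, and lets creation sites in the new QEMU reuse it instead of opening the device again.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

/* Before exec: keep a VFIO (or other) fd open across exec and record it. */
static void preserve_fd(const char *name, int fd)
{
    char key[128], val[16];

    /* Without this, exec would close the descriptor and destroy the
     * kernel state behind it. */
    fcntl(fd, F_SETFD, fcntl(fd, F_GETFD) & ~FD_CLOEXEC);

    snprintf(key, sizeof(key), "QEMU_VFIO_FD_%s", name);
    snprintf(val, sizeof(val), "%d", fd);
    setenv(key, val, 1);
}

/* After exec: creation sites first check the environment and reuse the
 * preserved descriptor, skipping ioctls that would reconfigure the device. */
static int find_fd(const char *name)
{
    char key[128];
    snprintf(key, sizeof(key), "QEMU_VFIO_FD_%s", name);
    const char *val = getenv(key);
    return val ? atoi(val) : -1;
}
```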
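For the interrupt side, a preserved MSI-X eventfd can be re-attached to the new KVM instance with the KVM_IRQFD ioctl, along these lines. This is a minimal sketch; the real code also sets up irq routing and handles errors.

```c
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Re-attach a preserved eventfd to the new KVM instance so MSI-X
 * interrupts from the device continue to reach the guest. */
static int reattach_irqfd(int vm_fd, int eventfd, unsigned int gsi)
{
    struct kvm_irqfd irqfd = {
        .fd = (unsigned int)eventfd,  /* eventfd kept open across exec */
        .gsi = gsi,                   /* guest irq routing entry for this vector */
    };
    return ioctl(vm_fd, KVM_IRQFD, &irqfd);
}
```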
The cprsave command puts it all together. You specify the file for saving state and a mode argument; the mode is the keyword restart for live update, and I will show another mode shortly. QEMU pauses the guest VCPUs and saves device state to the file. It calls madvise MADV_DOEXEC for all RAM segments, such as main memory, video RAM, and others. It clears the close-on-exec flag for VFIO and other descriptors and remembers their values in the environment. It destroys the old KVM instance and execs the new QEMU. However, if /usr/bin/qemu-exec exists, we exec that instead. A site may provide this binary to customize the update procedure by changing the QEMU binary path, changing the argv, or modifying the execution environment; we use it to run QEMU in a container environment, for example. qemu-exec finishes by exec'ing the new QEMU binary.

The new QEMU starts and creates a new KVM instance. It finds and reuses the RAM segments, finds and reuses VFIO and other descriptors, and attaches VFIO to the new KVM instance. QEMU is now in the prelaunch state. The management layer now has the opportunity to send device_add commands that supplement the devices defined by the argv; this is why the update is divided into cprsave and cprload phases. cprload is fairly simple: it loads device state from the file and continues the VCPUs. The guest is running again, controlled by a new version of QEMU. The pause time is about 100 milliseconds, and that is on a four-year-old Xeon processor. We've not spent any time profiling this, and I expect that with optimizations and a recent processor it could be quite a bit faster.

Here's an example using the interactive monitor. In window 1 on the left, we start QEMU; the status command shows that the guest is running. In window 2 on the right, we use yum to update QEMU on disk. This does not affect the running QEMU process, and the guest is still live and running. On the left, we issue the cprsave command. It execs the new QEMU binary and returns; status shows the VM is in the prelaunch state. To finish, we issue cprload; the guest resumes, and the update is complete.

Now for a short demonstration. In this demo, I run a script that updates QEMU in our container environment and issues the monitor commands. The host is on the left and is running a single instance of QEMU version 1, which is shown by querying the monitor. The guest is on the right, and I start a counting program there; this will show that the guest is live throughout the demonstration. Now I type update commands on the host. The prepare command mounts the new QEMU version. Suspend stops the guest and execs the new version; our count stops. Resume continues the guest, and our count resumes. Now we are running the new QEMU version, and the update is complete. Now let's do it all with a single restart command. Watch the count: it barely hiccups. Now let's go back and forth between the old and new versions repeatedly, to show how robust the update feature is. For each update, the guest pause time is measured and printed, and it is about 100 milliseconds. We are updating back and forth continuously, but the guest marches on. Sketches of the exec step and of an interactive session follow.
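Here is a minimal sketch of the exec step at the end of cprsave, including the /usr/bin/qemu-exec hook mentioned above. The QEMU binary path is illustrative.

```c
#include <unistd.h>

/* Exec the new QEMU, or a site-provided wrapper that may adjust the
 * binary path, argv, or environment before launching it. */
static void cpr_exec(char *const argv[], char *const envp[])
{
    if (access("/usr/bin/qemu-exec", X_OK) == 0) {
        execve("/usr/bin/qemu-exec", argv, envp);  /* wrapper execs new QEMU */
    }
    execve("/usr/bin/qemu-system-x86_64", argv, envp);  /* illustrative path */
    _exit(1);  /* only reached if both execs fail */
}
```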
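Putting the pieces together, the interactive monitor session described above might look like the following. The command syntax approximates our proposed HMP interface; the argument order, file path, and exact output are illustrative.

```
(qemu) info status
VM status: running
(qemu) cprsave /var/tmp/vm1.sav restart
(qemu) info status
VM status: prelaunch
(qemu) cprload /var/tmp/vm1.sav
(qemu) info status
VM status: running
```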
I've shown you how to update QEMU using cprsave, which implies you must start with a recent version that supports the command. However, we can also update a legacy version by dynamically injecting code that performs the equivalent of cprsave. The vmsave shared object provides the injected code. It accesses QEMU data structures and globals, such as the list of RAM handlers, the list of vmstate handlers, character devices, and vhost devices, to name a few. However, dlopen does not resolve the addresses of these globals when vmsave is loaded, because QEMU is loaded with the RTLD_LOCAL flag, so we wrote code to find the addresses by looking them up in the symbol table. The injected code deletes some vmstate handlers, such as those specific to live migration, and registers a new handler for VFIO. It calls madvise MADV_DOEXEC on guest memory. It finds devices in the lists and preserves their descriptors.

To invoke the code, we hot-patch over the text of a QEMU monitor function by writing to /proc/<pid>/mem, then call the monitor function using an already-created monitor socket. The code patch is small: it dlopens vmsave and then calls its entry point. vmsave runs, saves state, and execs the new QEMU. The new QEMU includes live update, so we simply call cprload to finish the update and resume the guest.

The vmsave library has binary dependencies on QEMU data structures and variables, so we build a separate vmsave library for each legacy QEMU version, indexed by GNU build ID. We extract the build ID from the running QEMU process and inject the matching vmsave object. The technique is fast and reliable: the guest pause time is roughly the same as for cprsave, and we've successfully tested updates from QEMU 2.x to 3.x and 3.x to 4.x. It's pretty cool. Sketches of the hot-patch write and the build-ID lookup appear at the end of this section.

Many critical fixes can be applied by updating only QEMU, but if you need to update the host kernel, we have a method for doing so in the CPR framework. The mode argument in this case is reboot: after cprsave, you kexec boot the new kernel and issue cprload. The guest pause time is longer, but connections from the guest kernel to the outside world survive the reboot. The guest RAM must be backed by a shared memory segment, and the segment is preserved across the kexec reboot by Anthony's PKRAM kernel patches. The pages of the segment are visited to find the PFNs that must be preserved. The 8-byte PFNs are packed onto free pages, those pages are linked together, and the head of the list is passed across the kexec. Early in boot, the PFNs are mapped to pages, and those pages are removed from the free list. The shmem inode is recreated, and the pages are attached to the file's address space. Page visitation and reclaim are parallelized for speed, and hundreds of gigabytes can be preserved in less than one second. See the PKRAM patch series for more details.

We support SR-IOV devices if the guest provides an agent that implements suspend-to-RAM, such as qemu-ga. The update sequence starts with suspend-to-RAM, followed by cprsave, kexec reboot, and cprload, and finishes with a system wakeup. During suspend-to-RAM, the guest device drivers flush posted requests and re-initialize the devices to a reset state, the same state reached after the host reboots. Thus, when the guest resumes, the guest and the host agree on the device state.
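For the hot-patch step described earlier, the injector's write to /proc/<pid>/mem might look roughly like this. The patch bytes and target address are resolved elsewhere, and the caller needs ptrace permission on the target process; all of that is abbreviated here.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Overwrite the text of a monitor function in a running QEMU with a stub
 * that dlopen()s vmsave and calls its entry point. */
static int hotpatch(pid_t pid, off_t func_addr,
                    const unsigned char *patch, size_t len)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t n = pwrite(fd, patch, len, func_addr);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}
```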
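And for selecting the matching vmsave library, the build ID can be read from the running process's binary. A minimal sketch, assuming the ELF program headers and notes fit in the first 64 KB of the file and abbreviating bounds checks:

```c
#include <elf.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the GNU build ID of a running process as a hex string, so the
 * vmsave library built for that exact QEMU binary can be chosen. */
static int read_build_id(pid_t pid, char *out, size_t outlen)
{
    static unsigned char buf[65536];
    char path[64];

    snprintf(path, sizeof(path), "/proc/%d/exe", (int)pid);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t sz = read(fd, buf, sizeof(buf));
    close(fd);
    if (sz < (ssize_t)sizeof(Elf64_Ehdr))
        return -1;

    Elf64_Ehdr *eh = (Elf64_Ehdr *)buf;
    Elf64_Phdr *ph = (Elf64_Phdr *)(buf + eh->e_phoff);

    for (int i = 0; i < eh->e_phnum; i++) {
        if (ph[i].p_type != PT_NOTE)
            continue;
        unsigned char *p = buf + ph[i].p_offset;
        unsigned char *end = p + ph[i].p_filesz;
        while (p + sizeof(Elf64_Nhdr) <= end) {
            Elf64_Nhdr *nh = (Elf64_Nhdr *)p;
            unsigned char *name = p + sizeof(*nh);
            unsigned char *desc = name + ((nh->n_namesz + 3) & ~3u);
            if (nh->n_type == NT_GNU_BUILD_ID &&
                nh->n_namesz == 4 && memcmp(name, "GNU", 4) == 0) {
                /* Format the descriptor bytes as a hex string. */
                for (unsigned int j = 0; j < nh->n_descsz && 2 * j + 2 < outlen; j++)
                    snprintf(out + 2 * j, 3, "%02x", desc[j]);
                return 0;
            }
            p = desc + ((nh->n_descsz + 3) & ~3u);
        }
    }
    return -1;  /* no build ID note found */
}
```

The resulting hex string could then index a per-version directory of vmsave builds; the directory layout is a deployment choice, not something the talk specifies.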
That's a wrap on our work to date. In the future, we would like to merge with Intel's VMM fast restart work, which keeps SR-IOV devices alive and configured across reboot; this is faster than suspend-to-RAM and eliminates the guest agent. See Jason's concurrent KVM Forum presentation. The MADV_DOEXEC extension works great, but our upstream reviewers are not keen on accepting it, so I'm pondering alternate solutions. I'm considering standard system calls to re-attach to guest memory after exec, such as shmat or mmap of a memfd, plus new ioctls to tell VFIO that the virtual address is changing. Stay tuned on that. Lastly, we received great feedback from our QEMU reviewers on version one of the live update patches, and we're busily working on version two. We very much look forward to making live update available to the community. Here are some references to follow our work if you are interested. I hope you have enjoyed this talk, and thanks for attending. I'll now take your questions over chat.