Hello, and welcome to this session. Today I will introduce our team's work on the design and implementation of a device keep-alive state, which can be used to implement local live migration and VMM fast restart with pass-through device support. We will first give an overview of the design and then dive into some major details. First, let's recap the problem. System updates are a pain point for cloud vendors because they usually take a long time, so customers see more service downtime. The existing solutions can be divided into several categories. The first choice is to move the VMs somewhere else so the system can be updated; the typical solution in this category is live migration. However, live migration doesn't support pass-through devices very well. The other choice is to keep the VM local. One option then is to live-patch the kernel. That is good for small fixes, but big kernel changes raise the failure rate. Our focus is the final category: update the relevant components, separately or as a whole, on the local system. Examples in this category include the live update proposal from Oracle, and Alibaba also had a paper about updating the KVM module. Looking into these solutions, two questions need to be answered. First, do we allow pausing the VM? And second, do we allow rebooting the host? If we allow pausing the VM, then pass-through devices and kernel updates can be supported naturally, because pass-through devices can be suspended within the VM and then resumed when the VM restarts. If we don't allow pausing the VM, which means we can't require VM cooperation, then we have two further choices. First, don't allow a host reboot. Then we can't update the kernel; we can only update the userland VMM, typically QEMU. Here we still need to pass the state of the pass-through device to the resumed QEMU, and we also need to pass the guest memory mapping across the restart.
Steve from Oracle also has a proposal for this case. VMM fast restart targets the final category, in which we don't allow pausing the VM but we do allow a host reboot. This is the most flexible solution: both the kernel and QEMU can be updated. Here we can leverage kexec reboot to shorten the reboot time. We also want to support pass-through devices, and there is a special situation here: during the reboot window, the pass-through device has no owner. Since we can't rely on the guest driver to suspend the device, the choice we have is to keep the device alive across the reboot. So here comes our proposal and implementation. We introduce a device keep-alive state for pass-through devices. We will talk about the overall idea of the proposal and then dive into some major details. As shown in the top right corner of the slide, we introduce a flag in the core device data structure to denote that a device is in the keep-alive state. What does putting a device into the keep-alive state mean? It means the device hardware is still alive, although it has no owner. It may continue to issue DMA and IRQs. However, the host software must not modify the hardware state of the device, and it can't bind the device to any other driver either. On the other hand, the device's software state, which is managed by drivers, must be saved at one of two points if it can't be restored without clobbering hardware: either when the device enters the keep-alive state, or when the host reboots. Which point applies depends on whether the state is affected by QEMU runtime operations, or whether it is configured and maintained by the kernel. The whole kexec reboot procedure consists of two incremental stages. Stage one is related to QEMU. When QEMU starts, it opens the pass-through device; when QEMU quits, it closes the device. This causes the device to be enabled and disabled, thus changing its hardware state.
So if we want to keep the device alive across the restart, we need to put it into the keep-alive state before QEMU quits. This is the stage-one keep-alive state. Stage two is related to the kernel reboot. For some devices, such as the IOMMU, the driver maintains its own state as long as the kernel is running; but when the kernel is about to kexec reboot, that state needs to be preserved so it can be handed over to the new kernel. This is the stage-two keep-alive state. Here we can see that the stage-one keep-alive state can be used to implement QEMU live update; this is an alternative to fd passing over exec. It can also be applied to implement local live migration. This slide shows the example QEMU commands for implementing VMM fast restart. We put the guest memory in a DAX device, which is DRAM-emulated persistent memory, and turn on its share property. In this way, QEMU can pick up the guest memory from the persistent memory after the kexec reboot. The QEMU command "migrate_set_capability x-ignore-shared on" makes the savevm command ignore guest memory regions that have the share property turned on, so we don't need to copy the guest memory into the snapshot. The QEMU "set keep alive" command is a newly added command that puts all the pass-through devices into the keep-alive state. It also specifies a UUID token. When QEMU restarts after the kexec reboot, it needs to pass this token in the VFIO device parameter so that the kernel can verify that the resumed QEMU has permission to own the pass-through devices. After the resumed QEMU loads the snapshot, we issue the "set keep alive off" command to clear the keep-alive flag for all the pass-through devices. After that, the pass-through devices start working as usual. As shown in this picture, many data structures, or software states, are involved in the lifetime of the VM. To implement VMM fast restart, we need to figure out which of them need to be preserved and which we can recreate across the kexec reboot.
There are some rationales to help determine this. First, does the state need to be saved at all? If it is pure software state, meaning it doesn't depend on hardware state, or if it can be restored by reading back the hardware registers, then we don't need to save it. If we could only restore it by clobbering hardware registers, then we need to save it. All the gray states in the picture are ones we don't need to save, because we can reconstruct them without clobbering hardware registers. Second, we don't want the resulting state-saving code to be too intrusive to other kernel components. Pass-through devices are managed by the VFIO driver, so it is reasonable to put the major device keep-alive management logic in the VFIO layer, so as not to touch other components too much. Third, if we do need to save a state, which stage does it belong to? Is it manipulated by QEMU runtime operations, or will it only be destroyed by the kernel reboot? For example, the VFIO PCI device data structure has hardware-dependent state: the underlying PCI device is disabled when QEMU quits and enabled when QEMU restarts, so it belongs to stage one. In later slides we will look into these keep-alive states in more detail. Now let's dive a little deeper to see how we keep alive the two major device states, IRQ and DMA. The challenge for keeping IRQs alive is that during the kexec reboot, neither hardware nor software is available to handle them: the CPU is undergoing reboot and re-initialization, and the software is not ready either. There are two options to keep IRQs alive. The first is to mask the IRQ when the device is put into the keep-alive state and unmask it when the VM resumes. In this way, the device holds off issuing IRQs during the restart period. The problem with this approach is that MSI masking is an optional feature for PCI devices, which means some devices don't support it. The other approach is to leverage posted interrupts, which is what we choose here.
Posted interrupts don't depend on MSI or MSI-X, so this approach is more generic and has more coverage. The IRQ setup and teardown paths start from the VFIO PCI device layer and go through the PCI core and IRQ core layers down to the interrupt remapping driver, which allocates or frees the IRTE, the interrupt remapping table entry. Meanwhile, the KVM side allocates a posted interrupt descriptor, the PID, for the vCPU. It is connected to the specific device interrupt vector via the IRTE. So here we basically have three things to preserve: the PID, the IRTE, and the device interrupt vector index. The device interrupt vector index is not shown in this picture; it is saved in the QEMU snapshot. Next, we will talk about how we save the PID and the IRTE. There are two options for notifying the KVM side to preserve the PID. One is to introduce an ioctl command that QEMU can issue to the KVM side to save the PID. The other is to leverage the IRQ bypass mechanism, which is currently used by the VFIO driver to notify the KVM side to enable or disable posted interrupt mode. We introduce two callback interfaces in the IRQ bypass consumer data structure. When the device enters the keep-alive state, the save-consumer callback is invoked from the VFIO side. It eventually triggers a newly added callback in kvm_x86_ops, which saves the PID and sets the suppress-notification bit in the PID. For the IRTE, there are again two options for preserving it. As we mentioned, setting up or tearing down an IRQ is a long code path: it starts from the VFIO driver, goes through the PCI core and IRQ core layers, and finally arrives at the interrupt remapping driver. If we wanted to make the PCI core and IRQ core layers aware of the keep-alive saving and restoring, we would need to introduce APIs and quite a lot of code changes into these two core kernel layers, which may be too intrusive. So we choose another approach, in which we reuse most of the IRQ setup and teardown code path and just check the device keep-alive flag in the interrupt remapping driver.
Concretely, if the device is in the keep-alive state, we don't free the IRTE when the IRQ is torn down. Instead, we save the IRTE aside. We also record the mapping between the IRTE and the IRQ vector index within the device, so they can be reconnected when the device's IRQ vector is set up again. In this way, we introduce less intrusive code changes in all the involved layers. For preserving DMA state, we need to preserve the DMA page table, the domain ID, and various IOMMU configurations, for example the root table of the IOMMU, the context table, and so on. There is a dilemma for us about whether to preserve the IOMMU domain or not. The IOMMU domain data structure is a software data structure that can be recreated without clobbering hardware state. If we recreate it, we have quite a lot of code changes in the IOMMU driver; on the other hand, if we preserve it, most of the code changes are in VFIO. Then, do we want to consider other device pass-through frameworks such as vDPA? It looks more reasonable to let the IOMMU driver do more than to have both vDPA and VFIO duplicate the effort. Currently our POC preserves the IOMMU domain, and we leave this issue as an open. Device ownership authentication is a security issue we need to consider, because when a pass-through device is put into the keep-alive state, it is detached from its owner. When the resumed VM tries to reattach to the device, there must be a mechanism to verify ownership. We leverage the VF token mechanism, an existing feature of the current VFIO driver, to do the job. A token is set into the VFIO device when it is put into the keep-alive state. Then, when QEMU restarts, it needs to pass the token to the kernel VFIO driver in order to reopen the pass-through device. The kexec reboot procedure also needs some modifications.
We introduce a keep-alive callback notified before the kexec reboot happens, where the stage-two keep-alive states can be preserved and all the keep-alive states can be copied to persistent memory to pass to the new kernel. After the new kernel starts booting, all the keep-alive states are copied back from persistent memory so that device state can be restored. The pass-through device list is another important piece of information that needs to pass to the new kernel, so the new kernel can identify the keep-alive devices and do special handling during PCI enumeration. We also need a memory handover mechanism to pass all this data from the old kernel to the new kernel, and again Oracle has a proposal for this. For keep-alive devices, the PCI enumeration procedure needs special handling. Basically, we can't re-initialize the device; instead we need to restore its state from the data passed from the old kernel. Meanwhile, we can't re-assign PCI BAR resources to the keep-alive devices; instead we need to inherit the resources from the old kernel, which are already recorded in the BAR registers of the devices. So far we have talked about how we handle the many issues we encounter for VMM fast restart. However, we still have many opens. First, we still can't make the keep-alive flag transparent to the PCI core code. We check the device keep-alive flag in the MSI/MSI-X IRQ setup and teardown code paths to avoid clobbering the hardware MSI/MSI-X registers. Do we want to check the keep-alive flag in all PCI core code paths, since we already introduced this flag? Would that be too intrusive to the PCI core code? What about a PCI enumeration failure because of a resource conflict? How do we notify QEMU about that? And since all the dependent devices along the I/O paths also need to be kept alive, how do we handle the state of those devices, for example switch ports and root ports? They may have registered IRQs for different PCIe capabilities.
For those switch and root ports, can we just disable them when the keep-alive operation starts, read the events back from their status registers after we reboot, and re-inject those events into the guest? And what about SR-IOV and SIOV support? The PF device state also needs to be preserved; how do we do that? Currently, we have finished the POC of both QEMU fast restart and the full VMM fast restart. In our test environment, a server platform with an Intel NIC card, YouTube video streaming and SCP workloads in the VM can be restored after the kexec reboot. We hope this effort can go upstream, so we'd like to have your comments and suggestions, and cooperation is welcome. Thanks, everyone. Any questions?