Okay, I think we can get started now. Hello everyone, welcome to this session about live update. My name is Fan, and I'm from ByteDance. Today I'd like to talk about our work on live updating the KVM host kernel, specifically how we handle the IOMMU state during a kexec reboot, which is an interesting problem when we have VFIO PCI passthrough devices attached to our virtual machines.

This talk has three parts. First is the introduction: we will look at the background of the problem and the existing techniques, with a simple comparison, and we will also review VFIO PCI in order to understand how we can make it work with live update. In the next part, we will see how we do live update with VFIO PCI devices and what changes are needed: we will discuss which parts of the process are stateful and how we preserve their state, which involves both the device state and the IOMMU page tables, as well as, of course, the guest memory. Finally, we will take a look at the proof-of-concept work and the future plan.

Starting with the introduction. What we are trying to do here, the main motivation, is updating the running software on the hypervisor host, including the VMM and the kernel. As developers of a private or public cloud, we obviously want to keep the production environment up to date, delivering the highest performance and staying free from security or functional bugs. But that is a difficult problem, because for a complex system like a hypervisor it is hard to just write perfect code in the first place, deploy it on the machine, and hope it will run forever until the hardware dies. Something will always come up: high-priority security issues to fix, functional bugs that affect the customer experience, or simply new features or performance improvements we want to introduce on the running systems.

Thanks to the nature of virtualization, this is not impossible, as we know. There are two well-known ways to reduce the cost of such upgrades, both in engineering effort and in cost on the customer side: live migration and live update. Let's see how they fit the problem.

Starting with live migration: we all know that live migration moves a guest from one host to another. This is a powerful approach because it can address both hardware and software issues — we can obviously migrate from broken hardware to healthy hardware, or from an old hypervisor version to a new one. But in practice it is tricky to do well, because saving and loading the VM state and copying it over the network to another machine is usually a heavy operation, as we have seen in many other talks: memory is streamed between the two machines, virtual block devices need to be shared or mirrored over, and there may be complex network data-plane coordination just to keep the VPC working after migration. So there are always a few challenges associated with live migration. One is the resources and the time needed for the operation: you need roughly twice the resources allocated, and the time it takes to move one VM can be substantial.
If the guest is under high load, there is also the convergence problem; this can be addressed in different ways, but as soon as we enable post-copy, error recovery can become tricky as well.

Live update, on the other hand, is different, but it can also be seen as a special case of live migration. On the previous slide, live migration moved a VM from one host to another; with live update the VM stays on the same host but moves from one slot to another. As you can see in the picture on the right, there is a new slot with a higher version of the VMM, so while we do the live update we are effectively moving to a higher VMM version — we can live update whatever we want.

The important difference between live update and live migration is that live update lets us avoid a lot of data copying in the process, by taking advantage of the fact that both slots run on the same host, so they can access the same set of hardware and software resources. For example, the guest memory pages can be shared or handed over between the old and the new VMM processes. Similarly for virtual storage, since the new VMM can simply access the same storage; the same applies to virtual networking, and so on. Much less data movement is needed to finish the live update operation. Therefore live update is considered a much more efficient approach and can be very beneficial for downtime, which is a big deal for customers. And we don't need nearly as many extra resources allocated in the process as temporary buffers or transitional state.

But live update is not perfect either; it also comes with a few challenges. First of all, in order to avoid allocating extra resources on top of what is already running, we try hard to hand resources over between the old slot and the new one. The details of each resource type influence how we implement this, and it really depends on the choice of configuration and backend: your virtual network implementation, your guest memory allocation, your passthrough device state. All of this adds to the complexity of implementing live update in production. And there is always a big question mark over whether we can live update the host kernel as well while making this happen — putting all of this together can make things more complicated.

To find the answer to the question about live updating the host kernel, let's first look at how live update currently works in a few different approaches. This slide shows a very simple, naive implementation: we use the existing live migration facilities in QEMU, with a file as the migration destination. This is very straightforward and robust — as long as the migrate command works, you now have your state on disk, and you can do anything, including rebooting the physical machine into a new kernel or even using a different QEMU version, before you initiate the resume operation to load the state back. But this naive implementation also copies all of the memory, which is not good: it doesn't really meet our expectations or deliver much of the promised benefit of live update. It is definitely not hard to improve on this, and the community is also working actively on it.
The key is to avoid copying the bulk of the guest memory, and instead hand it over from the old slot or load it from somewhere else. On the mailing list, Steve Sistare from Oracle has posted the CPR (checkpoint and restart) patch set, which enables efficient live update using QEMU's live migration framework — I will explain it shortly. In addition to efficiency, the CPR patch series also supports VFIO PCI devices.

So let's have a quick overview of how VFIO PCI devices work, to get a better idea of how they fit into the big picture. VFIO PCI is essentially the control path over the IOMMU that the host kernel exposes, so that the VMM has some control over how the IOMMU behaves. It ultimately allows the guest to use a function of a physical device — a VF or PF of a PCI device — with the VMM setting up the DMA mappings and the interrupt relay on the host side. The DMA and interrupt mapping tables for the IOMMU are still fully managed by the host kernel, but they are usually initialized when the guest starts and then never changed again. So we can observe a stable mapping once the guest has started: the guest physical address space, the host physical address space, the mapping between them, and the DMA mappings (which are essentially the same translation) don't usually change once established, unless you do special operations such as hot-plugging. In the most basic case it can be seen as fairly static state.

With that, we can look at how the CPR approach supports VFIO PCI in its different operation modes. There are basically two modes, both supporting VFIO PCI with different restrictions. The first mode is called reboot mode, meaning you can optionally reboot your host kernel after you have saved the state somewhere. What happens here is that a guest agent is installed; it receives a request to live update cooperatively, and it asks the guest driver to stop and discard any state, as if the device were disabled. In the process, the device and the driver both discard their state, and they only come back up after the state is loaded back and the guest starts running again. This approach makes sure that the IOMMU state lost during the kernel reboot doesn't affect guest operation, because we know the device is inactive and nothing can happen; once the guest driver re-initializes, everything starts over from scratch.

In comparison, there is also another mode in CPR, called exec. Exec means the old QEMU exec()s into a new binary, so obviously no host kernel reboot is allowed — you cannot exec across a rebooted kernel. To support VFIO PCI here, the file descriptors are preserved with a few introduced changes, in particular clearing the close-on-exec (FD_CLOEXEC) flags so that the VFIO PCI related descriptors are preserved and everything still exists when the new QEMU binary takes over. There are also a couple of VFIO operations added to make sure the DMA state stays consistent.

So, a quick recap of the different approaches, focusing a little on VFIO PCI: looking at the table, it seems we always have to trade something away if we want VFIO PCI in the picture while doing live update. Is there anything else, or something we can do better? I think the answer is probably yes. That's the topic of this talk, and we are focusing on this corner.
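To make the earlier point about DMA mappings concrete before moving on, here is a minimal sketch of how a VMM typically installs one of these mappings, assuming the legacy VFIO type1 container API (the newer iommufd path is similar in spirit). The `container_fd` here is a hypothetical, already-configured VFIO container file descriptor, and the IOVA is normally chosen to equal the guest physical address:

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Map one range of guest RAM (already mmap'ed at `vaddr` in the VMM) so the
 * passthrough device can DMA to it at `iova` (usually equal to the GPA). */
static int map_guest_ram(int container_fd, void *vaddr,
                         uint64_t iova, uint64_t size)
{
        struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                .vaddr = (uint64_t)(uintptr_t)vaddr,
                .iova  = iova,
                .size  = size,
        };

        /* The kernel pins the pages and installs the IOMMU page-table
         * entries; the mapping then stays until explicitly unmapped. */
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
```

Once installed, such a mapping stays in place until it is explicitly unmapped, which is exactly the static behaviour the rest of this talk relies on.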
So, in the CPR reboot mode: can we avoid the guest modification and remove the guest agent, so that the device is not shut down beforehand? Before we get to the solution, let's take another look at what is involved. In that mode, if we didn't install the guest agent, what would happen? What would go wrong? A few things would definitely cause trouble.

First, when the old QEMU shuts down, it clears and tears down the VFIO container, which effectively drops all the DMA mappings — the IOMMU tables are no longer there. Second, when we reboot, the kernel shuts devices down through the PCI core code, which also resets the passthrough device. Then the new kernel starts up, probing and resetting all discovered PCI devices, including the passthrough device, which again leaves the device state inconsistent. And the new QEMU also starts on the new kernel: a new VFIO container is created and the device is attached to the address space, which involves yet another device reset. Every single step of this process makes the hardware state inconsistent with the guest driver's knowledge, and there will obviously be I/O errors in the guest, as if the device were broken.

To put it another way, we want the device to look exactly the same as before, in the same state. That means the device must not be reset, and anything the guest has written to the config space or the BARs must still be in effect. Implicitly, this also includes the IOMMU mappings — not just at the moment of resuming, but throughout the whole process. Imagine a network card that continues to receive packets from outside: it will want to DMA into RX buffers sitting in guest pages.

So how do we achieve this? It sounds like a lot, but the key, in a nutshell — which is the topic of today — is static page allocation, to prevent all of this state from going wrong. Static here means two things: the guest pages are static before and after the reboot, and the DMA mappings are static and maintained the whole time. Apart from that, there is also interrupt remapping, but that is not as critical, because it can be disabled and discarded before the reboot and re-established later. As a consequence, we need a notification — which can be a spurious IRQ — once the guest is back up, to cover the possibility that an interrupt was lost in the process.

With that, let's see how we manage the static page allocations, starting with the guest RAM pages. The approach, as you can see, has basically three parts. We make use of the DAX support in Linux: we mark a physical address range as an NVDIMM (pmem) type. Note that we are not really changing the volatile nature of the memory itself; we just reuse the kernel's NVDIMM framework to manage and map the pages, in particular its DAX capability. In the first line here, I reserve a 2 GB area starting at 6 GB with the memmap= kernel parameter, because I know that on this particular machine this range is ordinary DRAM I can use. Once we boot the kernel — our KVM host — with this parameter, the 2 GB area is no longer handed out for anything else.
Then, in order to assign this memory to the guest, we create a DAX device with the ndctl command line shown here, and we then have a new device node, /dev/dax1.0. It is a kernel character device, but it can be mmap'ed and mapped into the guest. Using QEMU's memory-backend-file option, QEMU picks this up and maps it into the guest address space. This is done the same way before and after the live update, so the GPA, the HPA, and the mapping between them are essentially identical across the reboot.

With the guest memory covered, let's also look at the IOMMU page tables. For these, we introduced a small static page allocator module in the host kernel. It works on a reserved RAM region, similar to the DAX idea before, but this time we mark a new type number in the e820 table, reserved specifically for this allocator, and the corresponding syntax is a new marker for the memmap= command-line argument. The allocator is designed so that all of the data, as well as the metadata — which you can think of as the superblock of a filesystem sitting on pmem — lives in the reserved pages, so the state survives kexec. That works because kexec does not blanket-clear all pages: the data is still there after the reboot and is not reset.

The module exports a very simple API for other kernel code to request pages from the reserved area. The first call is a fixed allocation: a type and an index into a small sub-area are passed as parameters, and the request resolves against a predefined static layout of the region — not flexible, not dynamic, basically hard-coded — so it returns a predictable offset into the reservation for any given purpose. The second is a dynamic allocate/free function pair based on a simple bitmap; it can be used to allocate pages dynamically, with the idea that they are referenced from the fixed pages, which act as the roots.

With such an API, I patched the Intel IOMMU driver a little in the host kernel to make use of the static pages. The changes shown here depend on the new option; when it is enabled, the driver starts allocating all the relevant pages from the reserved area. It is a relatively simple search-and-replace, thanks to how the Intel IOMMU driver currently allocates its pages: I replaced the call that originally allocates a page-table page with the fixed-page API for the root pages, which are the entry points for the IOMMU lookups, and in all the other places — such as the actual page tables for the DMA mappings — we use the dynamic allocation API, with those pages referenced from the root pages. With these changes, the addresses of the IOMMU page-table structures are stable, and the data in the tables is stable too, thanks to the fact that the guest RAM itself is always static.
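As a side note on the guest-RAM part: what QEMU's memory-backend-file (with shared mapping) ends up doing is essentially an mmap of that device-DAX node. A simplified sketch, reusing the /dev/dax1.0 path and the 2 GB size from the example configuration above:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define GUEST_RAM_SIZE (2UL << 30)   /* the 2 GB region reserved via memmap= */

/* Map the device-DAX node backing the reserved range. Because this is device
 * DAX, the virtual mapping is backed directly by the fixed physical pages,
 * so the guest lands on the same HPAs before and after the live update. */
static void *map_guest_ram(void)
{
        int fd = open("/dev/dax1.0", O_RDWR);
        if (fd < 0)
                return MAP_FAILED;

        /* Device DAX requires the mapping to respect the namespace alignment
         * (2 MB by default), which a 2 GB, offset-0 mapping satisfies. */
        void *p = mmap(NULL, GUEST_RAM_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        close(fd);
        return p;
}
```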
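To illustrate the shape of the reserved-page allocator API, here is a toy model in C. The names, layout, and sizes are paraphrased from the talk and purely illustrative: the real module reserves physical pages through the new e820/memmap= marker and keeps its bitmap inside the reserved region itself, so it survives kexec, whereas this sketch just uses a static array to stand in for that region.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE    4096
#define FIXED_PAGES  16          /* predefined static layout at the start   */
#define DYN_PAGES    1024        /* dynamically handed-out pages after that */

enum kram_fixed_type {
        KRAM_FIXED_IOMMU_ROOT,   /* e.g. the IOMMU root/context tables */
        KRAM_FIXED_TYPE_MAX,
};

/* Stand-in for the reserved physical range that survives the kexec reboot. */
static uint8_t reserved[(FIXED_PAGES + DYN_PAGES) * PAGE_SIZE];
static bool    dyn_used[DYN_PAGES];   /* bitmap for the dynamic pool */

/* Fixed allocation: type + index resolve against a hard-coded layout, so the
 * same caller finds the same page again after the new kernel boots. */
void *kram_get_fixed_page(enum kram_fixed_type type, unsigned int index)
{
        unsigned int slot = (unsigned int)type * 4u + index;  /* toy layout */

        if (slot >= FIXED_PAGES)
                return NULL;
        return &reserved[slot * PAGE_SIZE];
}

/* Dynamic allocation: simple bitmap scan over the remaining pages. These are
 * only reachable through pointers stored in the fixed (root) pages. */
void *kram_alloc_page(void)
{
        for (size_t i = 0; i < DYN_PAGES; i++) {
                if (!dyn_used[i]) {
                        dyn_used[i] = true;
                        return &reserved[(FIXED_PAGES + i) * PAGE_SIZE];
                }
        }
        return NULL;
}

void kram_free_page(void *page)
{
        size_t off = (size_t)((uint8_t *)page -
                              &reserved[FIXED_PAGES * PAGE_SIZE]);
        dyn_used[off / PAGE_SIZE] = false;
}
```

In the patched IOMMU driver, the idea is that the root/context tables come from the fixed pages, and the regular DMA page-table pages come from the dynamic pool, so everything remains reachable and intact across the reboot.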
Then there are a few more changes in the VFIO PCI part. The goal is to avoid the state changes that would break the guest driver, as we saw earlier — the various device resets and so on. We still want to quit the old QEMU and reboot the host, so we must do a few things differently, both in QEMU and in the VFIO PCI kernel code, to keep the guest happy.

Conceptually, the changes are organized behind a new operation mode, spanning a few layers, called raw mode. On the VFIO PCI side, we add this flag to control the behaviour of the kernel. QEMU sets the flag when it is live updating, asking the VFIO PCI driver to skip a few things, including clearing the bus-master bit during software state changes, and similarly to avoid any device reset and config-space re-initialization, and so on.

Finally, QEMU also takes care of the interrupt state, masking and unmasking the vectors to avoid inconsistent interrupt events and their consequences during the process. If we don't disable interrupts, DMA can still happen when there is activity triggered from outside, such as incoming packets, and then there will be interrupts from the hardware which the host kernel may not be able to handle, because it is in the middle of the kexec reboot and no proper handler is set up yet. So we disable them in QEMU and re-enable them after the guest is restored and everything is in place and ready — at the cost of needing one spurious notification, to make up for any real notification lost in the middle, so that the guest driver knows it is time to look at its completion or event queues and pick up any new requests or events.
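For what that spurious notification can amount to in practice: assuming the usual setup, where each MSI-X vector of the passthrough device is backed by an eventfd wired into KVM as an irqfd, injecting one extra interrupt from userspace is as simple as writing to that eventfd once the guest has been restored. A minimal sketch, with the `msix_eventfd` descriptor assumed to be already set up elsewhere:

```c
#include <stdint.h>
#include <unistd.h>

/* Inject one "spurious" interrupt for a vector whose eventfd is registered
 * with KVM as an irqfd. The guest sees a normal interrupt and re-scans its
 * completion/event queues, picking up anything that completed while the
 * vector was masked around the kexec reboot. */
static int kick_vector(int msix_eventfd)
{
        uint64_t one = 1;

        return write(msix_eventfd, &one, sizeof(one)) == (ssize_t)sizeof(one)
               ? 0 : -1;
}
```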
That covers a few parts and summarizes the key changes we have done so far. There is definitely more to it, such as hot-plugged devices, and some VFIO PCI kernel state that is not kept consistent — it isn't exposed by the simplest operations, but it can cause trouble if we do more — so much more work is required to make this complete, working, and correct. But for this talk, let's see whether the idea behind it works at all in principle.

We have put together a proof of concept and taken care of some of the other details that we won't have time to cover in this talk, especially how the virtual network can be a tricky thing to handle. By the way, Usama Arif, also my colleague, gave a talk about vhost-user last week at DPDK Userspace, which is quite relevant and interesting for live update, especially if you have vhost-user backends. If you are interested, feel free to come say hi and chat with us about these related topics.

Back to the POC. I tested it in nested KVM, simply because it is very easy to work with and doesn't depend on a specific machine to play with all of this; we will move beyond that later. The hypervisor to be live updated runs in L1, on an outer QEMU with nested virtualization enabled, and I enabled the virtual IOMMU so that the emulated e1000e device, provided by the outermost QEMU, can be passed through to the L2 guest. The kernel and the QEMU in L1 are the ones subject to live update here, and they are modified as described earlier to support the live update process. The QEMU and kernel in L0, and the guest image in L2 — the innermost guest — are not modified and have no guest agent for cooperative live update.

Here is the live update procedure. Step one, I start L2 with the device passed through via VFIO PCI. Then I stage the new kernel and initramfs with the standard kexec load operation in L1. Still in L1, I save the state using the QMP command introduced by the CPR work, which creates a file containing the state. I write this file to a pmem block device, which is also reserved RAM preserved across the kexec reboot, so we can load it back quickly from memory. Then I reboot L1, and the kernel kexecs into the new one. Once the new kernel is up, I start L2 again with the new QEMU and load the state from the DAX/pmem block device, which is done directly by a custom init process in the new kernel, because that is the fastest point to do anything once the kernel is up and ready. This is just to reduce the overall downtime, since we wanted to see how fast we can do this.

And here is the result: the guest still works, and the passthrough device still works. Packets are seen again after around 160 milliseconds, although that number must be taken with a grain of salt, because it depends on the size of everything and on the configuration. But the result is positive, because we could live update both the host kernel — the KVM kernel — and our QEMU. The chart here was generated to break down what is taking time: the yellow in the middle is mostly the new kernel booting itself, the red on the right is loading the state and setting a few things up, and the red on the left is mostly kexec and saving the state.

So that is mostly all of the work. Going forward, we definitely plan to do more coding and code cleanup, and to handle corner cases and errors. As I mentioned, we want to test on some more serious, beefy bare metal — server configurations with many resources, et cetera — and to integrate with a control plane to prove that this can fit into private and public cloud live update scenarios. AMD and Arm support also needs more work, because the IOMMU is different on different platforms. And finally, for the downtime of the kexec process and the new kernel boot, we have some other ideas to optimize that as well, because it matters for the customer and the workload.

With that, I think that's it from my side. Thanks — any questions? If not, okay, thank you for your time, and I hope you have a good rest of the KVM Forum.