Hi everyone, my name is Lu Xiumin, and I mainly work on virtualization-related projects. Today my topic is deep optimization of VMM live upgrade, and the talk is divided into five parts.

First, the background. VMM live upgrade is an efficient way to upgrade the VMM software, QEMU and KVM, to a new version without interrupting the guest. It can be used to deliver critical bug fixes and new features to the VMM. It is well known that the service downtime of live upgrade is a major concern for cloud providers. Live upgrade already does much better than live migration and uses fewer resources, but in some cases, especially for large virtual machines, the total downtime can still be as long as several seconds, which is obviously unacceptable for latency-sensitive applications. So we made an effort to further minimize the live upgrade downtime.

Before that, let me introduce our basic VMM live upgrade framework. First, we exec() a new QEMU binary so that it can inherit any fd from the old QEMU, including the memory fds; because of this, we don't need the iterative memory copy. This picture also shows the flow of live upgrade, which is similar to live migration. First, load the new QEMU and do the initialization. When that finishes, notify the old QEMU to suspend the guest, which is the start point of the downtime. The suspend pauses the vCPUs and stops the devices; then the device state is saved and loaded in parallel in the old and new QEMU. We use shared memory to sync and transfer device state between the old and new QEMU to optimize the downtime. Finally, the new QEMU has the full VM state and starts up, which is the end point of the downtime.

We also divide the KVM module into multiple duplicated modules, which can be loaded simultaneously. Thus we can upgrade KVM by loading a new KVM module (the occupied KVM module will not be overridden) and then migrating VMs from the old module to the new one.
In order to do the optimization, we should first analyze the downtime of live upgrade. This picture shows the downtime breakdown of a VM with 64 vCPUs and 256 GB of memory, one multi-queue vhost-user network device, and two multi-queue vhost-user block devices. The left part happens in the old QEMU and mainly consists of VM stop and VM state save; the right part happens in the new QEMU and mainly consists of VM state load and VM start. We can see that the main time costs are the vhost-user device stop and start, within VM stop and VM start respectively. Device state transfer also takes some time.

Then, our optimizations. Now that we know the hot spots, the most direct question is: is it necessary or efficient to stop and start the devices and transfer the device state by directly reusing the live migration framework? For QEMU's internal emulated devices, yes, we still need that process, but their number is usually limited and the time cost is relatively small. But for external devices, such as vhost-user devices, whose data path involves little VMM intervention, it is unnecessary and can be optimized.

Let's take vhost-user devices, e.g. DPDK and SPDK, as an example. As said before, we can inherit any fd from the old QEMU by clearing the close-on-exec flag. So here, for vhost-user devices, we inherit the channels, specifically the main channels and slave channels, and also the fds, mainly the kick and call fds, and the inflight fds of SPDK, shared between the VMM and the vhost-user backend. Then we use them directly in the new QEMU and skip the related init processes.

After inheriting the fds, we can make VMM live upgrade transparent to the vhost-user backend. First, don't stop the backend in the old QEMU; just keep it running. Second, skip all the set messages, such as SET_MEM_TABLE, SET_VRING_NUM, SET_VRING_ADDR, SET_FEATURES, and so on, and also some get messages, namely GET_VRING_BASE and GET_INFLIGHT_FD. We skip these messages to the backends in the new QEMU.
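The message-skipping rule above can be sketched as follows. The message names follow the vhost-user protocol, but the helper itself is hypothetical, not QEMU code: in live-upgrade mode, the init-time set/get messages are simply not sent, because the inherited fds and the still-running backend already hold that state.

```c
/* Sketch: decide whether a vhost-user message should go to the backend. */
#include <assert.h>
#include <stdbool.h>

enum vu_msg {
    VU_SET_FEATURES,
    VU_SET_MEM_TABLE,
    VU_SET_VRING_NUM,
    VU_SET_VRING_ADDR,
    VU_GET_VRING_BASE,
    VU_GET_INFLIGHT_FD,
    VU_OTHER,               /* anything outside the skip list */
};

/* Returns true if the message should actually be sent to the backend. */
static bool should_send(enum vu_msg m, bool live_upgrade)
{
    if (!live_upgrade)
        return true;        /* normal start-up: full init handshake */
    switch (m) {
    case VU_SET_FEATURES:
    case VU_SET_MEM_TABLE:
    case VU_SET_VRING_NUM:
    case VU_SET_VRING_ADDR:
    case VU_GET_VRING_BASE:
    case VU_GET_INFLIGHT_FD:
        return false;       /* skipped: state travels via inherited fds */
    default:
        return true;
    }
}
```

The key design point is that none of the skipped messages carry information the backend doesn't already have, so the backend never needs to know an upgrade happened.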
By this, the backends are unaware of the live upgrade. However, there are some issues worth noting.

First, since the backends keep running, they may trigger interrupts via the call fds even after the guest has paused. The new KVM may then miss interrupts which were received and are still pending in the old KVM. Our solution is simple: supplement the interrupts unconditionally when finishing the live upgrade.

Second, if a backend crashes or sends a slave message to the master, it is uncertain which QEMU will receive the message, and the messages may get crossed. To avoid this, we let the new QEMU start listening on the slave channel only when the upgrade finishes, and if there is any backend crash or slave request during the upgrade, we just fail the upgrade. As far as I know, the slave channel is rarely used in DPDK or SPDK, and crashes are also rare, so this limitation has little influence on the success rate.

Third, since we don't update the memory table during live upgrade, there will be a stale memory table in the backends. To solve this, we can still do the update but without side effects such as a backend restart, or, as an optimization, we can map the guest RAM at a fixed and very high address.

Also, since the virtqueue-related state of the data plane is kept in the guest and the backends, there is no need to transfer this state from the old QEMU to the new one. Apart from the data plane state, the rest is mainly the config state, which changes much less during the VM lifetime. So we can pre-transfer it before the VM pauses and keep track of config changes after the pre-transfer, so that if any change occurs, we can re-transfer the config state after the pause, as before. By this, we can, with high probability, skip the state transfer of vhost-user devices within the downtime.

Vhost-user is one important case, but our optimizations can also be applied to other vhost devices and so on in a similar manner.
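The pre-transfer idea above can be sketched like this. The structure and names are hypothetical, for illustration only: snapshot the rarely-changing config state before the VM pauses, track whether it changed, and re-transfer inside the downtime window only when it did.

```c
/* Sketch of config-state pre-transfer with change tracking. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

struct dev_config {
    unsigned char bytes[64];
    bool dirty;                      /* set by the config-write path */
};

/* Before pause: copy the config out and start tracking changes. */
static void pre_transfer(struct dev_config *src, unsigned char dst[64])
{
    memcpy(dst, src->bytes, 64);
    src->dirty = false;
}

/* After pause: only the (rare) dirty case costs downtime.
 * Returns true when a re-transfer was actually needed. */
static bool post_pause_transfer(struct dev_config *src, unsigned char dst[64])
{
    if (!src->dirty)
        return false;                /* common case: nothing to do */
    memcpy(dst, src->bytes, 64);
    return true;
}
```

Since config state rarely changes during the VM lifetime, the common case pays nothing inside the downtime window, which is what makes skipping the transfer "with high probability" work.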
And even for KVM, we can add a mode, namely KVM-only upgrade, in which we inherit the KVM fds and skip the related init processes in the new KVM. Thus, there is no need to sync the vCPU state from the old KVM, transfer it, and put it back into the new KVM; these steps can be skipped. By this, we make KVM-only live upgrade more lightweight.

Now to the experiments, where I will show the effects of our optimizations. First, we compared the downtime with and without our optimizations under three workloads: idle, CPU stress, and FIO. In this picture, the left is a relatively small VM with 16 vCPUs and 64 GB of memory, one multi-queue vhost-user network device, and two multi-queue vhost-user block devices. The right is a large VM with 64 vCPUs and 256 GB of memory, two network devices, and ten block devices. We can see that the downtime of the small VM has fallen by about 80%, and for the large VM the downtime has fallen by over 90%, dramatically dropping from about one second to only 17 milliseconds. Besides, we can see that after the optimizations there is almost no increase in downtime under the FIO workload compared to idle.

We also measured the effect of our optimizations on packet loss. In this picture, the left is a ping result before the optimizations, and the right is after. Before the optimizations, about 40 packets were lost in just one upgrade, and the max latency was over 2 seconds, much higher than the downtime shown just now. After the optimizations, there is no packet loss at all, and the max latency is close to the measured downtime, about 19 milliseconds.

That's all, thanks. If you have any questions on this topic, you can also contact me at this email. Thank you.