Hello, everyone. I'm Peter Xu from the Red Hat virtualization team. Today my session is going to be about the KVM dirty ring interface, which is a new interface for KVM dirty tracking. I'll start with some background on migration and dirty tracking, especially on the existing KVM_GET_DIRTY_LOG interface and what we have done to improve it. Then I'll introduce the KVM dirty ring and how it was implemented. At last, I will share some conclusions and future work that we might be able to do.

So this is the workflow of a general VM live migration. As we know, each migration contains quite a few iterations. The first iteration is special in that we migrate all the guest pages, because on the destination node there is nothing yet. Starting from the second iteration, we need to synchronize dirty tracking information, because since the migration is live, the guest is running concurrently while we migrate in the first run. So there will be some new dirty pages, and we need to send those pages again so that the destination node always contains the latest data.

The synchronization of dirty tracking information was previously done by an ioctl called KVM_GET_DIRTY_LOG. This ioctl mainly copies a dirty bitmap from the kernel to user space. Each bit represents one guest page and tells us whether that page was dirtied in the previous iteration.

KVM_GET_DIRTY_LOG is not really ideal. It was very good initially, when VMs were small, but as VMs get bigger the ioctl gets slower. If we look slightly deeper into the ioctl, it mainly does two things. Firstly, it copies the dirty bitmap from the kernel to user space. The dirty bitmap size is linear in the guest memory size, so for a huge-memory guest this copy can take some time. And there is also a second step, which is to reset page protections.
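To make the first step concrete, here is a minimal sketch of how user space typically walks the bitmap that KVM_GET_DIRTY_LOG fills in: one bit per guest page in the memslot. The helper name and signature are illustrative, not part of the KVM API.

```c
#include <stdint.h>
#include <stddef.h>

/* Walk a dirty bitmap (as returned by KVM_GET_DIRTY_LOG) and collect
 * the indices of dirty pages, relative to the memslot base.  Returns
 * the number of dirty pages found; at most out_cap indices are stored. */
size_t collect_dirty_pages(const uint64_t *bitmap, size_t npages,
                           uint64_t *out, size_t out_cap)
{
    size_t found = 0;
    for (size_t i = 0; i < npages; i++) {
        if (bitmap[i / 64] & (1ULL << (i % 64))) {
            if (found < out_cap)
                out[found] = i;     /* dirty page index within the slot */
            found++;
        }
    }
    return found;
}
```

Note that this scan alone is linear in guest memory size, which is exactly the scaling problem described above.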
For example, if we use write protection to track guest writes, it means we need to re-write-protect all the guest pages, and that process can take a long time. What's worse is that step two scans the whole bitmap with mmu_lock held, trying to re-protect all those pages. The thing is, mmu_lock is very expensive. It plays a similar role to mmap_sem in a normal Linux process, because whatever we do, even a normal guest page fault, may need to take this lock to resolve the fault. So if a KVM_GET_DIRTY_LOG thread takes this lock for a long time, all the rest of the vCPU threads can hang for a long time trying to take it. After some measurements, we saw that on some big systems with huge memory this step can take a few seconds or more. It means the vCPUs can hang without responding to user interactions, and the workload can stall. That is not good.

So the community tried to improve this by introducing more capabilities into KVM dirty logging; I call them the variants. The first variant is a new capability called KVM_CAP_MANUAL_DIRTY_LOG_PROTECT. This capability tries to solve two things. Firstly, it separates the steps. As we know, we have two steps in KVM_GET_DIRTY_LOG; with this new capability we are able to separate those two steps into two ioctls. KVM_GET_DIRTY_LOG will only collect the dirty bitmap, but we will keep the pages writable. And we introduce a new ioctl called KVM_CLEAR_DIRTY_LOG, which is responsible for resetting page protections. What's better is that, since this is a new interface, we can let it take an extra range parameter so that we don't need to reset page protection for the whole KVM memslot; we can reset with a finer granularity, covering only a subset of the guest pages in the slot. This new capability greatly improved VM responsiveness, and I believe it is widely used with recent kernels. However, even with variant one, we later noticed that even enabling KVM dirty logging is slow, too.
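The range parameter of KVM_CLEAR_DIRTY_LOG is the key point here. This small helper models the effect of that finer granularity on a user-space bitmap; the real work, re-protecting the pages, happens in the kernel under mmu_lock, so the names here are only a model.

```c
#include <stdint.h>

/* Model of KVM_CLEAR_DIRTY_LOG's range semantics: only the pages in
 * [first_page, first_page + num_pages) are reset, instead of the whole
 * memslot.  Shorter ranges mean shorter mmu_lock hold times kernel-side. */
void clear_dirty_range(uint64_t *bitmap, uint64_t first_page,
                       uint32_t num_pages)
{
    for (uint64_t p = first_page; p < first_page + num_pages; p++)
        bitmap[p / 64] &= ~(1ULL << (p % 64));
}
```

User space can thus reap and clear the log in small chunks, interleaving with vCPU execution instead of stalling all vCPUs at once.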
It's because when we apply the KVM_MEM_LOG_DIRTY_PAGES flag to a memory slot, which is basically how we enable dirty logging, this step requires an initial reset of page protections on all guest RAM. For example, if all the pages were writable, then when we start dirty logging we need to write-protect all of them. And that process can take a long time as well. So to solve this problem, we introduced another new bit called KVM_DIRTY_LOG_INITIALLY_SET. This bit is very interesting because it sounds very simple, but it really solves the problem. The idea is very simple: if this bit is set, we initialize the whole dirty bitmap with ones, which means we assume all the pages were dirty initially. That actually won't affect major user space applications, since for migration the whole dirty bitmap is set to ones in user space anyway. So it won't affect user space migrations. But for KVM it is a gain, because if the whole dirty bitmap is set, it means we don't need to write-protect anything when we enable this feature. So we skip the page protections for the first iteration, which is quite interesting. And after some measurements, it was reported that migration starts even 10 times faster than it did without this bit, for a guest with 128 gigabytes of memory.

So we can see what we did to evolve KVM dirty logging, to make it better and more suitable for huge VMs. But we can also see that the dirty bitmap is both the good and the evil. It's good in that it is an ideal structure for many reasons. Firstly, it is very efficient: we use a single bit to represent a guest small page, which is probably the most efficient way to store this information given that we need to cover the whole of guest memory. And it's very easy to serialize using atomic operations as well, because basically we are playing with bits. However, VMs are getting bigger.
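What KVM_DIRTY_LOG_INITIALLY_SET implies for user space can be modeled in a couple of lines: the bitmap simply starts out all-ones ("everything dirty"), which is what migration code assumes before the first iteration anyway, so the kernel can skip the initial write-protection pass. The helper name here is illustrative.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Start with every page marked dirty, mirroring what user space sees
 * when KVM_DIRTY_LOG_INITIALLY_SET is enabled.  Because every bit is
 * already set, no page needs to be write-protected up front. */
void bitmap_set_all_dirty(uint64_t *bitmap, size_t npages)
{
    memset(bitmap, 0xff, ((npages + 63) / 64) * sizeof(uint64_t));
}
```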
And so are the dirty bitmaps, which means collecting the dirty bitmap will always get slower, because we always need to collect it for the whole VM. It will always be a huge amount of work and huge overhead; it will definitely take time. And sometimes we need to sync dirty bitmaps between source and destination: for post-copy we need to discard parts of the dirty bitmap, and for post-copy recovery we also need to synchronize that information. So fundamentally, the dirty bitmap structure is hard to scale. That's also the reason we may want to change this fundamental structure and think about something else that is friendly to huge virtual machines.

Here comes the KVM dirty ring. This work originated from Lei Cao back in 2017, or even earlier than I'm aware of. Later on, Paolo took it over, and then me. I believe it was initially designed for COLO, which is the so-called high-availability infrastructure for KVM. The design is quite straightforward, but we can see that it's totally different, because the dirty bitmap is gone. Instead of the bitmap, we use per-vCPU rings to store dirty PFNs. PFN stands for page frame number; it's just a page index. The most important thing is that we use per-vCPU rings. This is interesting because it means the data structure is not global anymore. Also, the ring size is configurable; it's not linear in the guest memory size at all. It can be very small, like 4K entries, maybe 64K. The second thing is that from the beginning we separate collection and page protection, so step one and step two can be done separately. And since we are introducing this new structure, we make it a data structure shared between user space and the kernel, by using mmap to map the same pages into both the kernel and user space. So there is no extra copy when we fetch the information; user space just needs to read some memory addresses to fetch it.
And it is very thread friendly, because we use per-vCPU local buffers.

So this is how the KVM dirty ring looks. As we have already mentioned, for each vCPU there is one dirty ring bound to it. Each ring holds a number of page frame number entries that we can configure when we start the virtual machine. We use two extra bits in each entry to keep the status of that entry. We'll talk about it in a moment, because each page index can be either a dirty address or something to be reset; there is a state machine that we need to run.

About the state machine: it is actually quite simple. Each entry initializes in the empty state, which means it is free to use. When some vCPU writes a page and we trap it, the kernel inserts a new dirty entry into that vCPU's dirty ring. This is done by KVM, of course, and the entry is marked as dirty along with the page index. After that, user space will be able to see this newly dirtied page, so it can collect the dirty PFN, and then set the status bit to collected, which tells the kernel: OK, I've finished using this entry, you can recycle it. The last step is done by the kernel again: it reads the PFN, knows it has been consumed, re-protects the page that the PFN points to, and clears the entry, which goes back to the empty state. So it's a quite simple state machine, but this is how we split step one and step two, as we talked about previously.

This is a closer look, but for time reasons I don't plan to dig into it. And this is a comparison between the old dirty logging and the dirty ring; anyone can feel free to reference these two, so I'll skip it as well. We can see that there are quite a lot of differences between the two interfaces.
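The state machine above can be sketched in user-space C. The entry layout mirrors the proposed uapi `struct kvm_dirty_gfn`; the transition helpers are purely illustrative, standing in for the kernel (publish, reset) and for user space (harvest).

```c
#include <stdint.h>

/* One ring entry; two flag bits encode the state. */
struct dirty_gfn {
    uint32_t flags;     /* state bits below */
    uint32_t slot;      /* memslot id */
    uint64_t offset;    /* page offset within the slot */
};

#define DIRTY_GFN_F_DIRTY (1u << 0)   /* kernel published a dirty page */
#define DIRTY_GFN_F_RESET (1u << 1)   /* user space finished collecting */

/* empty -> dirty: KVM traps a vCPU write and fills the entry */
void gfn_publish(struct dirty_gfn *e, uint32_t slot, uint64_t offset)
{
    e->slot = slot;
    e->offset = offset;
    e->flags = DIRTY_GFN_F_DIRTY;
}

/* dirty -> collected: user space records the PFN and marks it consumed;
 * returns 1 if an entry was harvested, 0 if there was nothing to take */
int gfn_harvest(struct dirty_gfn *e, uint64_t *offset_out)
{
    if (!(e->flags & DIRTY_GFN_F_DIRTY))
        return 0;
    *offset_out = e->offset;
    e->flags |= DIRTY_GFN_F_RESET;
    return 1;
}

/* collected -> empty: kernel re-protects the page and recycles the slot
 * (triggered by the KVM_RESET_DIRTY_RINGS ioctl) */
void gfn_reset(struct dirty_gfn *e)
{
    e->flags = 0;
}
```

In the real interface these entries live in pages mmapped into both the kernel and user space, so harvesting is just reading and writing shared memory, with no copies and no ioctl on the hot path.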
So this is an interesting part: because the dirty ring is a totally new structure, it brings something new as well, like the full event. The dirty bitmap can never produce a full event, because it was designed to cover the whole of guest memory, so it can't get full. The dirty ring is different: it is a ring of configurable size, so it can get full. As a vCPU continuously dirties pages, if we don't collect them fast enough, the ring fills up. When that happens, what we do right now is interrupt the write instruction. Instead of continuing this vCPU, we do a vmexit, and actually a user space exit with reason KVM_EXIT_DIRTY_RING_FULL. This is a newly introduced exit reason, so user space will know: OK, this vCPU's ring got full; it has been dirtying RAM quite a bit. User space will reap the dirty ring to free at least some of the slots, send a new ioctl called KVM_RESET_DIRTY_RINGS so that the kernel will recycle those dirty slots, and then continue the vCPU. The vCPU will retry the previously interrupted instruction. The whole process should be quite natural.

But what's funny is that we accidentally introduced a way to handle dirty tracking synchronously, which leads to another interesting fact: a side effect on auto-converge, maybe. This is not something we have with KVM dirty logging, because dirty logging can't do it; it's always asynchronous, and we cannot stop the vCPU. But with the KVM dirty ring, we can. Which means, unlike dirty logging, dirty ring tracking can block a vCPU, and it could provide auto-converge with a finer granularity on what to throttle. Auto-converge always throttles the whole system. For example, we have a throttle parameter that decides how many clock cycles a vCPU can use, but this parameter is applied to all of the vCPUs.
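The full-ring flow above (publish until full, exit to user space, reap, reset, retry) can be simulated in a few lines. This is a model, not the real uapi: the real ring is mmapped shared memory, the exit reason is KVM_EXIT_DIRTY_RING_FULL, and recycling goes through the KVM_RESET_DIRTY_RINGS ioctl.

```c
#include <stdint.h>

#define RING_SIZE 4   /* illustrative; the real ring size is configurable */

struct ring {
    uint64_t pfn[RING_SIZE];
    int used;          /* entries published but not yet recycled */
    int next;          /* next slot the kernel will fill */
};

/* Kernel side: publish a dirty PFN.  Returns 0 when the ring is full,
 * which in real KVM becomes a vmexit with KVM_EXIT_DIRTY_RING_FULL;
 * the vCPU retries the same write after user space drains the ring. */
int ring_publish(struct ring *r, uint64_t pfn)
{
    if (r->used == RING_SIZE)
        return 0;
    r->pfn[r->next] = pfn;
    r->next = (r->next + 1) % RING_SIZE;
    r->used++;
    return 1;
}

/* User space side: reap all pending entries into out[], then have the
 * kernel recycle them (modeled after KVM_RESET_DIRTY_RINGS). */
int ring_reap_and_reset(struct ring *r, uint64_t *out)
{
    int n = r->used;
    int base = (r->next - r->used + RING_SIZE) % RING_SIZE;
    for (int i = 0; i < n; i++)
        out[i] = r->pfn[(base + i) % RING_SIZE];
    r->used = 0;
    return n;
}
```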
This brings a problem. Let's assume there are two threads: one is a worker thread, one is a GUI thread, and the user is using the GUI. If the worker thread dirties guest memory too often or too heavily, the auto-converge throttle will be greatly increased, boosted, so the GUI will stall. This is not good. What we really want is to make the worker thread slower while keeping the GUI running, so the user still sees a responsive system. This is something we can probably improve in the future for auto-converge with the KVM dirty ring, because we now have the ability to identify which vCPU is dirtying too fast. We also get better responsiveness on the trap points. Previously, the auto-converge logic was only run during dirty sync, but dirty sync is very rare: it only happens at the start of each iteration, if we still remember the earlier workflow of live migration. The KVM dirty ring gives us a way to trap nearly every RAM write of a vCPU. Not literally every write, but every write can trigger a ring-full event. So it can be really responsive in evaluating which vCPU is heavily dirtying RAM. With the KVM dirty ring, a better auto-converge becomes really possible; maybe we can have a version 2.0. That's something we can consider later, because it would be work done entirely in user space.

Some quick conclusions. We get quite a few benefits by introducing the KVM dirty ring, and it can be something more than we have today. As I said, I believe it was initially introduced for COLO, but maybe we can find new scenarios for the dirty ring that we haven't imagined before. It definitely reduces the memory footprint of the dirty tracking data structure. Synchronization is cheaper as well, because it can run in the background by just reading some memory, rather than issuing some heavy ioctls. And it is definitely much more friendly to huge VMs.
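As a purely hypothetical sketch of that "auto-converge 2.0" idea: with per-vCPU rings, user space could count harvested entries (or ring-full events) per vCPU and throttle only the heavy dirtier. All names and the selection policy here are made up for illustration; nothing like this exists in QEMU today.

```c
#include <stdint.h>

/* Hypothetical per-vCPU accounting fed by dirty-ring harvesting. */
struct vcpu_stat {
    uint64_t dirty_pages;   /* pages harvested from this vCPU's ring */
};

/* Pick the vCPU dirtying memory fastest; only that one would be slowed
 * down, leaving e.g. the vCPU running a GUI thread responsive.  A real
 * policy would use rates over a time window, not raw counts. */
int pick_vcpu_to_throttle(const struct vcpu_stat *stats, int nvcpus)
{
    int heaviest = 0;
    for (int i = 1; i < nvcpus; i++)
        if (stats[i].dirty_pages > stats[heaviest].dirty_pages)
            heaviest = i;
    return heaviest;
}
```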
So, possible scenarios: the major one should be COLO, but there can be a lot of the things I already mentioned, or even more. There is also future work on the KVM dirty ring. Firstly, I will try to move it forward in the review process and have it merged, because it's still under review. Also, probably after it's merged, we can try more real-world runs with the dirty ring so that we can see what's still missing and what we can make better. The first thing I'm thinking about is whether we can support non-x86 architectures, because right now it only supports x86. We could also have a per-vCPU ring reset; currently the reset ioctl is global. That's another story, definitely not going to be covered in this talk, but we can think about something better, probably after we have more real-world runs to see where the bottleneck is.

For QEMU, there are quite a few things to do. Firstly, we can support the new interface in kvm-all.c. That's something I have already done in my test branch, so after the current series I'll try to move this one forward as well; so that one gets a tick. The next ones get a question mark, because they're just things I have been thinking about. If we are familiar with the QEMU migration infrastructure, we will see that it is not only the KVM layer that has a dirty bitmap; we have dirty bitmaps in quite a few other layers as well, including the RAM block layer and the migration layer. So how about we remove all these dirty bitmaps? It's a bigger piece of work compared to the first one, because the interface part is easy. However, if we want to replace all the dirty bitmaps in the other layers, we probably need to think more, at least about how it affects TCG, VGA, and the other users. We need to make sure they won't be affected. But I believe migration should be the major one.
And also, we need to think about things we might have missed. With all these bitmaps removed, maybe pre-copy would be able to read dirty pages from queues, so it would look more like post-copy. But as a side effect, auto-converge would be on by default, since rings can get full, right? As we have already mentioned, dirty bitmaps won't get full, but rings can. So auto-converge would be a must if we removed all the dirty bitmaps and used queues, ideally per-vCPU queues, to cache the dirty pages. So it could really look something like that. But again, this is an imaginary world, and we can have other solutions. Basically, this is something we may think about that can be really friendly to huge VMs, assuming huge VMs are coming and more people will be using them. And this seems to be one solution, compared to the other one, which uses post-copy with some more enhancements to make post-copy better, more suitable, and more responsive for huge VMs. This approach throttles the source vCPU dirty rate instead, but the most important thing is to keep the whole guest responsive, with smart logic on how to manipulate the dirty rings to control how the vCPUs run. With very good control, maybe we can converge the migration while at the same time keeping the guest responsive.

The last page is some further references. Anyone can check the latest version of the KVM dirty ring series in the first link. I also pasted the repos for testing, for both the kernel and QEMU, just in case anyone is interested. So that's all. Thank you very much.