Hello, everyone. My name is Peter Xu. I come from Red Hat. Today I'm going to talk about KVM dirty page tracking. So what is dirty tracking? It is the method to track guest memory changes, for different reasons. In my understanding, there are actually two types of dirty tracking. The first one is synchronous, which means the tracee, or say the guest, needs to be blocked during the process; for example, the shadow page table tracking, or VM live snapshot. The other one is more commonly known because of migration, and that is asynchronous dirty tracking, in which case we don't really need to stop the virtual machine or the vCPU as the tracee. Instead, we report the dirty information afterwards using some of the interfaces provided. I will talk slightly more about migration in this talk. But before going into that, I'd still like to mention the synchronous dirty tracking that we use in either KVM or QEMU, because it can be overlooked, as it is not as popular as migration, but I still think it is very interesting to know, and it plays a role there. For example, the KVM guest page table tracking. It is only used by shadow paging, so it is not needed with TDP. What it is used for is to track the guest page table pages: we want to make sure the shadow page table, which mirrors the guest page table, is always synchronized with the guest one. So we need to invalidate the shadow page table synchronously when we detect that a guest page table page changed. It's a KVM-internal interface: there is actually an interface inside the kernel, so kernel code can use it, but it's not exposed to user space. The next example is VM live snapshot. That is when we want to take a snapshot of the guest devices, including memory of course, at a single point in time. It is based on userfaultfd write protection so far, so it only supports anonymous memory.
Because we don't really have a KVM interface for synchronous tracking, we need to use the common mm interfaces like userfaultfd. It looks really similar to migration, but it actually tracks dirty pages synchronously, and it samples the guest memory state at the start of the process rather than at the end, while migration stores the page state at the end of migration. So they look similar, but they are really different. Asynchronous dirty tracking is mostly all about migration, and for that we actually have two steps to achieve it. Firstly, we need to trap the writes, which is what we also do for synchronous dirty tracking. There are two ways to do that: write protection is the most common one, and PML (Page Modification Logging) is only applicable to asynchronous dirty tracking like migration, because it doesn't even need to interrupt the guest vCPU from execution, and it is very efficient. This allows us to trap when pages are written. On the reporting side, we have two interfaces so far in KVM. The first one is the long-standing dirty log, and there is also the newer clear log interface, which I would say is also part of the dirty log: a per-VM bitmap-based interface. Dirty ring was just added to Linux in the past year, and it's a per-vCPU ring-based solution. It's debatable whether dirty ring could also use the clear interface; there is a trade-off, in that we could have a more efficient write-protect mechanism, but in the meantime there would be more TLB flushes. So maybe they will suit different needs, but in this talk I didn't plan to go deep into that. Next I would like to share some of my understanding of dirty page tracking, how it affects migrations, and the challenges that we are seeing. Firstly, I would like to say that upstream KVM, and actually QEMU as well, is evolving towards more efficient migrations. I'll share some examples.
Firstly, there is Keqian's work from Huawei on lazy write-protect of huge pages when we enable dirty tracking. That only works with initial-all-set when the clear dirty log is enabled, but that should really be the default that we use for dirty logging. Secondly, there is Ben's work, and I believe it is teamwork, on the TDP MMU, which is a huge change, and the new TDP MMU may replace the old one, I think. It replaced the TDP MMU spinlock with a read-write lock, and it allows concurrent page faults; of course it allows concurrent dirty tracking as well, which is great. So I think it's a very important step towards huge VMs, both on supporting huge VMs in general and especially on the migration side, which I will talk about later. The third thing is the KVM dirty ring. It landed last year, as mentioned, not only in Linux but also in QEMU. The QEMU support is only an initial one, because I think we need more work to finally enable dirty ring as a whole feature. I think it's already working and making sense, but there is more work to do. Generally, I think KVM is evolving even faster in the last year, so QEMU should really catch up with it. And I really think huge VM migration will be a very important topic to be discussed, not only in the past year but also in the next few years. So what are huge VM migrations? What is a huge virtual machine? I see it as one that has many vCPUs, like more than 100, and a lot of memory, like at least 1TB. In those cases it probably means the user is very serious about using these virtual machines, and they really run some important workloads. So there is also an implicit requirement of some quality of service, sometimes even during migration: they would like the workloads to perform not too badly even while migrating. It's not a hard limit, but it happens in some cases, and it happens even more in the huge virtual machine cases. That brings quite a lot of challenges.
Firstly, it's about the scaling of existing algorithms. Some algorithms may run well on small virtual machines; however, they may not really run well on huge virtual machines, and that can be a problem. The second one is about auto-converge. Convergence is always the issue, and as we know, with more vCPU power and more memory, convergence is even harder on huge virtual machines. Not to mention that people care even more about the workload, and even during migration they don't really want to stall it. So auto-converge, and similar ideas based on throttling, may not really work in some use cases; people may have a stricter requirement on migration. The third one is about hugetlbfs. That seems to be the de facto standard for huge virtual machines. It may not be required for all of them, but it's growing, and the users are getting more numerous, so I think we should handle hugetlbfs well in every aspect. I tried to summarize all of these challenges, not only the ones that I mentioned, but at least the ones that I think are important in the near future, and the solutions that may help. The first one is, of course, the non-scaling algorithm issue. I think it's a long-term effort, not only in QEMU but also in KVM, and it's getting better. I think it's really getting better, but there is still more work to do. The second one is convergence. Convergence can be a very important factor for huge virtual machines, and we probably don't have a chance to throttle those guests: even if we have dirty ring to give better granularity on throttling, maybe it won't work, because even if we only throttle the vCPU threads, it will affect the performance inside the guest. So postcopy is probably required, because at least it does not proactively stop the guest from running. The third issue is about the data copy bottleneck. As the virtual machine gets bigger, the data copy will become a problem.
We do observe in some of the perf traces that a lot of time is used in memory copies and so on. In this case, we may want either multifd or zero-copy sockets, or even both, so that we can make sure we can saturate the network when there is still a chance to, when we have a very fast network. It is also possible that the network is really slow, so the bottleneck will be on the network side; that's another problem. But so far, we should really care about the data copy bottleneck, from the profiling percentage point of view. The fourth and fifth issues are all about downtime. Basically, downtime is not about throttling the guest but stopping the guest, so there can be an issue when we stop the guest, because the workloads will be affected, and that is not good. Firstly, there is the downtime during postcopy handling of page faults. As we mentioned, hugetlbfs could be a major factor in the future, and even for now we need to think about it, because hugetlbfs adds major latency to page fault handling for postcopy. There is one idea called the doublemap of hugetlbfs that can reduce page fault latency, but this is still an idea. It would allow hugetlbfs 1GB pages to be mapped in smaller chunks, which means a page fault can be handled, and the data sent, faster: we don't need to wait for a whole huge page to be filled to resolve one fault anymore. After migration, we can of course still merge the small pages back into huge pages. Previously we did that for THP as well; now this is hugetlbfs, and it would work the same. Again, this is still an idea and not available upstream, but it has been discussed upstream in the mm community. The fifth issue is the downtime when we switch from precopy to postcopy. As we mentioned, postcopy is probably required, and the switch can cost downtime as well, just like the end phase of precopy.
So one thing we could consider leveraging is the work from Google that introduced the userfaultfd minor mode, which allows the destination VM to run earlier on the already-received pages. When I say received pages, I don't mean they are always up to date. For example, what we do right now is we migrate the pages, and each page can be migrated multiple times; it means there will be some pages on the destination, but not all of them will be the latest, since some of them got dirtied again on the source. When we switch to postcopy, we may not know which ones are the latest. So what we could do is start the destination virtual machine before we copy over all the bitmap and all that information, because that can take time, and just run it. We wipe the page tables to make sure there are no page table entries installed, but we keep the page cache installed. Then when any page is accessed, we get a page fault, and we ask the source whether the page is up to date or not. If it's up to date, it's okay; if it's stale, the source sends the page alongside, so we can resolve the page fault on the destination. That is called a minor fault. One thing to mention is that there will be no anonymous memory support, but it will support shmem and hugetlbfs. So for huge VM migrations it could still be helpful, even though QEMU supports a lot of anonymous-memory-based cases, because for huge VMs, as long as hugetlbfs is used, minor faults will be supported. And QEMU may need some new madvise syscall to zap the page tables but keep the page cache: something like MADV_DONTNEED, but where we don't really want to remove the page, only the page table entries, so that we can trap the page accesses later. All of these are just wild ideas that I think may solve those problems, and all of them are still up for discussion. So it's more like a sharing of how I understand these problems.
Before I end the talk, I would like to share one example regarding dirty page tracking and the non-scaling issue, which is when we copy the bitmaps. We have a very common interface of the dirty bitmap to report dirty pages, and we actually have dirty bitmaps in both QEMU and KVM. What we measured is that when we tried to migrate a 3TB guest, which has a bitmap of about 100MB, the synchronization of the dirty page bitmap took 200 milliseconds. That's a huge amount of time. KVM itself is very fast: it just copies the bitmap with the clear log. So a lot of the time is spent on moving the bitmaps inside QEMU. The reasons are: we have three layers of bitmaps in QEMU, namely the KVM slot bitmap, the RAMList dirty bitmap, and the migration bitmap, and we move data between them. Different components can have standalone bitmaps as well: KVM is not a device but an accelerator, and it has its own dirty bitmap for sure; a vhost device can have its own bitmap, and VFIO as well. And we normally copy bitmaps using compare-exchange, or say atomic operations, for thread safety, which is good and very safe. But firstly, we may consider merging and reducing the bitmap layers and operations in QEMU, because there are a lot. And secondly, which I want to talk slightly more about because it's easier, is whether we can copy the bitmaps more efficiently. This comes down to whether we should copy bitmaps using atomic operations at all. As we know, atomic operations are heavily used in dirty bitmap operations for thread safety, and actually in every bitmap operation, not only the dirty bitmap. However, atomic operations are not so cheap, because they need to lock the memory bus. I did a quick measurement using the exchange instruction, with the memory bus locked, versus a normal memory copy, a move instruction for example, and I measured it on my laptop. I did it in two cases. The first case is when the data memory accesses all hit in the L1 cache.
How I do that is I just use a single value, and I repeatedly compare-exchange on this single value to read it out. It is actually eight times slower than a pure memory copy; I think that's the pure overhead of memory bus locking. And if I try the all-cache-miss-in-L3 case, meaning I exchange over a huge bitmap larger than the L3 cache, it is actually only three times slower, but still, it's three times slower. For more numbers, we can refer to this slide. So I started to think maybe this was an accident, and I should try it on more than one host. So I tried it on another testing machine that I still have, which has a Xeon E5 CPU. I tried on both of them, and I noticed there is some difference, but the ratio is very close to three times, or even 3.5 on that testing machine. The test I did was just to copy the bitmap for 8TB of memory, which is actually a 256-megabyte bitmap. The test case can be found at the link at the bottom; it is a very simple program. So it means maybe it's not efficient to copy a huge bitmap using exchange: we don't need to lock the memory bus so frequently. So what's the solution? I had a look at the KVM side first. I think KVM does not have such an issue, mostly because with clear log, we do the copy-to-user of the bitmap without compare-exchange, so that is the efficient way that we should use. When we write-protect, exchange is used; however, the overhead is probably buried in the page table walks. What I mean is that, firstly, we can look at block B, which is the get-dirty-log process. I think we don't really need the exchange there, so we could use normal memory copies. But it shouldn't really matter a lot, because we have extra overheads on, for example, write-protecting the page tables. So this change may or may not matter much, even if I'm right. Block C is the clear-dirty path; it has similar semantics.
In the clear-dirty path we use atomics too, but it's not really that important, because even if we speed up that atomic instruction, later on we need to write-protect those pages, and it won't show a huge effect. However, for QEMU, I think we really need to reconsider, because as I mentioned, QEMU has a lot of layers of bitmaps and a lot of bitmap-moving operations. We really need to rework copying and merging bitmaps. The solution is probably very similar to KVM's: we need to make sure that when we copy a huge bitmap, it is done without atomic operations. How do we do that? When setting dirty bits, we can use a read-write lock plus atomics. The atomics are used to guarantee thread safety among the concurrent read-lock holders: when setting dirty bits from a lot of vCPU threads, imagine TCG running and wanting to set dirty bits, each of them should take the read lock, plus atomically set the dirty bit. That is okay because they only set one bit each. But when we collect and copy the dirty pages, we should take the write lock, to make sure there is no TCG or vCPU thread running to dirty pages again. As long as we hold the write lock, we should be able to use a plain memory copy, and that will avoid races. And by the way, a read-write lock contains memory barriers by nature, so it should always give the latest information; we don't need to worry that a dirty bit was set but we didn't collect it. But this is just an idea: I never tried to verify it, and I need to verify it later. Okay, that is all the things that I planned to share. All comments are greatly welcome. Thank you very much.