Hello everyone, welcome to the 2020 KVM Forum. My name is Zhang Yulei, and I work for Tencent Cloud, a leading public cloud provider in China. This session's topic is Advanced Parallel Memory Virtualization. We would like to use this opportunity to introduce our idea for mitigating the burden of the MMU lock to boost guest performance in a virtualization environment. After this session, you can reach me through my email address if you are interested, and any questions and suggestions are welcome.

This is the agenda for today's topic. We will start with the issue that inspired us to create this solution and go through the details of the implementation. At the end, we will talk about our plans to keep improving it, and hopefully it can be upstreamed in the future.

In our cloud environment, we hit an issue where there is a significant performance drop in the guest after live migration. The commonalities among these guests are that they have multiple vCPUs, usually have a large amount of memory, and run with huge pages enabled. Workloads in the guest become really slow after migration, which made us wonder what was happening behind the scenes, so we started to debug the issue.

After debugging, we found the following phenomena. From the trace log, there is a burst of page faults after the migration, because when the guest accesses memory on the target, the page tables have to be set up again first. The performance drop also gets worse as we enlarge the guest memory size.

So what is happening? Let's look at the page table setup procedure. This is the diagram of the EPT setup process in current KVM. An EPT violation causes a VM exit, and the EPT violation handler first checks whether the address is MMIO. If not, it checks whether the fault can be handled by the fast page fault path, which is used for dirty logging and access tracking. If not either, it goes through the regular page fault handler: it pins the guest memory first, and then, before updating the page table, it has to acquire the MMU lock and hold it until the work is done (a simplified code sketch of this path follows below). So you can see the MMU lock blocks the other vCPUs from touching the page table and forces them into a sequential operation. If multiple vCPUs have to update the page table at the same time, they wait in line, and they cannot return to the guest to work on their real tasks until the page table is ready. So the performance drop in the guest is expected.

Now we know the bottleneck is the MMU lock on the page fault handling path. But the lock is there to protect the page table and synchronize the updates, so can we get rid of it? To solve this problem and avoid the burden of the MMU lock, here comes our proposal. We pre-construct the page table for the guest, so that it is no longer necessary to VM exit to the host to set up the page mappings during the guest's lifecycle, and we are also able to locklessly update the read-write status in the page table for dirty logging and page tracking.

This is the overview diagram of our implementation. When the guest boots up and its memory regions are set up with the KVM_SET_USER_MEMORY_REGION ioctl, we construct the page table according to the memory slot change, protected by KVM's memslots lock. After that, dirty logging is also supported: its ioctl is protected by the same slots lock, and the updates to the read-write and dirty bits in the PTEs are atomic and lockless.
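To make the serialized path above concrete, here is a heavily simplified sketch, loosely modeled on KVM's tdp_page_fault() of that era. The helper names and signatures are abbreviated for illustration and are not the literal kernel code:

```c
/*
 * Simplified sketch of the serialized EPT-violation path described above.
 * Helper names/signatures are abbreviated; this is not the literal KVM code.
 */
static int tdp_page_fault_sketch(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code)
{
	gfn_t gfn = gpa >> PAGE_SHIFT;
	kvm_pfn_t pfn;
	int r;

	/* 1. MMIO addresses are emulated rather than mapped. */
	if (is_mmio_gpa(vcpu, gpa))		/* illustrative check */
		return handle_mmio(vcpu, gpa);

	/* 2. Fast path: lockless fixups of write-protected or
	 *    access-tracked SPTEs (dirty logging / access tracking). */
	if (fast_page_fault(vcpu, gpa, error_code))
		return RET_PF_RETRY;

	/* 3. Slow path: pin the guest page first... */
	pfn = gfn_to_pfn(vcpu->kvm, gfn);

	/* ...then take mmu_lock; every other faulting vCPU waits here. */
	spin_lock(&vcpu->kvm->mmu_lock);
	r = __direct_map(vcpu, gpa, pfn);	/* install the SPTEs */
	spin_unlock(&vcpu->kvm->mmu_lock);

	return r;
}
```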
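In contrast, the kind of lockless status update the proposal relies on can be sketched as a compare-and-swap loop on the 64-bit entry. This is a minimal sketch: the bit layout follows the EPT format with accessed/dirty bits enabled, and the helper name is our own, not the patchset's:

```c
#define SPTE_WRITABLE	(1ULL << 1)	/* EPT write permission */
#define SPTE_DIRTY	(1ULL << 9)	/* EPT dirty flag (A/D format) */

/* Atomically clear status bits in a live SPTE without taking mmu_lock. */
static bool spte_clear_bits_lockless(u64 *sptep, u64 bits)
{
	u64 old, new;

	do {
		old = READ_ONCE(*sptep);
		if (!(old & bits))
			return false;	/* nothing to clear */
		new = old & ~bits;
		/* Retry if another vCPU changed the SPTE concurrently. */
	} while (cmpxchg64(sptep, old, new) != old);

	return true;
}
```

With hardware dirty logging, this would clear SPTE_DIRTY so that new writes set it again; without it, clearing SPTE_WRITABLE write-protects the page instead, matching the two dirty-logging modes described next.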
Here are the details of the page table pre-construction. After receiving the ioctl about the memory region change, we iterate over the region to pin the guest memory and set up the page mappings at different page-size granularities (a rough sketch appears at the end of this part). The page table root is stored in a pointer named global_root_hpa, so before a vCPU enters the guest, the MMU load path installs our root pointer into CR3 to use the pre-populated page table, which frees the vCPUs from page fault exceptions.

In addition, we want to support migration for such guests. To do so, if huge pages are enabled, we break the page table down to 4K granularity when dirty logging starts (also sketched below). If the hardware supports dirty logging, we clear the dirty bit on the page table entries; otherwise, we write-protect them. And if the migration fails, we restore the huge-page mappings as well. Thus, the original scenario where vCPUs had to line up to update the page table is gone: vCPUs can now update the page table in parallel, which improves guest performance and removes the burden of the MMU lock.

This is the initial performance data we got with the initial patch. We created a guest with 32 vCPUs and 64 GB of memory, and let the vCPUs dirty the entire memory region concurrently, so each vCPU thread dirties 2 GB of memory. We compared the time for each vCPU thread to finish the dirtying job. You can see that at 4K granularity the improvement is huge: with the normal page fault process, each vCPU takes about 18 to 21 seconds to finish the job, while with the pre-populated method it only takes 2 to 2.5 seconds. And if we enable huge pages, for example at 2 MB granularity, the normal code path takes about 2.6 to 3 seconds, which is about 1.5 times longer than our solution.

So, after all, you can see the benefits we get from the new implementation. Since we have lockless access to the page table, guest performance improves because no page faults are needed to set up the page table. We can also save some system resources, as we no longer need the MMU notifier so far, and we can drop the shadow page cache and the reverse mappings.

But there are still some limitations. Currently, we only support the feature with hardware MMU virtualization (EPT) enabled, and so far system management mode is not supported: since we pre-pin the guest memory, supporting SMM would require pinning its separate memory slots as well, which costs extra memory. So we will probably fall back to the original page fault mode when the guest enters SMM and set up those page mappings on demand. Also, because we have to pin the whole guest memory in advance, memory overcommitment is not supported either.

At last, let's talk about something we are working on right now. As Ben Gardon pointed out on the mailing list, we do not support post-copy migration yet. As we know, post-copy relies on userfaultfd to handle page faults in user space. In the post-copy live migration context, the newly spawned VM tries to access a memory page, and if that fails, it fetches the page from the source VM's memory over the network. But with the pre-population method, the page table mappings are already set up on the target side, so there is no trigger for the target VM to fetch the memory from the source side.

As we plan to make this implementation a common solution, we will support post-copy. As this diagram shows, our idea is to partially invalidate the page table mappings with the madvise() system call, so that page faults are generated for userfaultfd on the target side to handle the post-copy (a userspace sketch of this appears below as well).
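First, going back to the pre-construction step at the beginning of this part, here is a rough sketch of what the loop could look like. It would run under kvm->slots_lock from the KVM_SET_USER_MEMORY_REGION path; host_mapping_level() and direct_build_spte() are illustrative stand-ins, not the patchset's exact API:

```c
/* Rough sketch: pre-populate the mappings for a new memslot. */
static int pre_construct_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
{
	gfn_t gfn = slot->base_gfn;

	while (gfn < slot->base_gfn + slot->npages) {
		/* Pin the backing page so it cannot be reclaimed under us. */
		kvm_pfn_t pfn = gfn_to_pfn_memslot(slot, gfn);
		/* Pick the largest granularity the host backing allows. */
		int level = host_mapping_level(kvm, gfn);	/* 4K/2M/1G */

		if (is_error_pfn(pfn))
			return -EFAULT;

		/* Install the PTEs under the pre-built root that the talk
		 * calls global_root_hpa; vCPUs load it before guest entry. */
		direct_build_spte(kvm, gfn, pfn, level);

		gfn += KVM_PAGES_PER_HPAGE(level);
	}
	return 0;
}
```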
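Next, breaking a pre-built 2 MB mapping down to 4 KB entries when dirty logging starts could look roughly like this, reusing SPTE_WRITABLE from the earlier sketch. The helpers here are hypothetical, and the real code would also have to cover the 1 GB case and handle TLB flushing more carefully:

```c
/* Rough sketch: split a 2M leaf SPTE into a table of write-protected 4K SPTEs. */
static void split_huge_spte(struct kvm *kvm, u64 *huge_sptep)
{
	u64 huge = READ_ONCE(*huge_sptep);
	kvm_pfn_t base_pfn = spte_to_pfn(huge);
	u64 *child = alloc_child_table();	/* hypothetical: 512 u64 entries */
	int i;

	for (i = 0; i < 512; i++) {
		u64 spte = make_leaf_spte(base_pfn + i);	/* hypothetical */
		/* Write-protect each 4K entry (or clear the D-bit instead
		 * when hardware dirty logging is available). */
		spte &= ~SPTE_WRITABLE;
		child[i] = spte;
	}

	/* Atomically swap the 2M leaf for a non-leaf entry pointing at the
	 * new table; back off if another vCPU raced with us. */
	if (cmpxchg64(huge_sptep, huge, make_nonleaf_spte(child)) != huge)
		free_child_table(child);
	else
		kvm_flush_remote_tlbs(kvm);
}
```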
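Finally, here is a userspace sketch of the madvise()/userfaultfd idea for post-copy just described. Error handling is omitted, and this is our illustration rather than QEMU's actual code; with this proposal, the kernel side would also have to tear down the pre-built EPT entries for the invalidated range:

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>
#include <unistd.h>

/* Register guest RAM with userfaultfd, then invalidate it so the next
 * guest access raises a missing-page event instead of hitting the
 * pre-built mapping. */
static int arm_postcopy(void *guest_ram, size_t len)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)guest_ram, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* Drop the pages for the not-yet-received range; the guest's next
	 * access faults, and the migration thread resolves it with
	 * UFFDIO_COPY using data fetched from the source. */
	madvise(guest_ram, len, MADV_DONTNEED);

	return uffd;	/* poll/read missing-page events from this fd */
}
```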
This work is ongoing, and we will send out the updated patchset when it is finished. The link on this slide points to the patchset we already sent out; you are welcome to try it and bring us any feedback. I think that's all for today's sharing. Thank you. Any questions?