Hello everyone, my name is Peter Xu and I work for the Red Hat virtualization team. Today my topic is post-copy preemption.

This is the outline of the presentation. First, we will quickly go over live migration, pre-copy and post-copy, and how they work. Then we will focus on post-copy, its limitations and challenges. As a follow-up, I will propose three optimizations on top of vanilla post-copy, each targeting a different aspect of it, and all of them aim to reduce page request latency. At the end some performance results will be shared, and I will also mention some of the future work that I plan to do.

So what is post-copy? Compared to pre-copy, it is a way to let the virtual machine start on the destination without migrating all the data first, which means the virtual machine starts running with only part of its RAM migrated. To achieve this we need a way to trap page faults on missing pages, which is what userfaultfd provides on Linux. Post-copy has one very good property: it always converges. That is probably one of the main reasons people use it, especially for huge virtual machines, or virtual machines with relatively heavy workloads that do not easily converge.

And what is post-copy preemption? It is a new capability introduced only for post-copy, not for pre-copy, because it solves issues specific to post-copy. It needs to be enabled on both the source and the destination QEMU, which is quite common; we need to do that anyway for most capabilities. It is not compatible with vanilla post-copy.
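For reference, the QMP capability name that eventually shipped for this feature is postcopy-preempt (from QEMU 7.1), and enabling it on a monitor looks roughly like this; postcopy-preempt requires plain post-copy to be enabled as well:

```json
{ "execute": "migrate-set-capabilities",
  "arguments": { "capabilities": [
      { "capability": "postcopy-ram",     "state": true },
      { "capability": "postcopy-preempt", "state": true } ] } }
```

The command would be issued on both the source and the destination QEMU before the migration starts.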
That means we cannot migrate from an old QEMU without preemption support to a new QEMU with post-copy preemption enabled; we cannot really migrate from vanilla post-copy into the preemption mode. We need to either use the legacy way and migrate with plain post-copy, or have both binaries new enough to support preemption mode. One good thing to mention is that no extra configuration is needed for the post-copy preemption feature.

So why do we need a new capability at all? Why not make it the default? Simply because this new mode of post-copy changes the live migration stream, that is, the protocol between source and destination, so it just won't work with old binaries. That is why we need a new capability bit. Otherwise, I would really suggest that anybody using post-copy consider trying preemption mode: it does nothing but improve performance on page fault request latency. It is only to stay compatible with old post-copy that we need the new capability bit; that is all of it. There will be some test results shared at the end of the presentation.

Okay, let's quickly go over migration. This is how pre-copy works. In pre-copy what we want to do is migrate a virtual machine live from source to destination. On the source, there are two kinds of pages.
The first kind is dirty pages, meaning the page was modified on the source and no up-to-date copy exists on the destination, so we need to migrate it. There are also clean pages: pages we have already migrated that have not changed since, so source and destination hold the same content even while the source keeps running. As long as these pages are not written to again, they stay clean and we do not need to migrate them again. So in the so-called page stream we only migrate dirty pages, never clean pages.

On the destination there can be a few kinds of pages. First, clean pages, as I mentioned, which are identical to the source. There can be missing pages, meaning no version of the page has been migrated yet. There can also be stale pages, meaning we migrated the page at some point but its content has since changed on the source; the copy on the destination is stale and needs to be updated. Stale pages will eventually be discarded.

When pre-copy completes, all pages on both source and destination are clean. All we need to do then is switch the running state over from source to destination, and the migration is complete. After that we can destroy the source instance safely.

But what if we want to run pre-copy, yet start the virtual machine on the destination before everything is copied over? That is post-copy. The first thing to handle is the stale pages: assuming we have a way to trap accesses to missing pages, we can trap those page faults, but stale pages will not trigger any fault, so we need to discard them beforehand. So this is what post-copy looks like: we first drop the stale pages, turning them into missing pages, so we simply end up with more missing pages.
So we just got more missing page and then We start running the virtual machine on the destination When we access a missing page, we will stop the virtual machine vcpu thread send a page request to the source which I used a Green diamond to show here and then the source virtual machine will send Those requested pages back which I call it our urgent page So in post copy, actually there are two kinds of pages unlike pre copy First one is the background page Which we used to call it the clean pages that we That we are migrating with In the background these pages are not really accessed by the destination yet and there are also some of the pages that are requested explicitly from the users from the Destination virtual machine So they requested Using the page requests in the message channel and this channel is also special to post copy because It does not really exist for pre copy, of course It's not it's optional actually existed but it's optional for post copy. It is required What we do here is we just send The urgent pages alongside with the background pages if there is any urgent page we handle them and we queue them into the same page stream so that Ultimately the page will be resolved the page foot will be resolved on destination so We quickly went over pre copy and post copy Especially on post copy. We do have quite a few limitations Firstly, there is a risk of a split brain For example, if the network fails during post copy as we know we are doing remote page faults It means we cannot do the page fault anymore and the thread can potentially hung for a long time and if We use a very old Q mu it also calls directly directly split brain and the guest will crash After Q mu 3.0. 
Since QEMU 3.0 we have post-copy recovery, which covers this case, so we can resume the post-copy migration after the network has recovered.

The second major issue is the high page request latency. Pre-copy has some penalty too, from tracking dirty pages, but not as large as what we see with post-copy. It shows up in different ways. First, on huge pages: to service a page fault we have to copy over the whole huge page, which can be as large as 2 MiB or even 1 GiB; the exact sizes vary across architectures, but the idea is the same, a huge page contains many small pages, and migrating a whole huge page already takes a long time, so the page fault latency can be drastically large. Upstream Linux actually has work called hugetlb doublemap, or the name may become sub-page mapping or something else; the name is still prone to change. Whatever it ends up being called, it solves the problem by allowing hugetlb pages to be mapped at small granularity, for example 4k on x86, so it gives us a mechanism to migrate huge-page-backed virtual machines just as if we were using 4k pages. That is a great thing to have. Alongside that, we do not lose the TLB hits and all the other huge page benefits before or after migration; we only lose them during migration, which is mostly fine.

Even with that in place for 4k granularity, we still have other issues, because page transfers are really slow even at 4k, if anyone cares to measure. I did a test over a 10 Gbps network, with one busy thread doing random accesses over a large range of memory, and I got an average of 12 milliseconds per page request. Individual requests can take more or less, but on average it is 12 milliseconds, which is quite large for a 4k page. And it is not that the network is slow.
It is really about software overheads, which are not strictly necessary, and we need to look into them to optimize and make things faster, because this test was run between directly attached hosts and it really should not behave like that.

So what is the major problem underneath? The first problem is that we are using a single page stream. This picture is an amplified version of the one we saw just now; I have removed some components to emphasize how the page stream is handled. If we remember, when we get page fault requests we queue the urgent pages into the page stream, shown as the green blocks. But before we do the enqueue, there can already be background pages, shown as the yellow blocks, sitting in the same buffer, which means we cannot service the page fault with the green block before we flush all the yellow blocks.

Within QEMU we cannot easily work around this. For example, we have the QEMUFile abstraction with its buffering; maybe we could do something there, but it would be awkward and challenging, because we would probably need to manipulate the buffers, do memory moves, and other tricks. And it is not only about QEMUFile: think of the page stream as a TCP socket with a send buffer. We still cannot get around that send buffer; background pages can already be queued in it, and we cannot even see that from userspace, it is a kernel thing. That send buffer is genuinely useful for throughput, so we do not want to give it up. We could consider TCP out-of-band messages, but I do not think they are designed to carry a lot of data, and the page requests can be many, especially if the vCPU count is large.
We can have plenty of page requests coming to the source, so we can be sending a lot of urgent pages, which is a lot of data; that may not be suitable for TCP OOB messages.

What we can do instead is simply separate the channels, like this: one channel carries only the background pages, and any urgent pages go over another channel. To make it even better, we add a dedicated page fault resolve thread on the destination just to resolve these faults. This needs some rework of the migration logic, because currently we maintain a lot of global state for RAM migration, but after this change the page fault latencies can be greatly reduced, from my measurements. So that is the first issue we are tackling.

There is another issue, about huge page granularity. QEMU always sends pages in huge page granularity: if we are sending a background page and the guest memory is backed by huge pages on the host, we can only send a whole huge page, and only after that can we send the next one, which can be either another background page or an urgent page that was requested. In other words, we cannot interrupt the sending of a huge page. Why is that? Because QEMU keeps a receive buffer for huge pages on the destination. We cannot copy incoming data directly into guest memory, because the guest would see a partially filled huge page. So we have a huge page buffer: we cache all the data until we have received the complete huge page, and then we atomically update the page table with it. That means we need some kind of buffering, and we cannot afford many of those buffers. We cannot send the first small page of huge page one and then the first small page of huge page two; we cannot cache huge pages one, two, three, four, five and so on forever.
That would use up host memory easily. So far we only have one temporary huge page, and that is why we can only send one huge page at a time.

If we look into this, it is another thing that makes it even slower when we are working with huge pages and a page request arrives. I have amplified the picture further, assuming we already have the urgent page stream. Say we are sending small pages one, two, three, four, which together form a huge page. While sending page one, we get a request for the huge page made of pages five, six, seven, eight. We cannot service it; we have to wait until pages two, three, four are sent. These are all extra latency overheads: we are basically blocked until the whole background host page has been sent.

What we can do, though, is consider interrupting the huge page send. After the first small page of the background huge page is sent, we quickly switch over to the urgent page request and send that right away; afterwards we resume with pages two, three, four. With multiple channels this is actually achievable, because we will have multiple huge page buffers anyway, one per channel, but we need extra logic on the sender side to make sure the interruption is triggered and the resume is done properly.

The last issue is really about the migration thread itself. After we apply solutions one and two, we already see an obvious reduction in page fault request latency, especially on 4k.
Solution two is really for huge pages only; for small pages it does not apply, so the gain there comes mostly from solution one, the channel separation, which in my measurements brings about a 20x speedup.

But we can still do better than that. If we think about it, the migration thread itself is the major bottleneck. When we are sending a background page with sendmsg(), remember the send buffer can fill up, and when it is full the thread blocks while flushing the background buffer. The urgent channel is free, we have a separate channel, but the one thread driving both is blocked there. So we need some way to work around this, probably by using a separate thread to send the urgent pages even while the migration thread is blocked waiting for the NIC to free more send buffer space. Then it will not slow down the handling of urgent page requests.

Why do we use the migration thread as the only thread that migrates pages? For many reasons, but mostly legacy state maintenance: many of the RAM states are managed in the migration thread only and are global. So basically we need to make sure these can be made safe for multiple threads, and quite a few features actually depend on the migration thread.

So what is the solution? We probably need to refactor the global state into something per-channel or per-thread, especially if we can have one thread working on one channel. What I am trying right now is turning the page search status (PageSearchStatus) into a per-channel structure, so we have one per channel.
Basically we have one for the page requests and another for the background pages. And we need a way to manage page ownership: now that more than one thread can send pages, we need to know who sends which page. It is actually very simple, because we have the dirty bitmap anyway; even for post-copy the bitmap is ready. Whoever takes ownership of a page in the bitmap, by clearing its bit from one to zero, owns that page, and the bitmap is protected by the bitmap mutex. Previously it was okay to use atomic operations, but now the bitmap mutex actually protects more things; for more details we can look at the patch sets later, I have the links. One thing to make sure of is that we really need to release all the global locks while sending, for example around sendmsg(), since it can block; it should not block other threads from running.

With all these facilities ready, we can send pages outside the migration thread. We could create a new thread for this, or how about we just send them in the return path thread? The return path thread already receives the page requests, and actually the fastest way to handle one is to send the page back as soon as possible, right after we receive the request. With this we can also drop the page request queue, because it is not needed anymore: all the urgent page requests are received in one thread and the pages are sent from the same thread. It has nothing to do with the migration thread anymore.

So this is a recap of what we had previously, with a separate urgent page stream channel, which already looks good enough, and with solution three it becomes something like this. First, the page request queue is removed; we do not have it.
The page request queue is not needed anymore. For the two channels, we move ownership of the urgent page stream from the migration thread to the return path thread. Since we now have two threads running concurrently, we can actually send pages in parallel. Previously we could only send one page at a time; and when I say send, I mean queue it into the buffers: the sockets were separate, but the thread was shared, so we had to queue pages one by one. In the new layout everything can happen concurrently as long as there is no lock contention; there can be some, but as long as we release the locks properly, we will mostly run in parallel. And if we look at the destination, there is no change at all; it is only about the sender side and taking ownership. That is all of it, as simple as that. I do not think it is a very complicated change, but it really helps reduce the latency.

So here are some performance numbers. The test was carried out with a virtual machine with 20 vCPUs and 20 GiB of memory, running one busy random-write workload over 18 GiB. The test program is pasted here; it is a tool that I normally use for migration tests, along with a script I use to capture the page fault latencies. Note that I only capture real page faults, that is, major page faults; I do not trace minor faults, where the fault can be resolved quickly.
A major page fault here means one that generates a userfaultfd message. The result: vanilla post-copy, if you still remember, takes an average of 12 milliseconds, which is 12,000 microseconds; with the full preempt solution, solutions one, two and three applied, each page request needs only 229 microseconds on average, which is roughly a 50x speedup.

This is a quick distribution of the latencies. The top one is vanilla post-copy and the one below is preempt-full, for reference. Most vanilla post-copy page requests fall into the 8 to 16 millisecond bucket, mostly, I think, because the urgent pages get blocked behind background pages, while preempt-full in general needs only 100 to 200 microseconds.

This post-copy preemption work is done in two parts. Part one is already merged in QEMU 7.1, including solutions one and two; it already provides around a 20x speedup for random access. For post-copy preemption part two, which includes solution three only, I have an RFC posted and it is under review, so any comments are greatly welcome.

That's all of it. Thank you very much.