Hi everyone, I am Manish, and along with me is Sium. We work on the AHV hypervisor at Nutanix. Our presentation today is about some exploratory work we did on evaluating Intel's sub-page protection feature in the context of live migration.

This is our agenda for today's presentation. I will start with the kind of issues we are trying to tackle in live migration and how something like SPP can help there, then what SPP is, how we integrated SPP into the QEMU/KVM live migration flow, and what kind of initial results we got. After that: what is currently pending, possible challenges with using SPP for live migration, and what we have in the bucket for future work in this area.

Everyone must have faced non-converging live migrations and understands how frustrating they are. Sometimes, even when a live migration completes, it can take hours. In pre-copy live migration, the time taken mostly depends on two factors: the amount of data that needs to be transferred, and the rate at which it can be transferred. The transfer rate is limited by the available network bandwidth in most cases, and we cannot do anything about that. A live migration can take a huge amount of time because either the guest itself is dirtying memory at a very high rate, or the network bandwidth is too low to migrate the workload even at a moderate dirty rate. Low network bandwidth is very common for live migrations that go over a wide area network, and we especially want to improve those cases where we have to migrate a VM over a WAN.

Now, how can we reduce the time to migrate in these cases? We can basically try to reduce the amount of data that needs to be transferred. But how? We can try the different compression algorithms that are currently present in QEMU, but all of those have significant side effects, either in terms of CPU utilization or memory overhead, and they may not be that effective depending on the workload. We want to evaluate a new SPP-based approach to reduce the amount of data, and evaluate what kind of pros and cons it has: for what kinds of workloads it works well or not so well, and what potential side effects it can have in terms of guest performance as well as host overheads. Also, SPP need not be an alternative to the existing compression algorithms; compression can work on top of SPP too. There was a similar exploration done by Yosuke Ozawa and Takahiro Shinagawa, but it was not done on actual hardware and was emulated; we have put a link in the references if someone wants to read that paper.

So what is SPP? SPP, or sub-page protection, is a new feature from Intel starting with Ice Lake servers. It allows us to enable or disable write protection on sub-pages of 128 bytes: basically, every 4K page has 32 sub-pages of 128 bytes each.
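To make the sub-page geometry concrete, here is a minimal C sketch of our own (not code from the SPP patches) that maps a guest-physical address to its 4K frame number and its 128-byte sub-page index:

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT        12                                    /* 4K pages           */
#define SUBPAGE_SHIFT     7                                     /* 128-byte sub-pages */
#define SUBPAGES_PER_PAGE (1u << (PAGE_SHIFT - SUBPAGE_SHIFT))  /* 32                 */

/* 4K guest frame number (GFN) containing gpa. */
static uint64_t gpa_to_gfn(uint64_t gpa)
{
    return gpa >> PAGE_SHIFT;
}

/* 128-byte sub-page index (0..31) of gpa inside its page. */
static unsigned gpa_to_subpage(uint64_t gpa)
{
    return (gpa >> SUBPAGE_SHIFT) & (SUBPAGES_PER_PAGE - 1);
}

int main(void)
{
    uint64_t gpa = 0x12345a80;   /* arbitrary example guest-physical address */

    printf("gpa 0x%llx -> gfn 0x%llx, sub-page %u of %u\n",
           (unsigned long long)gpa, (unsigned long long)gpa_to_gfn(gpa),
           gpa_to_subpage(gpa), SUBPAGES_PER_PAGE);
    return 0;
}
```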
So how does it work? SPP provides an extra set of page tables, the SPPT, alongside the EPT, for storing sub-page access permissions for a GFN. You need to set the SPPT pointer in every VMCS, similar to the EPTP. The level-1 (leaf) entry of the SPP page tables contains the access permission vector for that specific GFN. An L1 entry of the SPP table is a 64-bit entry, basically 2 bits for every 128-byte sub-page. Out of every 2 bits, one is reserved and the other indicates the write access of that sub-page: a value of 1 means write access is allowed on that sub-page, otherwise it is not allowed.

Now, how do we come to know whether sub-page tracking is enabled for a GFN? The L1 entry for every GFN in the EPT has an SPP bit. If the SPP bit is set, sub-page tracking is enabled for that GFN, and we need to traverse the SPPT to get the write permission for the sub-page. In the normal, older workflow there was only GFN-level write protection: you could unset the write bit in the L1 entry of the EPT page tables and enable write protection on the full page. With SPP support you can enable sub-page level protection for any GFN by setting the SPP bit along with unsetting the write bit in the EPT, and then control sub-page level access by setting or clearing the access bits for every sub-page in the L1 entry of the SPPT.

SPP works only with 4K pages, so if you have large-page mappings, you first need to break those pages and then you can enable SPP protection. Also, sub-page protection is active for a GFN only when write protection is enabled and the SPP bit is set in the EPT for that GFN; otherwise the SPP bit is ignored. Bit 62 in the L1 entry of the EPT indicates the SPP bit. Any VM exit due to SPP protection still comes through an EPT violation, and you need to figure out, by checking the page tables, whether the violation was due to normal write protection or sub-page level protection, and act based on that. But if the SPP page tables are not properly configured, we get a new VM exit, the SPPT misconfig.

Now, how can SPP be useful for live migration? With SPP we can do dirty tracking at sub-page level. If a workload dirties a 4K page only partially, we don't need to transfer the full page to the destination; we should be fine transferring only the few sub-pages that were dirtied. This can significantly reduce the amount of data that needs to be transferred to the destination: SPP can reduce it by a factor of anywhere between 1 and 32, depending on the access pattern of the workload.
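Going back to the SPPT leaf layout described above (a 64-bit entry with 2 bits per 128-byte sub-page), here is a minimal illustrative sketch of reading and updating the write-allow bits. We assume the even bit of each pair is the write-allow bit; the Intel SDM and the KVM SPP patches are the authoritative reference for the exact encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustration only: an SPPT leaf ("L1") entry modeled as a plain
 * 64-bit value, 2 bits per 128-byte sub-page (32 sub-pages per 4K
 * page).  We assume the even bit of each pair is the write-allow bit
 * and the odd bit is reserved.
 */
static inline bool spp_subpage_writable(uint64_t sppt_leaf, unsigned subpage)
{
    return (sppt_leaf >> (2 * subpage)) & 1;
}

static inline uint64_t spp_set_subpage_writable(uint64_t sppt_leaf,
                                                unsigned subpage, bool allow)
{
    uint64_t bit = 1ull << (2 * subpage);

    return allow ? (sppt_leaf | bit) : (sppt_leaf & ~bit);
}
```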
Now, what kind of changes or effort is required on the implementation side? I will start with KVM. On the KVM side we already had some initial base patches, thanks to Yang Weijiang, which already had most of the SPP support: they provided ioctls and routines to set or unset SPP access permissions on any range of memory, even for memory backed by large pages, a handler for the SPPT misconfig, and page fault modifications to dynamically apply SPP protection so that the set and unset ioctls or APIs are not overloaded. We just had to integrate those patches with the live migration related workflow.

On top of the base patches by Yang, we had to do a few additional things to use SPP with live migration; some of those are listed here. First, rebasing the patches: they were written for kernel 5.1, and we rebased them to kernel 5.10; we will soon be doing the same for the TDP MMU and upstream master. We also found a few bugs in the base patches while testing with QEMU at scale, which were probably hidden by the self-tests, and fixed them. Then we integrated enabling SPP protection through set memory region when dirty logging is enabled, and also integrated it with the get dirty and clear dirty ioctls, to fully fit SPP into the live migration workflow. Then there is managing dirty bitmaps of different sizes depending on whether SPP is enabled or not: we replaced most of the mark page dirty calls with mark sub-page dirty wherever we could safely and confidently do it; in the other places, if SPP is enabled and mark page dirty is called, we mark all sub-pages of that page dirty. That is the kind of bitmap management we had to do. We also optimized TLB flushes by batching them and by flushing only the required ranges instead of doing full TLB flushes, and tried to move TLB flushes out of critical sections wherever possible. Then there were other small fixes, like handling vCPU and memory hotplug after SPP is enabled.

This is what the current KVM-side flow looks like. First, we have an SPP capability; if it is enabled, live migration will use SPP. While enabling the SPP capability, we update the SPPT pointer for all existing vCPUs and initialize metadata for all existing memslots. That metadata holds the access permission vectors for every GFN in the memslot; based on what we have in the memslot SPP metadata, we can later populate the EPT as well as the SPP tables. If new vCPUs are added, or memslots are added or deleted, we handle all those initializations dynamically.

Now, set memory region: based on whether KVM_MEM_LOG_DIRTY_PAGES is present in the memslot flags, we enable or disable SPP write protection on the memory slot. If we want to enable SPP dirty tracking, we update all SPP metadata for that slot to disable write access on all the sub-pages. We also set the SPP bit for a GFN if a 4K L1 entry is present in the EPT; if it is not present, this is done dynamically while handling a page fault on that GFN. If memory in that slot is backed by large pages, we invalidate those mappings so that they can be converted into 4K entries on a page fault. We also invalidate all SPP page tables if they are already set up, so that we get an SPPT misconfig and set up the page tables there; we never populate SPP tables from EPT page faults, it is always done via the SPPT misconfig. We do similar things when the dirty log is disabled on a memslot: we update the metadata, remove the SPP bit from all 4K entries, and finally invalidate the SPP tables. Again, most of this flow, other than the live migration part, was already covered by Yang in his base patches.

During the clear dirty or get dirty ioctl, we don't need to make any update in the EPT; we just need to update the SPP metadata in the memslot and invalidate the SPP tables, so that they can be rebuilt with the latest access rights on the next SPPT misconfig.

Now, what happens when there is a direct page fault, one which cannot be handled by the fast page fault path? If a 4K-level mapping is established in the direct page fault, we also check the SPP-level access vector for that GFN, and based on that we can update the SPP bit in the EPT page table entry for that GFN. Then, this is what is modified in the fast page fault path for SPP. If you are in a fast page fault, that means the EPT page table mappings were already done and the fault was due to write protection. If there is a page fault with the SPP bit set, that means the SPP table is already set up for that GFN by some earlier SPPT misconfig, so in that case we just need to flip the access bit for that specific sub-page in the SPP tables. Basically, once a sub-page is marked dirty for a live migration iteration, we can disable write protection on that sub-page until the next get dirty or clear dirty ioctl is called. We get an SPPT misconfig for a GFN if the SPP tables are not mapped down to the last level for that GFN; when that happens, we initialize all the SPP page table levels down to the leaf.
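As a rough, self-contained illustration of the bitmap handling and the dirty-log cycle described above (32 dirty bits per 4K page, harvested and reset by the get/clear dirty ioctls), here is a sketch; all names are ours, not the identifiers used in the actual KVM patches:

```c
#include <stdint.h>
#include <string.h>

#define SUBPAGES_PER_PAGE 32u

/* One 32-bit word of sub-page dirty bits per page in a memslot. */

/* Called from the (sketched) write-fault path: record one dirty sub-page. */
static void mark_subpage_dirty(uint32_t *bitmap, uint64_t rel_gfn, unsigned subpage)
{
    bitmap[rel_gfn] |= 1u << subpage;
}

/* Fallback where a full-page dirty marking cannot be avoided:
 * mark every sub-page of the page dirty. */
static void mark_page_dirty_full(uint32_t *bitmap, uint64_t rel_gfn)
{
    bitmap[rel_gfn] = UINT32_MAX;
}

/* GET/CLEAR_DIRTY_LOG-style harvest: copy the bits out and reset them,
 * after which write protection is re-armed for the next iteration. */
static void harvest_and_clear_dirty(uint32_t *bitmap, uint32_t *out, size_t npages)
{
    memcpy(out, bitmap, npages * sizeof(*bitmap));
    memset(bitmap, 0, npages * sizeof(*bitmap));
}
```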
So this covers most of the KVM-side workflow. Now Sium will take over to explain the QEMU-side workflow and the remaining sections.

Hi everybody. Let's discuss the QEMU support for SPP now. This is a brief overview of what we have implemented on the QEMU side. To summarize, we have added code changes to add and manage a separate bitmap for sub-page level tracking, and we have kept the number of bitmap copies to just two, rather than the three used for page-level tracking; we will discuss the reasons for this in later slides. We have also added support for transmitting and receiving data at sub-page granularity.

Let's discuss some results we have obtained. These are just preliminary results, and we are looking forward to sharing more results with the community in the near future. For now we have tested our implementation on two different workloads, the Nutanix PCVM and a kernel build process, and this is just to highlight how workload behavior can dictate the effectiveness of sub-page level protection. The Nutanix PCVM is a VM which runs workloads that manage distributed systems in Nutanix clusters, and it drives a good enough dirty rate of around 0.5 gigabytes per second. The second workload is a kernel build process, which can drive a very high dirty rate.

We will discuss the results on these workloads, but before that, let's plan out our evaluation: how do we evaluate the gains or losses with SPP? We start by observing the memory access pattern of the workload, considering only writes. What the memory access pattern means here is: when the workload dirties a page in a given iteration, does it dirty most of its sub-pages or just a few of them? That becomes an important question. Another important question is how much we will be able to reduce the total volume of data to be transferred across the live migration, which can be answered directly from the memory access pattern information. One more important question is how the network throughput is impacted with SPP, because there will be an overhead in transferring data at sub-page granularity, which is a smaller unit of transfer. Once we have these two metrics, the reduction in total volume of data to be transferred and the impact on network throughput, we are in a good position to measure the improvement in time to migrate, and this becomes the basis of our comparison of live migration with and without SPP support. Note that our implementation, as of the day this video is being recorded, is still not very stable, so some of the data we have obtained is from offline calculations.

These two histograms represent the memory access patterns for the Nutanix PCVM and the kernel build respectively. Basically, they show for what percentage of the dirtied pages only one sub-page was dirtied, for what percentage two sub-pages were dirtied, and so on. You can see that for the Nutanix PCVM, for more than half of the dirtied pages only a few sub-pages were dirtied, but for the kernel build process, for most of the dirtied pages almost all the sub-pages were dirtied. It is pretty clear from these diagrams that sub-page protection can be very effective in workloads like the Nutanix PCVM, where, for most of the pages that get dirtied, only a few sub-pages get dirty.
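Given such an access pattern, the saving comes from transmitting only the dirty 128-byte chunks of each page. Here is a hedged sketch of what the sending side might look like; the wire format (a mask followed by the dirty chunks) and the send_buf callback are our own illustration, not QEMU's actual migration stream format:

```c
#include <stddef.h>
#include <stdint.h>

#define SUBPAGE_SIZE      128
#define SUBPAGES_PER_PAGE 32

/*
 * Send a partially dirty 4K page at sub-page granularity.  The
 * destination uses the mask to know which 128-byte chunks follow and
 * where to place them inside the page.
 */
static void send_dirty_subpages(uint64_t gfn, const uint8_t *page,
                                uint32_t dirty_mask,
                                void (*send_buf)(const void *, size_t))
{
    send_buf(&gfn, sizeof(gfn));                 /* which page             */
    send_buf(&dirty_mask, sizeof(dirty_mask));   /* which sub-pages follow */

    for (unsigned i = 0; i < SUBPAGES_PER_PAGE; i++) {
        if (dirty_mask & (1u << i))
            send_buf(page + (size_t)i * SUBPAGE_SIZE, SUBPAGE_SIZE);
    }
}
```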
From the memory access pattern we can deduce the percentage reduction in the data to be transferred across the live migration. We can see that for the PCVM the data to be transferred decreases by around 60%, which is great, but for the kernel build it is around 15%, which is decent but might not be good enough.

Now let's compare the time to migrate with and without SPP for both workloads. You can see that for the Nutanix PCVM we get a significant decrease in time to migrate with SPP, while for the kernel build the decrease is not that significant. It is very clear from this that for workloads like the Nutanix PCVM, live migration with SPP will be pretty effective. Also note that the effective throttle rate decreases with SPP, and this is because the throttling logic is currently flawed for SPP. We need to adjust the throttling logic so that the effective throttle rate with SPP matches the level without SPP; once we do that, we expect the time to migrate with SPP to go down further, so the improvement will increase.

Now let's take some time to discuss the limitations of SPP in the live migration context, starting with cases where the network throughput is very high. In these cases there will also be a reduction in the total volume of data to be transferred across the live migration, just as in the low-throughput cases, but this gain in terms of reduced data will be overpowered by the impact on network throughput, due to the overhead involved in transferring data at a smaller granularity. You can see in this chart how the maximum network throughput drops significantly with SPP, and this shows that for cases with high network throughput, live migration with SPP might not be very effective. Live migration with SPP may not be effective with multifd either, because multifd already features high CPU utilization, and this CPU utilization can increase further, up to four times, with SPP. Lastly, SPP is not possible with zero-copy as of now, because we have to maintain 4K-level page mappings with zero-copy, and zero-copying at sub-page level is not possible.

So what are the challenges with SPP? The two major challenges are the memory overhead and the impact on guest performance. With SPP we will have to maintain a 32 times larger bitmap, and we will have to maintain extra SPP page tables for sub-page level tracking on the KVM side; these can be pretty huge for large-memory VMs. Also, if the bitmap is large, the time spent in bitmap sync increases significantly, and because of that we will be holding the MMU lock for a much longer period, which means that in extreme cases the VM can even hang due to this large bitmap size. There will also be a huge impact on guest performance, because with sub-page level tracking the number of VM exits can increase by up to 32 times, and PML is not supported with SPP, which makes it even worse. To measure the impact on guest performance, we gathered this data with a Redis benchmark: you can see that the throughput decreases significantly with SPP, and the latency also increases significantly. This is a big challenge, and we will discuss in later slides how we can tackle it.
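To put the 32-times-larger bitmap in perspective, a quick back-of-the-envelope calculation for a hypothetical 1 TiB guest (our own arithmetic, not a measured result):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t guest_mem      = 1ull << 40;        /* example: 1 TiB guest   */
    uint64_t pages          = guest_mem >> 12;   /* number of 4K pages     */
    uint64_t page_bitmap    = pages / 8;         /* bytes at 1 bit/page    */
    uint64_t subpage_bitmap = pages * 32 / 8;    /* bytes at 32 bits/page  */

    printf("page-level bitmap:     %llu MiB\n",
           (unsigned long long)(page_bitmap >> 20));    /* 32 MiB   */
    printf("sub-page level bitmap: %llu MiB\n",
           (unsigned long long)(subpage_bitmap >> 20)); /* 1024 MiB */
    return 0;
}
```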
So how can SPP help? SPP can help converge those edge cases where the network bandwidth is very poor, provided the workload behaves in a way that it dirties only a few sub-pages when it dirties a page; in those cases SPP can be pretty effective.

The implementation we have right now surely has a lot of scope for improvement, and that is what we will discuss in the upcoming slides. But before that, I want to highlight the current status of our work. We are working on those basic optimizations, and once they are done, we look forward to testing our implementation rigorously on different benchmark workloads. Once that testing is done, we will send our kernel and QEMU patches for review to the open source community.

Our future work will mostly be around answering this question: how do we counter the large bitmap size and the degraded guest performance? One approach would be to increase the granularity of dirty tracking to a group of consecutive sub-pages, for example four consecutive sub-pages, or 512 bytes, which means our bitmap size would decrease by four times and the number of VM exits would also decrease by up to four times; that can help us counter both the large bitmap size and the guest performance problem. Another approach would be to selectively track dirtying at sub-page level: for example, we start with sub-page level dirty tracking for each page, but once we realize that a few contiguous sub-pages have been dirtied for a given page, we switch to normal page-level dirty tracking for that page; how effective this is depends entirely on the behavior of the workload. Another space optimization would be to use a hash map for tracking the dirtying of sub-pages: we would maintain a hash map indexed by the sub-page number, and that can help us save a lot of space. Another thing we are looking forward to is reducing the number of bitmaps maintained by QEMU further, to just one; that would eliminate the bitmap sync-time overhead on the QEMU side and also decrease the memory footprint of maintaining an extra bitmap. Another very important thing we are looking forward to is integrating our implementation with the dirty ring, which can help us reduce the bitmap sync time and also the memory footprint on the kernel side, because we can get rid of the kernel copy of the bitmap.

So we are majorly focusing on two ideas. One of them: we know that live migration with sub-page protection is not a universal solution, so we want some sort of intelligence in our live migration algorithm, so that we can dynamically assess a live migration situation and turn sub-page protection on or off. The other idea we are working on is to dynamically determine the optimal sub-page protection granularity. For example, we said that we can increase the granularity of dirty tracking to four consecutive sub-pages just to decrease the bitmap size and improve guest performance by reducing the number of VM exits; but whether 512-byte granularity is actually optimal is something we want to ask every time a live migration is going on: what is the optimal granularity of dirty tracking for this particular live migration?
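As a small, purely illustrative sketch of the coarser-granularity idea (collapsing four 128-byte sub-pages into one 512-byte tracking unit):

```c
#include <stdint.h>

/*
 * Collapse a 32-bit sub-page (128-byte) dirty mask into an 8-bit mask
 * where each bit covers 4 consecutive sub-pages (512 bytes).  The
 * coarse bitmap is 4x smaller, at the cost of transferring up to
 * 512 bytes when only one 128-byte sub-page was actually dirtied.
 */
static uint8_t coarsen_mask_to_512(uint32_t mask128)
{
    uint8_t mask512 = 0;

    for (unsigned g = 0; g < 8; g++) {
        if (mask128 & (0xFu << (4 * g)))   /* any of the 4 sub-pages dirty? */
            mask512 |= (uint8_t)(1u << g);
    }
    return mask512;
}
```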
Both of these ideas work on the two key metrics we have been discussing all along in our presentation: the memory access pattern and the impact on network throughput. With an increased granularity it is likely that we will be able to reduce the impact on network throughput, but at the same time the volume of data to be transferred will also increase. Using these two key metrics, we can come up with a trade-off, so that we optimize for total migration time while not impacting guest performance significantly and not wasting too much memory or time on large bitmaps.

So this is all from our side. Thank you so much for attending this presentation; I am looking forward to questions, feedback and suggestions from you. This is our team. Thank you.