Welcome everyone. My name is Florian Schmidt. I'm an engineer at Nutanix, and together with my colleague Ivan Teterevkov I'll present lessons learned building a production memory overcommit solution. The title really already says what we set out to do: we wanted to create a self-adapting memory overcommit solution, which sounds easy, right? Well, obviously there are some pitfalls here, and we soon realized that we can't write everything from scratch, especially not our own memory management. So instead we decided to leverage existing technology, and it turns out that was probably a good choice, because a lot of the building blocks for such a solution already exist. We could use the Linux memory management system, cgroups for control, the virtio balloon driver inside guests, procfs for stats collection, and so on and so forth. The only thing that was really missing was a central tool that ties all of these things together, and this talk is about the design choices and challenges that we faced on the way.

Now, when we talk about memory overcommit, there are two practical solutions in wide use: ballooning and hypervisor swap. With ballooning you have a virtio driver inside the guest that allocates memory, sits on it, and gives it back to the hypervisor for use by other VMs. The advantage here is that the guest can choose which memory to give up, and the guest generally knows best. This might not even cause swapping, because the guest might choose to give back memory that isn't in use at the moment. The big disadvantage is that this requires guest cooperation: if there's no balloon driver, or the balloon driver is broken, you're out of luck. Also, the state gets lost on reboots, so memory that was handed back might suddenly be used by the VM again, breaching the contract, basically.

The other option is hypervisor swap, where you treat a VM basically like any other application: you swap out parts of it when you need more memory, and you can control how much memory each VM uses with cgroups. The big advantage here is that no guest cooperation at all is required.
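To make the cgroup part concrete, here is a minimal sketch of capping a VM's host memory, assuming cgroup v2 and that the VM's QEMU process already lives in its own cgroup directory; the directory name and the choice of memory.max are illustrative assumptions, not our exact production layout.

```c
/* Sketch: cap a VM's host memory with cgroup v2.
 * Assumes the VM's QEMU process already lives in its own cgroup,
 * e.g. /sys/fs/cgroup/vm-demo; path and file name are illustrative. */
#include <stdio.h>

static int set_vm_memory_limit(const char *cgroup_path, unsigned long long bytes)
{
    char path[256];
    snprintf(path, sizeof(path), "%s/memory.max", cgroup_path);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    /* Memory above this limit gets reclaimed (swapped out) by the kernel. */
    fprintf(f, "%llu\n", bytes);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Example: allow the VM 6 GiB of resident host memory. */
    return set_vm_memory_limit("/sys/fs/cgroup/vm-demo", 6ULL << 30);
}
```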
When we first set out to build our solution, we thought the balloon driver sounds nice, but if it's not supported by every guest, we want something that is applicable everywhere. So let's just go with hypervisor swap. Of course, hypervisor swap also has some downsides, and the biggest downside is probably its performance, especially when you are hit by a problem sometimes called double swapping. Let me explain what that means. Here we have an example setup: part of the host memory is used by the VM, and then we have swap at the host level and swap at the VM level. Now, if the host decides that the VM needs to give up memory, it can identify some idle memory and swap it out to the host swap. Shortly after, the VM might also realize that it is under memory pressure, search for some idle memory, and decide to swap it out, except that now, to swap this same page out, it first needs to swap it in, just to swap it out again to the VM swap. Even worse, if the VM is under memory pressure, chances are that to even swap this memory in, you first need to swap out some memory to swap in some memory to swap out some memory, and you can probably see where this is going: you create a lot of additional IO and can easily end up thrashing badly. The most insidious part is that the better your memory management systems work, or the closer the ones in the guest and on the host are, the more likely this is to happen, because they will identify the same idle pages.

So I think we have established that ballooning might not always be available or reliable, but hypervisor swap can have severe performance issues. The solution here is pretty straightforward: we want to combine both to get the best of both worlds, and the guiding principle is that you use ballooning where it's possible and fall back to hypervisor swap where it's necessary. Simply put, if you want to shrink a VM, you first try to balloon out some memory and only then reduce the cgroup reservation of the VM. Conversely, if you grow a VM, you first grow the cgroup limit and then you balloon in the memory that is now available. Of course, as I said, if the balloon driver isn't available or just doesn't want to comply, then eventually you give up and just use hard cgroup limits.
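To make that ordering concrete, here is a minimal sketch; set_balloon_target() and set_cgroup_limit() are hypothetical stand-ins for the real interfaces (for example the QMP balloon command and a write to the VM's cgroup memory limit), not our production code.

```c
/* Sketch of the grow/shrink ordering described above.
 * set_balloon_target() and set_cgroup_limit() are hypothetical stand-ins
 * for the real balloon interface (e.g. QMP 'balloon') and the cgroup write. */
#include <stdbool.h>
#include <stdio.h>

struct vm {
    const char *name;
    unsigned long long mem_bytes;   /* current host memory reservation */
    bool balloon_ok;                /* balloon driver present and responsive */
};

static void set_balloon_target(struct vm *vm, unsigned long long bytes)
{
    printf("%s: balloon target -> %llu bytes\n", vm->name, bytes);
}

static void set_cgroup_limit(struct vm *vm, unsigned long long bytes)
{
    printf("%s: cgroup limit -> %llu bytes\n", vm->name, bytes);
}

/* Shrink: balloon out first (the guest picks the pages), then lower the limit. */
static void shrink_vm(struct vm *vm, unsigned long long delta)
{
    unsigned long long target = vm->mem_bytes - delta;
    if (vm->balloon_ok)
        set_balloon_target(vm, target);
    /* If the balloon is missing or ignores us, the hard limit below still
     * enforces the shrink via hypervisor swap. */
    set_cgroup_limit(vm, target);
    vm->mem_bytes = target;
}

/* Grow: raise the limit first, then let the guest balloon the memory back in. */
static void grow_vm(struct vm *vm, unsigned long long delta)
{
    unsigned long long target = vm->mem_bytes + delta;
    set_cgroup_limit(vm, target);
    if (vm->balloon_ok)
        set_balloon_target(vm, target);
    vm->mem_bytes = target;
}

int main(void)
{
    struct vm demo = { "vm-demo", 6ULL << 30, true };
    shrink_vm(&demo, 1ULL << 30);
    grow_vm(&demo, 1ULL << 30);
    return 0;
}
```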
Another problem that we saw is that when a VM starts swapping, it needs memory, and it needs memory fast, because its performance is tanking. But if you have used up all your memory, then to grow one VM you first have to shrink other VMs, and shrinking a VM can be a quite slow operation: IO is of course slow, but the ballooning API is also not the fastest in the world. A solution here is to keep some buffer memory at all times that is not in use by any VM. When a VM needs memory, we can quickly grow it into that buffer, and then, asynchronously after the fact, we reclaim memory from other VMs to replenish the buffer. Of course, there's a trade-off here: we react quicker, but we reduce the overall memory efficiency, because now there's some memory around that we don't permanently assign to any VM. But this worked pretty well for us to increase the reaction speed.

Now, I talked a lot about growing and shrinking VMs, but we need to decide which VMs to grow and shrink, and for that we need some stats. If we have a balloon driver, then we're in a quite lucky position, because the balloon driver is not only a control interface: it also collects information inside the VM and makes it available to the host. For example, we can get information about how much swap-in and swap-out has happened at the VM level, which is important because from the outside we can only see IO, and we can't know whether that is swap IO or other IO. We also get information about how much memory might be reclaimable from the VM: there's a stat called "usable" provided by the balloon driver, which is defined as the amount of memory that a VM can give up before it starts swapping, and that is exactly what we want to know for our scenario.

If we don't have balloon driver information, or maybe in addition to it, we also have some stats available at the hypervisor level. Swap-in from host-level swap we can identify because the number of major faults of the QEMU process, which owns the VM memory, will increase. Conversely, there isn't really any single stat that lets you identify host swap-out, which would be quite valuable; you can try to come up with rough heuristics to estimate it, but it's tricky. And if you're asking about reclaimable memory, then what you can do is some working set size estimation, but more on that later in this talk.

So now we have ways to grow and shrink VMs, and we have stats, so now comes the point where we tie these together into an algorithm. At a basic level our algorithm works like this: we categorize a VM either as needy, which means that there's some not insignificant amount of swapping going on and we say the VM needs more memory, or as greedy, meaning it's not swapping but it has unused memory to give up. One problem is that for the greedy VMs we have an idea of how much memory is reclaimable, but conversely, we can't know how much memory a needy VM needs to stop swapping and being needy: it could be a little, it could be a lot. So the best we can do is give it some memory, then check whether that improved the situation, and if not, give it some more memory again. The algorithm goes like this: you look at the VMs running on the system and order them by their neediness. You can see here, for example, that the red VM has a lot of memory pressure, a lot of swap happening; the green VM is not swapping as much; so if you order them, you have red, green, and then the two VMs that are not showing any neediness at the moment. Then you hand out memory based on that list position, so the red VM gets more memory than the green VM, and the blue and yellow VMs give up some memory accordingly. You come up with this plan, grow and shrink the VMs accordingly, and then you rinse and repeat, over and over again. Of course, there are a lot of special corner cases and situations that you have to think about, but I will not go into detail in this talk; let's just say that this is the basic idea of the algorithm.
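As a rough illustration of one pass of that loop (not the production algorithm, which handles many more corner cases), a sketch might look like this; the stats, the step size, and the proportional hand-out by list position are all illustrative assumptions.

```c
/* Sketch of one iteration of the needy/greedy rebalancing loop.
 * "Needy" VMs (swapping) are ordered by swap pressure and receive memory;
 * "greedy" VMs (idle memory, no swapping) give some of it up.
 * Numbers and step sizes are illustrative only. */
#include <stdio.h>
#include <stdlib.h>

struct vm_stats {
    const char *name;
    unsigned long swap_rate_kbps;       /* in-guest + host swap activity */
    unsigned long long reclaimable;     /* estimate of memory it could give up */
};

static int by_neediness(const void *a, const void *b)
{
    const struct vm_stats *x = a, *y = b;
    /* Highest swap activity first. */
    return (y->swap_rate_kbps > x->swap_rate_kbps) -
           (y->swap_rate_kbps < x->swap_rate_kbps);
}

int main(void)
{
    struct vm_stats vms[] = {
        { "red",    5000, 0 },
        { "green",  1200, 0 },
        { "blue",      0, 2ULL << 30 },
        { "yellow",    0, 1ULL << 30 },
    };
    size_t n = sizeof(vms) / sizeof(vms[0]);

    qsort(vms, n, sizeof(vms[0]), by_neediness);

    /* We can't know how much a needy VM really wants, so we grant a step,
     * observe the effect, and repeat on the next iteration. */
    unsigned long long step_mib = 512;   /* illustrative step */
    for (size_t i = 0; i < n; i++) {
        if (vms[i].swap_rate_kbps > 0)
            printf("grow   %-7s by %llu MiB\n", vms[i].name,
                   step_mib * (unsigned long long)(n - i));
        else if (vms[i].reclaimable > 0)
            printf("shrink %-7s by up to %llu MiB\n", vms[i].name,
                   vms[i].reclaimable >> 20);
    }
    return 0;
}
```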
Okay, I will now hand over to my colleague Ivan for the rest of the talk.

Hello, and let's talk a little more about the metrics and stats we collect about the running VMs. As Florian said, we have quite a few available at the host level, such as procfs files, and we also use the virtio balloon driver stats reported about the in-guest situation. This is useful. However, the virtio balloon drivers are not always installed in guests, or they may be malfunctioning for some reason. In this case, we've been looking for more stats at the host level to get more insight into the in-guest workload, and the working set size is one of them. It's fairly accurate and it's reliable, always there; we can trust it. Practically, a high estimate means that the VM needs some memory. There are some corner cases, but they are rare and we handle them.

The metric is based on Idle Page Tracking, available in the Linux kernel. As many of you know, it's a bitmap interface indexed by the page frame number, the PFN, within the host address space. We could of course check all the pages available on the host and get the working set size estimate for the VMs, but this is not practically achievable at scale, with hundreds of gigabytes or terabytes of memory, so we need to do some sampling. We've been considering two approaches. The first one is to sample the guest address space with the pagemap interface: you take a sample set within the guest address space, map it to host PFNs, and feed it to Idle Page Tracking. This is possible, but it comes with a computational overhead: the more guest address spaces you have, the more CPU cycles you need to burn to resolve the PFNs, and things get worse with overcommit because you fit even more guests into the same host address space. The second approach is to take a PFN sample within the host address space and attribute it to the running VMs with the kpagecgroup interface. That is another binary file that tells you that a given PFN is accounted to a given cgroup partition within the memory controller, and this works in our case because we put each VM into its own cgroup partition. The computational overhead is still there, but it's constant and doesn't depend on the number of running guests. So we chose this approach. We create a sample set of uniformly distributed PFNs within the host address space, map it to the running VMs via the cgroup inodes from the kpagecgroup interface, feed it to Idle Page Tracking, and get the raw report. This report is inherently noisy because we use a random sample set, so we need to do some post-processing: we use a moving average to get a fairly stable and settled metric, and this metric is used along with the others to decide whether a VM needs memory or could give some away.
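A stripped-down sketch of that sampling path might look like the following; it handles a single PFN, needs root, and leaves out error handling, the actual sampling loop, and the moving average.

```c
/* Sketch: attribute one sampled host PFN to a VM cgroup and read its idle bit.
 * /proc/kpagecgroup holds one u64 per PFN: the inode of the memory cgroup
 * charging that page.  /sys/kernel/mm/page_idle/bitmap packs 64 PFNs per u64;
 * a set bit means the page has not been accessed since the bit was written. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    unsigned long pfn = 0x123456;   /* one PFN from the uniform sample set */

    int kpc = open("/proc/kpagecgroup", O_RDONLY);
    int idle = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
    if (kpc < 0 || idle < 0)
        return 1;

    /* Which cgroup (and therefore which VM) owns this page? */
    uint64_t cgroup_inode = 0;
    pread(kpc, &cgroup_inode, sizeof(cgroup_inode), pfn * sizeof(uint64_t));

    /* Mark the page idle now ... */
    uint64_t word = 1ULL << (pfn % 64);
    pwrite(idle, &word, sizeof(word), (pfn / 64) * sizeof(uint64_t));

    /* ... and one sampling interval later, read the bit back:
     * still set => not touched => counts as idle for that VM's working set. */
    pread(idle, &word, sizeof(word), (pfn / 64) * sizeof(uint64_t));
    int is_idle = !!(word & (1ULL << (pfn % 64)));

    printf("pfn %lx: cgroup inode %llu, idle=%d\n",
           pfn, (unsigned long long)cgroup_inode, is_idle);

    close(kpc);
    close(idle);
    return 0;
}
```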
Now let's talk about the issues related to live migration. Firstly, I want to highlight that we use shared memory. This is because the storage data path for the running guests is handled by a separate process, which implements the in-house developed storage data fabric at Nutanix, and it shares memory with QEMU. This is important because shared memory handling differs from private memory handling in the Linux kernel, and the deeper you dive into the kernel source code, the more differences you realize there are.

Now let's consider an example. We have a guest, and let's say 75% of its memory is present in the host address space and 25% is not mapped, either because it was ballooned out or never accessed; this is possible for Linux guests. For simplicity, let's say the host swap is not used in this case. The cgroup limit is in place and is nicely aligned with the memory currently present in the host address space. Now let's migrate the VM with QEMU from source to destination. QEMU must read the entire address space of the guest at least once and transfer either the content, or a control message saying that the page is zero, to the destination. In doing so, it also accesses the pages that are not mapped, and as a result of the check whether the page is zero or not, the kernel handles a minor page fault, allocates a new zero page, and gives it to QEMU, which then deals with it. When this happens, the LRU policy kicks in, and inactive pages that were present in memory get pushed out in favor of the zero pages just allocated for QEMU's zero check. As a result we have a few problems. At the end of the iteration we have unnecessarily allocated zero pages; we didn't want them in the first place. We also had some swap IO happening that we could have avoided altogether. And we also mangled the VM's working set, because if the guest later wants to retrieve those pages, that is an even further performance drag during the live migration. This could be avoided altogether, and there is a way to work around it.

There is a fairly recently introduced hint to the madvise system call, MADV_COLD, which comes with Linux kernel 5.4. Essentially, it allows user space to say: hey kernel, I have this address range, I don't use it, it's cold; whenever memory pressure occurs, please take it away. This is what we're using in QEMU. We go to the is-zero page check, and before actually performing the check, we consult the pagemap interface and check whether the page is present or not. If it was not present, and it reads as zero, then we can reliably conclude, well, more or less reliably conclude, that it was only allocated as a result of the zero-page check, so we definitely want to get rid of it. We mark it with the madvise call, and whenever pressure occurs, the kernel simply discards it, because it's a clean zero page. So this is how we work around it.
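A minimal user-space sketch of that workaround (not the actual QEMU patch) might look like this; buffer_is_zero() here is a naive stand-in for QEMU's real zero check.

```c
/* Sketch of the zero-page workaround described above: before the zero check,
 * look the page up in /proc/self/pagemap; if it was not present and still
 * reads as zero afterwards, it only exists because of the check, so hint
 * MADV_COLD and let reclaim drop it for free. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLD
#define MADV_COLD 20            /* Linux >= 5.4 */
#endif

static int page_is_present(int pagemap_fd, void *addr, long page_size)
{
    uint64_t entry = 0;
    off_t off = ((uintptr_t)addr / page_size) * sizeof(entry);
    pread(pagemap_fd, &entry, sizeof(entry), off);
    return (entry >> 63) & 1;   /* bit 63: page present in RAM */
}

static int buffer_is_zero(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++)
        if (p[i])
            return 0;
    return 1;
}

/* Returns 1 if the page is zero (and can be sent as a zero-page message). */
static int check_zero_page(int pagemap_fd, void *addr, long page_size)
{
    int was_present = page_is_present(pagemap_fd, addr, page_size);
    int zero = buffer_is_zero(addr, page_size);   /* this faults the page in */

    if (!was_present && zero) {
        /* The zero page was allocated purely for this check: make it the
         * first thing reclaim throws away instead of swapping real data. */
        madvise(addr, page_size, MADV_COLD);
    }
    return zero;
}

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    int pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
    void *buf = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (pagemap_fd < 0 || buf == MAP_FAILED)
        return 1;

    check_zero_page(pagemap_fd, buf, page_size);

    munmap(buf, page_size);
    close(pagemap_fd);
    return 0;
}
```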
The second problem related to this configuration is the following. Let's say we have a guest with a similar config, but it has used the host swap before the live migration. In this case, whenever QEMU accesses the swapped-out pages, it replaces some cold pages present in memory with even colder ones from the host swap, and it mangles the working set in a similar way to the previous configuration. From the guest's point of view this looks somewhat incorrect: it has some working set, and now part of it is being replaced, cold memory being replaced with even colder memory, which is not what we want. We could also have avoided the swap-out of the cold pages that are currently present in memory for the guest. A solution to this problem is using the MADV_PAGEOUT hint to the madvise system call. It's similar to the previous one and also comes with the 5.4 kernel, and it allows us to say: hey kernel, I don't want these pages, please page them out. The approach is fairly similar: at the zero-page check, the swapped-out page gets paged in and transferred to the destination, and afterwards we call MADV_PAGEOUT on it. Since the page was already paged out and its entry is still present in the host swap, the kernel simply discards it, which avoids the extra swap-out of those pages and gives us a performance improvement.

A further problem with shared memory is that the pagemap interface does not report the PM_SWAP bit for shared memory pages; this is not a problem for private memory. So from the user-space point of view, with shared memory it is indistinguishable whether a page is swapped out or was never allocated, and that is a problem in our case. There are ways to improve this. Firstly, we currently patch the Linux kernel and extend the pagemap implementation to query the swap cache; this comes with some marginal performance degradation, but it's not a problem at all. The second approach is to use lseek and skip never-allocated pages with SEEK_DATA or SEEK_HOLE, plus the mincore system call, which is useful; we've been considering it.

Also, let's say we have a guest that has some memory swapped out to the host swap. Whenever we do the live migration, we probably don't want to page in and transfer those pages. If we had a shared host swap file, we could simply attach it to the destination and avoid transferring the pages altogether; we could transfer only the metadata. However, the interface around the host swap is somewhat limited, and my colleagues at Nutanix are currently working on this feature; hopefully we'll see them present it at the next KVM Forum.

At this point, I want to show you the last slide: lessons learned. The important lesson is that swapping and ballooning alone don't work; each has performance issues, and the hybrid approach is the way to go for us. We also considered plenty of metrics, from which we deduce need and greed, and we implemented a custom algorithm on top of that. We learned a very important bit: shared memory isn't private memory; it differs substantially, and the more you dive into the sources, the more differences you find. There are also corner cases around the implementation of certain user-space and kernel-space tools. There are ways to improve QEMU, and there are ways to improve the swap and memory subsystems in the kernel. Nevertheless, the Linux ecosystem around virtualization is decent, and it provides a solid foundation for memory overcommit; what's needed is just to tie all these pieces together with some sort of control plane, and that essentially allows you to build memory overcommit on Linux. This is how we did it for our customers, and now it works in an enterprise-grade product. Linux is cool. Thanks for watching; we hope you also learned some lessons with us. Stay safe, stay tuned, goodbye.