Hi, everybody. Welcome to this KVM Forum 2020 presentation about managing guest memory using virtio. My name is David Hildenbrand, and today I'm giving this talk together with Michael Tsirkin. Hi, I'm Michael Tsirkin and I'm a distinguished engineer at Red Hat. Okay, so let's get started. What do we actually mean when we talk about managing guest memory? Usually there are four different things we want to achieve with the virtual memory of our guest. First of all, we often want to speed up migration. If you take a look at a virtual machine from the hypervisor point of view, any memory is possibly worth migrating because it might contain important data. But in reality, there is often quite some memory sitting inside virtual machines that is actually not worth migrating, for example if it's simply free memory. Of course, it's not that easy to identify that memory from the hypervisor and be sure that you don't lose any important data when migrating, so you need some kind of handshake with your guest. The second item is that we often have over-commitment of memory and want to avoid host swapping by any means. That means whenever our hypervisor is running out of memory, instead of going to swap, we would much rather temporarily steal unused memory from virtual machines, because in practice it happens quite often that some virtual machines have quite some unused or free memory lying around that we can use instead of swapping. The third item is that we often want to control or shrink the page cache in the virtual machine, and sometimes other caches as well. The nature of a modern operating system is that it will try to make the best use of all available memory, and that implies using it for caches. In Linux, this is for example done by the page cache, which will essentially consume most of your main memory over time. Of course, some data in caches can be dropped without affecting any workload. But from a hypervisor point of view, it's absolutely not clear which memory that might be used for a cache inside a virtual machine can actually be dropped, and there's also no real interface to drop these caches. And last but not least, we often want to dynamically resize virtual machine memory. That means we want to hot plug or hot unplug memory from virtual machines, either automatically, for example if our virtual machine runs out of memory, or manually by user request. And this also needs some kind of cooperation from the guest to make it work. Now, the traditional mechanism to do all of these things is memory ballooning, and just to give you a recap of what memory ballooning actually is: it can be summarized as relocating physical memory between a virtual machine and its hypervisor. The idea is actually pretty simple. In the virtual machine memory, you have something called the balloon, and the balloon can inflate or deflate. All memory that is currently inflated inside of the balloon is not usable by the virtual machine but by the hypervisor instead. That means when we inflate the balloon, we give more memory back to the hypervisor and take it from the virtual machine. The implementation in an operating system is also pretty simple. There is a driver running in the guest operating system which simply allocates memory and coordinates with the hypervisor, and when it wants to get some memory back for deflation, it simply frees previously allocated memory after coordinating with the hypervisor.
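As a rough illustration of how this maps to the hypervisor side (this sketch is not part of the talk; the device and monitor commands are the ones QEMU provides, the sizes are just placeholders), the balloon device is added like any other virtio device, and the balloon target is then adjusted through the monitor:

    # Give the guest a virtio-balloon device (any reasonably recent QEMU).
    qemu-system-x86_64 -m 8G \
        -device virtio-balloon-pci,id=balloon0 \
        ...

    # In the QEMU monitor (HMP): ask the guest to shrink to 6144 MiB of
    # usable memory, i.e. inflate the balloon by roughly 2 GiB.
    (qemu) balloon 6144

    # Query how much memory the guest currently has (the "actual" value).
    (qemu) info balloon

Note that the HMP balloon command takes the desired guest memory size in megabytes, not the balloon size itself.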
In all these mechanisms, the size of the balloon is controlled by a so-called balloon target size, which corresponds to a request from the hypervisor towards the virtual machine to change the size of the balloon. Now, this idea is pretty neat and it has been used for decades already, and this is also why it has been used for all of the use cases we just saw. So, for example, when you want to dynamically resize the virtual machine memory, you could dynamically inflate or deflate the balloon. And also for all of the other items I mentioned, you might be able to use it to some extent. I'm not going to go into detail here because there isn't sufficient time to cover all of the details, but there are a lot of issues, and Michael will talk about at least one issue regarding migration. So, what do we want to do instead? Of course this is not optimal, so what we see are plenty of extensions or new mechanisms to make the whole thing work. One part is extensions to the existing virtio-balloon, giving it more interfaces or better suited interfaces to get the job done, and Michael will talk about these. The other part is new mechanisms, new virtio devices, on the one hand virtio-pmem and on the other hand virtio-mem, which I will talk about after Michael has discussed the virtio-balloon extensions. So, let's try to migrate a guest. Consider the example on this slide. We start with an 8GB virtual machine, and before migrating it we inflate the balloon to 4GB; as a result, only 4GB need to be migrated. After migration, the balloon is deflated. Here we immediately encounter problems: how is the balloon size determined? If we inflate too much, the guest will slow down; if we inflate too little, then migration will take longer. To address this issue we can give the guest more control over the balloon, and several ideas had to come together to result in our current solution. First, to inflate the balloon, we can make it as big as possible, filling up all of the free memory. But then, naturally, the guest's needs change, so a second idea is to let the guest deflate at any time if that happens. The third idea is that the first thing the guest does with a free page is to write some data into it, because it's free, so it has nothing in it so far. And this is actually easy for the host to detect, so we can do away with an explicit deflate operation. The fourth idea is that we do not care about reporting small 4KB chunks of free RAM which are spread all over the guest memory. Modern guests have compaction mechanisms which can help create large free pages in the order of multiple megabytes, so we only inflate with the largest chunks of memory that are still tracked by the guest memory management. From these ideas, we get a couple of features, which are called free page hinting and free page reporting. Let's look at them in a bit more detail. Free page hinting is the older one of the two features. It was contributed by Intel several years ago, and it was designed specifically to speed up migration. Here's how it works. Well, it all starts by the host write-protecting all of guest memory; that's normal for migration. The host then sends a request to the guest to start free page hinting. At this point the guest will take all free pages and add them all to the balloon. The host will start processing the pages sent to it, marking them so they won't be migrated, and also write-protecting them if not already write-protected. Meanwhile, should the guest need some free pages, it simply starts using them, even as the host processes them at the same time. Now, since the first thing that the guest does when using such a page is write into the page, and the page is write-protected, this will cause a fault, and the host will mark the page for migration again.
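On the QEMU side, free page hinting is an opt-in property of the balloon device; a minimal sketch (the property name is the one used by QEMU's virtio-balloon device, and migration then makes use of the hints automatically):

    # Enable free page hinting on the virtio-balloon device; during
    # migration, QEMU requests hints from the guest and avoids migrating
    # pages that were hinted as free.
    qemu-system-x86_64 -m 8G \
        -device virtio-balloon-pci,id=balloon0,free-page-hint=on \
        ...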
Now, unsurprisingly, this feature is a good fit for migration. It has no overhead unless requested. Hypervisor write tracking is used for migration anyway, so it's easy to reuse. The balloon can shrink without waiting for the host to make progress, so the guest is not slowed down. On the other hand, it's less than ideal as a solution for memory overcommit. The host needs to request it, and it's not clear when a good time to do that is outside of migration. Inflating all the free memory can get expensive; should we really do it often? Write tracking adds overhead to all guest writes, even to memory that is not free. To solve these issues, we have free page reporting, which is a newer feature, also from Intel. It's designed to solve the disadvantages of the hinting. Free page reporting is initiated by the guest, which takes action when a significant number of new free pages accumulates. At this point, the guest takes some of these pages, by default about one sixteenth of the free pages, and adds them to the balloon. The host processes the pages by marking them as free and then reports back to the guest that the pages have been processed. Now, unlike with hinting, the guest has to wait for the host to process the reported pages before taking them out of the balloon. And again, when a page is used, it is first of all written to, and this causes a fault and a memory allocation on the host. Now, this reporting is a good fit for overcommit because the guest activates it when memory becomes free. The implementation is also simple: there's no need to play with write tracking, which is easy to get wrong, and we also do not need to track guest writes to used pages, which are often most of guest memory writes. On the other hand, this feature adds overhead to memory-intensive workloads at all times, not just during migration. Also, shrinking must wait for the host, which can be blocked by the host scheduler, so it's less of a good fit for migration. So those are the two free-page features that we have.
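Again as a sketch of the QEMU side (the property name is the one used by QEMU's virtio-balloon device in newer versions; it is independent of, and can be combined with, the free-page-hint property shown earlier):

    # Enable free page reporting: the guest then reports batches of free
    # pages on its own, and QEMU can discard the backing memory on the
    # host until the guest reuses those pages.
    qemu-system-x86_64 -m 8G \
        -device virtio-balloon-pci,id=balloon0,free-page-reporting=on \
        ...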
Before we move on, I just want to mention a laundry list of balloon-related TODO items that we have, and some of them we have had for years. First of all, the guest free page solutions do not have a way to shrink guest caches like regular inflate does. We can bypass the guest page cache entirely, and that is the job of virtio-pmem, which David is going to talk about a little bit later, but we don't have a solution for other caches, for example application caches. Also, the balloon still doesn't really support device passthrough with VFIO, and supporting that is not easy; it needs someone who's ready to hack on the host side and in the drivers. There's also a slew of old balloon interface bugs that no one seems to want to fix. Resizing virtual machine memory via inflate and deflate is very limited. Guest and host page sizes are assumed to always be four kilobytes, which is not always the case. Out-of-memory handling is present in Linux, but it is underspecified, and contributions would be most certainly welcome. So let's talk a little bit more about virtio-pmem next. The basic idea of virtio-pmem is actually pretty simple: instead of exposing your disk image via virtio-blk or similar towards your virtual machine, you map the file directly into the guest physical address space and make the guest access that disk image similar to an NVDIMM, so persistent memory, also sometimes referred to as DAX, direct access. In contrast to an emulated NVDIMM, however, we get the benefit that flushing writes to disk actually works properly, and we're going to talk about that in a second. If we take a look at our guest physical address space, then with virtio-pmem we have our DAX device, meaning our file, directly mapped into this address space. If we compare that to an NVDIMM, it's actually pretty similar. The main difference here is that whenever we emulate an NVDIMM using a real NVDIMM, there is absolutely no issue. But at the point where we start to emulate an NVDIMM for our guest using a file, we run into issues when wanting to flush writes to disk. The nature of NVDIMMs is that they work using only memory flush instructions, to flush cache lines for example, and memory fence instructions. Once these instructions have been executed, the guest can be sure that everything is persistent. But if we map a file into our VM physical address space, this is no longer the case. We really have to intercept any kind of flush in QEMU, and in QEMU we have to go ahead and do an fsync, and only after the fsync has happened can we be sure that the data is actually persistent. This is very important in case our virtual machine crashes, because if data is not persistent on something that's supposed to be persistent memory, then we're in trouble. The big idea is to have a para-virtualized mechanism to perform flushes, and this is exactly what virtio-pmem does. By doing that, we get the benefit of DAX devices, meaning we can bypass the page cache in our guest completely and instead let the page cache for that file be managed completely in the hypervisor. So what are the advantages of virtio-pmem? Of course, we move the page cache handling from the guest to the hypervisor. We free up the guest page cache, and the hypervisor can easily decide when to shrink its page cache, for example when it's about to run out of memory. Also, it's a safe way to get a file-backed, NVDIMM-like device, because writes work properly, in contrast to emulating a real NVDIMM backed by a file, as I mentioned. Interestingly, as it's a virtio device, it gives us an NVDIMM-like mechanism, a DAX mechanism, even for architectures that don't have hardware NVDIMMs or that don't even have ACPI. So, for example, s390x might be feasible in the future. But there are also some disadvantages. Once we map this disk image directly into our VM physical address space, we really only support raw disk images for now, so no qcow2 or similar. Also, because we're now using the hypervisor page cache with multiple virtual machines, there are quite some security but also fairness concerns that users at least have to be aware of. Similar to real NVDIMMs, booting from it is not supported and requires an external kernel or another disk image, which could, for example, be read-only or similar. Also, it's worthwhile to mention that virtio-pmem is not applicable in all setups. For example, there are environments where the hypervisor page cache is not involved at all: imagine passing through a disk directly from your hypervisor to your guest, or accessing the disk using some other mediated device. Also, as soon as we have fairly big disks, this could become an issue.
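For reference, here is a rough sketch of how a virtio-pmem device can be wired up in QEMU (the option and device names follow QEMU's virtio-pmem support; the sizes, the image path and the exact maxmem requirement should be treated as illustrative assumptions):

    # Expose a raw disk image as a file-backed memory region and attach
    # it to a virtio-pmem device; maxmem leaves room for the device in
    # the guest physical address space.
    qemu-system-x86_64 -m 4G,maxmem=8G \
        -object memory-backend-file,id=mem1,share=on,mem-path=/path/to/disk.img,size=4G \
        -device virtio-pmem-pci,memdev=mem1,id=pmem0 \
        ...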
There are also still some open items to be sorted out. On the one hand, we eventually want to support other architectures, but also other guest operating systems; as far as I know, currently Linux is really the only one with support for it. Also, in the long term, we want to support other disk image types, and we could actually support something like qcow2 or similar by using some neat userfaultfd trickery. But of course, this is stuff for the future and might require more work to figure out how exactly it's going to be done. There's still one remaining bug to be solved, which involves pre-flushing, asynchronous flushes in Linux, stuff like that. Long story short, it's work in progress, but as long as that's not upstream, there are some cases where flushes might not actually be persistent yet. In the future, we also want to see libvirt integration, live migration support, hot unplug support, and a bunch of optimizations. But until then, virtio-pmem can be used just fine, keeping in mind the couple of things I mentioned. Now, let's talk about virtio-mem. Virtio-mem can be summarized as a fine-grained, NUMA-aware memory hot(un)plug mechanism to dynamically resize virtual machines. The idea is actually pretty simple. If you take a look at the memory your virtual machine has available, you usually have some kind of initial or boot memory, and you can extend that memory using various mechanisms. For example, you could use DIMMs to add more memory to your virtual machine, or remove DIMMs again by hot-unplugging them. But DIMMs have their own set of issues that I'm not going to go into detail on here. Virtio-mem is similar: virtio-mem can extend your initial VM size on a per-NUMA-node level. It works by each virtio-mem device providing a flexible amount of memory towards a virtual machine. Internally, this is implemented by a device managing a dedicated region in guest physical address space. It can be thought of as something like a resizable DIMM, but it's more complicated than that. One interesting fact is that virtio-mem devices are not discovered if you're running an unmodified operating system, meaning an operating system that is not aware of virtio-mem. That allows us to always know which memory a guest is allowed to touch and, for example, later detect malicious guests that might try to make use of more memory than they're actually allowed to. Internally, virtio-mem works at a granularity of blocks, for example two megabyte blocks, but they can be significantly bigger. A virtio-mem device itself has three main properties. First, it has a size, which corresponds to the amount of memory the virtio-mem device currently provides towards the virtual machine. It also has a maximum size, which corresponds to the maximum amount of memory that could be provided via the virtio-mem device towards the virtual machine. Last but not least, there is a requested size, which corresponds to a request from the hypervisor towards the guest to change the amount of memory that is consumed via the virtio-mem device. This mechanism allows you to resize a guest in fairly fine-grained steps. Using it is actually not too hard right now. First of all, you have to prepare your virtual machine for memory devices, just as you would have to for DIMMs and NVDIMMs, and now also for virtio-mem. After that, you create a memory backend, which is later used to back your virtio-mem device in your hypervisor, and the size you specify actually corresponds to the maximum size. Then you create your actual virtio-mem device: you can assign it to a NUMA node and connect the memory backend. By default, if you start your VM, your guest will not consume any additional memory via this virtio-mem device.
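Putting these steps together, a sketch of what this can look like with QEMU (the option and property names follow QEMU's virtio-mem support; the sizes and the NUMA node are just example values), including the resize requests that are described next:

    # Prepare the VM for memory devices: maxmem leaves room beyond the
    # 4G of boot memory. The memory backend's size is the maximum size
    # of the virtio-mem device; requested-size starts at 0.
    qemu-system-x86_64 -m 4G,maxmem=20G \
        -object memory-backend-ram,id=vmem0,size=16G \
        -device virtio-mem-pci,id=vm0,memdev=vmem0,node=0,requested-size=0 \
        ...

    # In the QEMU monitor (HMP): request that the device provide 4 GiB,
    (qemu) qom-set vm0 requested-size 4G
    # and observe how much memory the guest actually consumes via it.
    (qemu) qom-get vm0 size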
It will start consuming more memory as soon as you actually request it. You can request that, for example, via QMP or HMP in QEMU, using the qom-set and qom-get mechanism. So you could request that the guest consume, for example, four gigabytes via that device, and the guest will try to make that possible. You can then always observe how much memory the guest is actually consuming via the device by querying its current size. So what are the advantages and disadvantages? The advantages are obviously that you can resize a virtual machine in fairly small increments. Right now, with Linux guests on x86-64, you can resize in four megabyte granularity. Also, it's significantly more flexible than DIMMs and also significantly more flexible than memory ballooning. For example, memory ballooning does not support NUMA, and with DIMMs you have quite some granularity restrictions. Also, virtio-mem is able to manage VM size changes completely inside QEMU, so you don't have to mess with any DIMMs or anything else; all you do is request changes to the size of a virtio-mem device and see what happens. Interestingly, virtio-mem as a whole is also architecture independent; for example, it does not require ACPI, so it's also applicable to other architectures. Disadvantages are, for example, that it's not production ready yet. We have some basic versions upstream in Linux, QEMU and the cloud-hypervisor, but there are still some things that at least I want to see implemented and fixed before we can consider this production ready and I can sleep well at night. Also, it's slower than memory ballooning, and it cannot unplug as much memory as memory ballooning. The thing is that memory ballooning works on the whole virtual machine and not just on restricted physical memory regions inside your virtual machine, and memory ballooning usually works at 4K granularity, while virtio-mem works at four megabyte granularity. Also, it's currently incompatible with hibernation and suspend, meaning as soon as you have virtio-mem running with Linux guests, you won't be able to hibernate or suspend your guests anymore. This might change in the future, but might require quite some work. Open items are, just as with virtio-pmem, for example support for other architectures; ARM64 and s390x I have prototypes for, but of course other ones might also be interesting. Guest operating system support will also be challenging and interesting, for example getting Windows running with it. There are still quite some open items in the Linux driver; for example, how much memory you can actually unplug later on is not guaranteed yet, but that is work in progress. In QEMU, there are various things that have to be tackled, for example VFIO support, just as for virtio-balloon, meaning that you can pass through devices and still have this mechanism to resize virtual machines. Other QEMU future work would be protecting unplugged guest memory from being accessed again, meaning that a malicious guest cannot consume more memory than actually requested via the virtio-mem device. Also, just as for virtio-pmem, libvirt integration would be great to see in the future, especially once it's officially production ready. So, to summarize: what we see is more specialized mechanisms to manage guest memory. For example, we talked about the virtio-balloon extensions to speed up migration or to optimize memory overcommit in the hypervisor.
We talked about virtio-pmem to move the page cache handling to the hypervisor, and we talked about virtio-mem to resize your guest's memory in a fine-grained way. But it's worth noting that traditional balloon inflation and deflation still remains important. For example, the new mechanisms we have seen still have to mature; virtio-mem, for example, is still not production ready. Also, the more memory-management-intensive things we develop, the deeper the memory management integration in our guest actually has to be. For example, writing a balloon driver is pretty simple: all you have to do is essentially allocate memory and free memory. But getting Windows support for all of the other features we talked about today will be much more difficult, and so it is with all closed-source operating systems where we as open source developers cannot really influence core memory management. Also, in general, there is still a lot to optimize. As Michael already mentioned, the guest page cache still remains challenging, and so do other caches, like application caches, if you imagine something like that. For example, virtio-pmem isn't always applicable, and then you are essentially back to the same issue of the guest maybe consuming all of its main memory just for the page cache. Also, encrypted virtual machines remain challenging. I think this item hasn't really been tackled yet, but it's certainly stuff for the future, because the hypervisor isn't really allowed to modify the content of the virtual machine. Most of our mechanisms, for example, discard memory to optimize, and that is not possible there, so it requires some kind of coordination with the guest or with the encrypted VM setup. I think that virtio-balloon inflation and deflation should be feasible, and virtio-mem should be feasible. I'm not so sure about virtio-pmem, because we're mapping some content into our guest which is unencrypted, and that might be an issue. Also, VFIO, and in general PCI passthrough or other passthrough, remains challenging. The issue with VFIO is that it essentially pins all guest memory, forcing it to remain in hypervisor memory. So even if you had virtio-pmem, this would mean that your whole virtio-pmem device would be pinned in hypervisor memory, which is not really an improvement over what we have right now. And the same goes for all of the other items. We do have a prototype for virtio-mem that makes it work, but I guess a clean solution will still require some discussions in the future. And that's basically it for this talk. Thank you a lot for attending. If there are any questions, feel free to ask them in the chat or reach out to either me or Michael. I'll leave you with that. Here are some resources in case you want to learn more or look up stuff, and this is it. I hope you have a great time enjoying the rest of KVM Forum 2020.