Good day. On behalf of the oVirt, KVM, and Xen projects, I am pleased to welcome many of you to the virtualization and infrastructure-as-a-service devroom. Thank you for coming this afternoon. A quick housekeeping reminder: if you have to leave before the end of the session, please use the back exit so you won't disturb the speaker area, if at all possible. I've been asked by the speaker to ask you to hold questions until the end; he has a lot of material to get through. Hopefully there will be time for questions at the end, and if not, we'll figure something else out. So with that in mind, and without further ado, I'm pleased to welcome Andrea Arcangeli from Red Hat to our devroom today.

Thank you. It's a pleasure to be here. Today we are going to talk about the evolution of the Linux virtual memory subsystem over the last roughly 20 years, since it started around 1997. Then we'll look at some of the decisions behind the KVM (Kernel Virtual Machine) design and how it integrates with the virtual memory subsystem of Linux. And then we'll also see some of the latest virtual memory innovations, like automatic NUMA balancing, THP developments, and, very recently, userfaultfd. These are generic Linux kernel features which can be used outside of virtualization, but they especially help when you are using Linux as a hypervisor with KVM.

So first of all, what is virtual memory? Virtual memory is pages which effectively cost nothing. That's where your program runs, and it's practically unlimited on 64-bit architectures. The red arrows you see in the middle are the page tables. When you allocate memory, you allocate memory which is virtual, and then Linux decides where to put it; you don't actually know where. The memory at the bottom, instead, is physical pages: these cost money. These are the actual DIMMs you put in your computers and servers. And those arrows are implemented as a radix tree, which is the page tables. The chart shows only two pointers per level; in reality, on the x86-64 architecture, there are 512 pointers per level, but I cannot show that or the chart would get too big. All the page tables are 4 kilobytes in size, and you can see the total amount of memory used for page tables on your system with grep PageTables /proc/meminfo. These page tables are a little costly because you need to allocate them, so they are not entirely free.

On the x86 architecture, the page table layout is enforced by the hardware: this is a structure that the CPU walks in hardware. So it's not only software which reads it; Linux reads it too, but the format is mandated by the hardware. And this format also dictates how much virtual memory you can have on x86-64. If you do the math on the bits, the result is 48 bits of virtual address space (the arithmetic is sketched in the code below): 47 bits for userland and 47 bits of negative address space for the kernel. The kernel takes all the negative address space.

To understand the virtual memory subsystem and how it evolved, it's important to understand what I call the fabric of the virtual memory. The fabric is the collection of data structures which are used by all the kernel algorithms that work on memory management.
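A minimal sketch of that bit arithmetic, assuming the classic 4-level x86-64 layout the talk describes (illustrative code, not from the slides):

    #include <stdio.h>

    int main(void)
    {
        /* On x86-64 with 4-level paging, each page table is 4 KiB and
         * holds 512 eight-byte entries, so each level resolves 9 bits. */
        int bits_per_level = 9;     /* log2(512) */
        int levels = 4;             /* PGD, PUD, PMD, PTE */
        int page_offset_bits = 12;  /* log2(4096): offset inside the page */

        int va_bits = levels * bits_per_level + page_offset_bits;
        printf("virtual address bits: %d\n", va_bits);  /* 48 */
        /* The space is split into a positive (userland) half and a
         * negative (kernel) half, hence 47 bits each. */
        printf("bits per half: %d\n", va_bits - 1);     /* 47 */
        return 0;
    }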
And these abstractions are things like tasks, processes, and virtual memory areas, also called VMAs (which stands for virtual memory area, of course). mmap, glibc, malloc: all these things are built on top of the fabric. The fabric is also the most black-and-white part of the virtual memory, so it's the easiest to show, because the algorithms which compute on these data structures are often heuristics, and heuristics are difficult to explain. They also have to solve problems which don't have a perfect, guaranteed solution. For example, what's the perfect time to swap? What's the best candidate page to reclaim? All these problems don't have a perfect solution, while the data structures are black and white, so they are much easier to explain.

Also, in Linux we make heavy use of overcommit, and overcommit is enabled by default. It's not excessive overcommit: if you're on a Linux server with one gigabyte of RAM and you try to allocate one terabyte, it will return that there's no memory. But you can make even a one-terabyte allocation succeed just by writing 1 into the file at the bottom, /proc/sys/vm/overcommit_memory. If you do that, you can allocate as much virtual memory as you want; like I said before, it's free. Android actually does it, so it's not so insane to do. There is also echo 2: if you do echo 2, you enable strict commit accounting, which means you're not doing overcommit anymore, which means you need to add a lot of swap if you want to allocate more virtual memory than you have in RAM. Generally overcommit saves a huge amount of memory, so it's a very good technique.

The best way to start explaining the fabric is the page structure. Imagine your memory divided into small pieces of four kilobytes each. Every one of these pieces needs a structure to describe it, which is the struct page. So for each full four kilobytes of memory, a few bytes are used for the struct page itself, which is 64 bytes. Every four kilobytes, there are 64 bytes used just to describe the memory itself. If you do the math, 64 divided by 4096 means that on x86-64, 1.56% of the memory is wasted. It's not really wasted; it's used for managing the memory itself. This is allocated as an array called the mem_map array. mem_map is a little more advanced now (it can be split per node and zone, for example), but it's still an array where you have a struct page for every actual page in the system.

And we are always tight on page flags. This page structure is encoded very efficiently, almost compressed. It's a kernel structure; you can check the kernel source, it's a structure like any other. But it's extremely important not to grow it, because adding just eight bytes to this structure, globally, with a billion Linux devices in the world, would waste dozens of petabytes of memory. So it's extremely important not to grow this structure, and we do all kinds of tricks to avoid that. We're constantly running out of flags, like I said.

The other important structures are the MM and the VMA. The MM describes the process; it is effectively the memory of the process. Each process has a single MM, and if you have threads, they share the same MM. A VMA structure, instead, is created and torn down by syscalls like mmap and munmap.
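To make the overcommit point concrete, here is a small sketch (my example, assuming a 64-bit system; whether it succeeds under the default heuristic depends on the overcommit_memory setting above): it asks mmap for 1 TiB of anonymous virtual memory, and the mapping shows up as a VMA in /proc/self/maps.

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1UL << 40;   /* 1 TiB of virtual address space */

        /* Anonymous private mapping: pure virtual memory until touched.
         * MAP_NORESERVE asks the kernel not to account swap for it. */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");       /* strict accounting (echo 2) refuses this */
            return 1;
        }

        /* The kernel created a VMA; no physical pages were allocated.
         * Touching one byte faults in a single 4 KiB page. */
        p[0] = 1;
        printf("1 TiB mapped at %p; see /proc/self/maps\n", (void *)p);
        munmap(p, len);
        return 0;
    }

Note that under echo 2 (strict accounting), MAP_NORESERVE is ignored and the allocation is refused unless you really have the RAM plus swap to back it.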
So effectively, when you allocate memory, be it with malloc, be it with new, be it with a new Python object, or whatever, generally a VMA is created or enlarged. The VMAs effectively describe the layout of the memory in the process, and of course the VMAs are linked into the MM. So the MM is the process, and the VMA is the structure of the memory inside the process.

Now let me describe how Linux looked when I started, which was around 2.0 and 2.2. At the time we just had the mem_map array, which is what I just described: a struct page for every page. It's an array, but it's simple, with all these struct pages describing each page. And when you wanted to free memory, like when you had cache (we already had the page cache, of course; Linux was already pretty efficient, in fact we had the page cache well before others), if you wanted to free a page, you would need to scan all the pages in the mem_map array, including kernel memory, anything, and check every one of these pages until you found one which looked like a potential reclaim candidate, which was not mapped, so you could say: well, nobody is using this page, it's cache, let's free it. This was called the clock algorithm for page reclaim.

And there was also another issue: the cache is not always unmapped. When you read a file, you just generate cache; you don't map the cache into the memory of the process. But sometimes you do an mmap instead. For example, all the executables in Linux are cache which gets mapped; that's the way the binary loader loads executables into memory. And databases can mmap a file into the memory of the process and use mmapped I/O. When the memory is mapped, you cannot just free it, because there are page tables which might be pointing to the cache. So before you can free the cache, you need to get rid of the page table entries. In the 2.2 kernel, again, the only way to free an mmapped page was to scan all the page tables in the system. And there can be a lot of page tables, because, as I said, virtual memory is free: there can be many different virtual addresses pointing to the same physical page, like in this case here. So it was computationally very inefficient. It's not a fun algorithm, one which requires scanning effectively all the page tables in the system before you can hope to free a single page of cache, and then you add another scan on top to actually find the cache. So again, this was not scaling.

So in the 2.3 kernel we introduced the least-recently-used (LRU) list, which right now sounds so simple, but back then it didn't exist. We effectively linked together all the potential reclaim candidates, and we gave the list an LRU order. So instead of just scanning blindly in a loop over an array, we could now keep an order: the last cache page which got hit or allocated goes to the head of the list, and we shrink from the tail. So this is the basic least-recently-used list algorithm, where you have the head, and all these cache pages are linked together.

And then we went ahead, and in the 2.4 kernel we introduced an active and an inactive list, not just a single LRU anymore. The reason is that a single LRU cannot detect when you have a certain working set. The basic example is when you do a backup. When you do a backup, you have streaming I/O that gets read only once or written only once, so there is never a second access to cache.
So if you were to run a backup with a single LRU list, the whole LRU would quickly get destroyed, and the cache would only contain the data of the backup, which is useless, because you only access it once, maybe once a day. The active list effectively allows keeping a working set alive in the cache, even though the backup rolls through the inactive list. So the idea here is to keep the data of the backup only in the inactive list; then, when there is an activation, that is, when you get a cache hit on a page in the inactive list, the page is moved to the active list. And there was also something which at the time was called refill_inactive, which is a way to keep the active and inactive lists balanced. But a really good algorithm to keep these lists balanced was only implemented two years ago in the upstream kernel, through shadow entries; that's in a later slide. At the time, we just tried to keep a similar size between the active and the inactive list. It was a heuristic and it worked, but now we have an even better one.

So this is the same chart as before, but with two different lists. The idea is that the active list holds the working set, and the inactive list contains only the once-used pages, which are effectively trashing the cache. Ideally we wouldn't want to cache them at all, but again, we need a way to detect which stuff is useless to keep in the cache and what's actually part of the working set. You can see the active and inactive lists in /proc/meminfo. And of course, now there are more than just two: there are separate ones for anonymous and for file-backed mappings. So these days we have added more LRU lists, but the idea still works the same way in detecting the working set.

So what's the next problem? If you look at the previous chart, we made things more efficient here, so we didn't need to scan the whole memory anymore to find reclaim candidates; but whenever a page table was mapping one of these pages, we still had this clock algorithm over all the page tables. We couldn't just get rid of a given page: to get rid of it, we would need to scan the whole address space of every single program running in your computer, and that's not going to scale. You have to keep in mind that at the time, machines had very little memory. We had 100 megabytes of RAM, so it didn't take that long to scan the whole thing; these days, it would be unthinkable.

So the next step, in the 2.6 kernel (also 2.5, but the final version was 2.6), was to introduce rmap, which means reverse mapping. If you remember, I showed you before the page tables, which effectively decide where the memory of your program resides in physical memory: you run your program in virtual memory, but the page tables decide which page goes where in physical memory. Now let's assume you want to get rid of a particular page, because you think it's the best candidate: it's at the very end of the inactive LRU, and you need to free it. To free it, you need to update the page tables, and rmap implements exactly that: a software structure that provides a way to reach all the page tables which can possibly map this page. The arrow going down is the page table; that one is walked by the hardware, it's a hardware-mandated structure. But the rmap is completely a software thing. It's just for us, so we can reach the page tables and clear them.
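A toy model of the reverse map, much simpler than the kernel's real anon_vma and address_space chains (the names and structures here are mine, purely illustrative): each physical page keeps a chain of (page table, slot) pairs, so reclaim can reach and clear every entry mapping the page without scanning all address spaces.

    #include <stdio.h>
    #include <stdlib.h>

    /* One entry per mapping of a physical page: which toy "page table"
     * (standing in for a process address space) and which slot in it. */
    struct rmap_entry {
        unsigned long *page_table;
        unsigned long slot;
        struct rmap_entry *next;
    };

    /* A physical page with its reverse-map chain. */
    struct page {
        struct rmap_entry *rmap;
    };

    static void rmap_add(struct page *page, unsigned long *pt,
                         unsigned long slot)
    {
        struct rmap_entry *e = malloc(sizeof(*e));
        if (!e)
            abort();
        e->page_table = pt;
        e->slot = slot;
        e->next = page->rmap;
        page->rmap = e;
    }

    /* Walk only the mappings of THIS page and clear them, instead of
     * scanning every page table in the system. */
    static void try_to_unmap(struct page *page)
    {
        for (struct rmap_entry *e = page->rmap; e; e = e->next)
            e->page_table[e->slot] = 0;   /* clear the PTE */
    }

    int main(void)
    {
        unsigned long pt_a[4] = { 0 }, pt_b[4] = { 0 };
        struct page p = { 0 };

        /* The same physical page mapped at two virtual addresses. */
        pt_a[1] = pt_b[3] = 0x1000;
        rmap_add(&p, pt_a, 1);
        rmap_add(&p, pt_b, 3);

        try_to_unmap(&p);
        printf("pt_a[1]=%lx pt_b[3]=%lx\n", pt_a[1], pt_b[3]); /* both 0 */
        return 0;
    }

In the real kernel the chain heads live in shared anon_vma and address_space objects rather than one allocation per mapping, which is the efficiency point made above.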
Once we cleared it, we can get rid of the page. Or actually, better: I should say we make it not-present as a swap entry, because later we need to find where the data is in the swap device. After we effectively made it not-present, the next time you access the memory you get a page fault, and in the page fault we do the swap-in from disk. So after we removed the page from the page tables, we can free the page.

The object of the reverse mapping is made in a way that a single object can be shared by multiple pages, and a single object can reverse-map a huge amount of memory. We want to be efficient: we don't want to allocate a separate object for each one of the page tables that we need to reach. The only case where we have to do that is with KSM, and I'm not going into the details of that one, but KSM still has ways to limit too-long rmap chains; there's also a patch queued in -mm right now to do that. So this is a very efficient way of doing rmap in Linux.

And back to the previous chart: if we introduce rmap instead of the clock algorithm, things start to look very efficient now. Because when you want to get rid of, let's say, the tail of the inactive LRU, which would mean this one, let's assume we want to get rid of this one: you just use the rmap to reach the page tables, you invalidate the page table entries, and that's it. You already know which page to free and you also have a method to free it. So things scale, and that's the state we have had since the 2.6 kernel.

Then, more recently (I don't remember the exact version), we got, over time, a new way of detecting the working set of the process. The way to detect the working set consists of keeping track of something called the inactive age. The idea is effectively to be able to say, for example: this is the active list and this is the inactive list; what if we shrank two pages from the active list here? Then there would be enough space to cache the whole thing, including A and B, which currently don't fit. If we have a way to tell that the refault distance in this set is smaller than the refault distance in that set, we can tell: well, we should actually shrink the active list more aggressively, because if we do, we will be able to activate a huge amount of the inactive list. So the whole point of this algorithm is to decide where to draw the line between the inactive and the active list and move it more towards one side or the other, effectively to grow or shrink the active list dynamically depending on the working set.

The way to do it is very smart, and it consists of keeping track of this inactive age, and the refault distance, in the radix tree. The radix tree is the tree where you describe the cache belonging to a file: you have a file, and each offset in the file has an entry in the radix tree. Generally this radix tree is used for lookups, to find the page; effectively it's how you check whether the information you are going to read from the file is already in the cache. If it's already in the cache, you find a page. If it's not in the cache because it was reclaimed, you find an exceptional shadow entry holding the inactive age at eviction time, and each eviction also increments the inactive age. So effectively this trick allows us to optimally size the active and inactive lists. Like I said, this happened about two years ago, and it keeps being developed, by the way.
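A hedged sketch of the shadow-entry idea, loosely modeled on the kernel's workingset code (counters and thresholds simplified by me): every eviction bumps an age counter, the counter value is left behind in the radix tree slot as the shadow entry, and on refault the distance decides whether the page would have stayed in cache had the inactive list been bigger at the active list's expense.

    #include <stdbool.h>
    #include <stdio.h>

    /* Global "inactive age": bumped on every eviction. (The real kernel
     * also bumps it on activation; simplified here.) */
    static unsigned long inactive_age;
    static unsigned long nr_active_pages = 1000;  /* active list size */

    /* On eviction, the page's radix tree slot keeps a shadow entry
     * recording the age at eviction time. */
    static unsigned long workingset_eviction(void)
    {
        return inactive_age++;
    }

    /* On refault, compare how long ago the page was evicted with how
     * much extra room shrinking the active list could have given. */
    static bool workingset_refault(unsigned long shadow)
    {
        unsigned long refault_distance = inactive_age - shadow;

        /* Evicted fewer than nr_active_pages evictions ago: a bigger
         * inactive list would have kept it, so activate it, which in
         * turn pressures the active list to shrink. */
        return refault_distance <= nr_active_pages;
    }

    int main(void)
    {
        unsigned long shadow = workingset_eviction();

        /* 500 other evictions happen before the page is read again. */
        for (int i = 0; i < 500; i++)
            workingset_eviction();

        printf("activate on refault: %s\n",
               workingset_refault(shadow) ? "yes" : "no"); /* yes */
        return 0;
    }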
And in addition to that, multiply everything I just said many, many times, for many cgroups. Now I probably cannot chart it anymore; it gets too big again. But every memory cgroup has its own LRUs, and all the algorithms I described work within the cgroup, so within the container. A container will use memory cgroups; it's not enforced, it's optional, but you can do that.

And there are additional optimizations. For example, these days, if you don't have swap, anonymous memory cannot be reclaimed, so we don't add it to any reclaimable LRU: we keep it on something called the unevictable LRU. So again, we have many more optimizations, including the THP optimization (THP is transparent huge pages; we'll see it in a later slide), which increases the scalability of the LRU by 512 times, because we get a single entry describing two megabytes, not 4K. By having fewer entries in the LRU, reclaim gets faster.

Many things happened recently: automatic NUMA balancing, transparent huge pages, KSM, and even something for GPUs called HMM, heterogeneous memory management, which effectively allows computing on GPU memory without having to invoke anything in the driver of the GPU. The Linux kernel is able to move memory transparently from the main memory of the computer to the GPU memory, which is much faster, completely transparently. You just compute on the memory: whenever you start the computation on the GPU, with OpenCL or whatever, it's going to fault and move the memory into the GPU memory, and then when the main CPU accesses it, it's going to be brought back to main memory.

All these things, as you can see, follow the trend where the kernel tends to optimize the workload for you, without manual tuning; you can see it in all these features. And the optimizations can be optionally disabled wherever needed. For example, automatic NUMA balancing is very good, and it automatically converges each workload onto a different NUMA node. Say you're running a database on one side and a virtual machine on the other side: automatic NUMA balancing will detect the workloads and move the stuff onto each node separately. But if you need to be sure that the workloads run on different nodes, you would still need to use hard bindings. So the idea is, if you want to go the extra mile, you can still optimize things yourself. But the point of all these algorithms in the kernel is that you shouldn't need to do that to be very efficient: there shouldn't be much difference between manual optimization and what the kernel can do automatically for you. That is our objective.

So how can we use all these features for virtual machines, to run a virtual machine on a hypervisor? Why should we reinvent anything? We already have virtual memory, a scheduler; and in fact we don't reinvent anything, because we use KVM. The whole point of the KVM philosophy is to reuse the Linux code as much as possible. All these things I described work in the host, and they already optimize the guest, so we don't have to write anything at all. In fact, NUMA balancing was as important for other applications as for virtual machines; not much difference. Transparent huge pages give a bigger boost to virtual machines than to normal applications, but normal applications get the benefit too.
A lot of things, like drivers and power management: we definitely wouldn't want to write any driver at all. So things integrate well into the existing infrastructure with KVM. It's just a kernel module plus some MMU notifiers. In some cases we need to hook into the existing kernel code to do a few more optimizations, but it's not the norm. For example, transparent huge pages and automatic NUMA balancing have no KVM hook at all. You just run the thing as if KVM were a normal process; the algorithms in the host don't see any difference in the fact that it's a virtual machine, they work exactly the same way.

So this is just a chart showing virtual machines running just like ordinary processes. The difference is that when you have a virtual machine with KVM, it can also switch to guest mode, which an ordinary process wouldn't do. Normally, all the legacy, general applications use only user mode and kernel mode; with KVM, you also switch to guest mode. And there are lightweight switches: sometimes you can keep computing there, so you only go from guest mode to kernel mode. Sometimes you have to go down to QEMU, because you might need to do emulated I/O, and the driver of whatever device the guest is using is implemented in QEMU and not in the kernel.

And now I'm going to show a few benchmarks about NUMA balancing. Like I said, automatic NUMA balancing is included in RHEL 7; that's the first release which included it. Before RHEL 7, you had to use hard bindings: you could already optimize for NUMA, but it was absolutely not automatic. And of course, the interesting comparison here is against hard bindings, because, like I said, even before automatic NUMA balancing you could already optimize a workload for NUMA, but you had to use all kinds of hard bindings: numactl, mempolicy, mbind. So there are many tools you can use to optimize for NUMA by hand. But the point is that this is difficult and not flexible, especially with virtual machines. You want to start new virtual machines all the time, shut them down; they need to move from one node to another node; you don't want to do all this management by hand. In fact, we also implemented something called numad, which uses the hard bindings but applies them for you, so as an administrator you don't have to do all this binding and decide in which NUMA node to run each workload. And keep in mind, every server with two sockets today is a NUMA machine. So if you wanted to run optimally, you would need to do all this by hand without automatic NUMA balancing or numad.

So our idea is that automatic NUMA balancing should perform almost as fast as an optimal hard placement. And as you can see, the blue line is the hard binding; it's the fastest. The red line is automatic NUMA balancing: the kernel intelligence in the virtual memory subsystem effectively figuring out the best way to run the workload, where to put the memory, where to run the processes on the CPUs. And the yellow, or orange, I don't know how it looks there, is just the standard behavior without automatic NUMA balancing. Linux always had some NUMA bias at allocation time, but it was very short-lived. So it kind of worked for GCC, which allocates memory, uses it, and frees it; Linux was already kind of okay there before automatic NUMA balancing. But not for long-lived allocations, like a virtual machine: a virtual machine you start once, and it keeps running potentially for a year.
And for that, Linux was not optimal before automatic NUMA balancing. As you can see at the very top of the lines, hard bindings still give a little bit of an edge, so they are worth it for very complex workloads. But especially on smaller systems, where you use only a more limited number of nodes and database instances, the performance is almost identical. So this is a very good result, and that's why automatic NUMA balancing is enabled by default in RHEL 7.

Something important here: you need to know how to enable and disable this, and that's the slide showing it. First of all, numactl --hardware shows you whether the hardware you're running on is NUMA or not. If it's not NUMA, it will complain; if it shows you the layout, it means you're running on a NUMA system. If you're running on a NUMA system, with /proc/sys/kernel/numa_balancing you can enable or disable the feature by echoing a 1 or a 0 into it, respectively. At boot you also have an option: on the GRUB command line, you can pass numa_balancing=enable or numa_balancing=disable.

Then we'll see some other things which I think are interesting, about huge pages. Transparent huge pages is a relatively recent feature. It's not so recent, but it has gotten a lot of development lately. For example, in the 4.8 kernel we just merged THP for tmpfs. The original THP, transparent huge pages, was only for anonymous memory; since 4.8, you can also use it for shared memory. I say tmpfs, but it works for everything: System V IPC shared memory, mmap MAP_SHARED of /dev/zero, any kind of API which can generate shared memory, including memfd_create.

The whole point of transparent huge pages is to drop a layer of the page tables. At least, that's the way it works on the x86 architecture; on other architectures it's a bit different, but either way the point is to make TLB misses faster. It means that when you access the memory, it's going to run faster; it's like speeding up the computation of your CPU. And the benefits of huge pages are especially visible in virtualization environments. We'll see in the next slide why.

But there are some cons, and the cons are generally about the cost of the page fault. The page fault is going to cost more, because you're going to allocate more memory at once, and before you can map the memory in userland, you need to clear it. We cannot show previous user data to the guest, or to a process, or to anything: every time you allocate memory, it always shows up as zeros, because we clear it; otherwise it would be a security issue. And because we have to clear two megabytes, not 4K, the fault might run a little slower. Of course it's slower. There is also a somewhat higher memory footprint at times. And generating huge pages also takes time, more time than allocating a 4K page; that's called direct compaction.

And this is a chart showing why transparent huge pages improve performance in the first place (there are a lot of reasons), but also why, in a virtualization environment, it makes more of a difference to have transparent huge pages enabled or not than on bare metal. This is the number of memory accesses the CPU has to do to reach the actual data, starting from the virtual address in your application, which is virtual, not physical. When you use EPT, it needs to do about 25 accesses. I don't remember exactly the number, but whatever is in the chart, the chart is accurate.
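Those counts follow from the standard two-dimensional page walk arithmetic; here is a sketch of it (my reconstruction, not the slide itself). With n guest levels and m host levels, a worst-case nested walk costs (n+1)*(m+1)-1 memory accesses, because each of the guest's n page table loads, plus the final data access, is a guest-physical access that needs its own m-level host walk. That gives 24 for 4-level-on-4-level, in the same ballpark as the ~25 just quoted, and 15 when huge pages remove one level on each side, close to the ~17 quoted next.

    #include <stdio.h>

    /* Worst-case memory accesses for a nested (EPT) page walk:
     * each of the guest's n page-table loads, plus the final guest
     * data access, is a guest-physical access that itself needs m
     * host page-table loads plus one more load. Hence (n+1)*(m+1),
     * minus 1 because the very last load is the data itself. */
    static int nested_walk_accesses(int guest_levels, int host_levels)
    {
        return (guest_levels + 1) * (host_levels + 1) - 1;
    }

    int main(void)
    {
        /* 4-level paging in both guest and host: 4 KiB pages. */
        printf("4K guest / 4K host: %d\n", nested_walk_accesses(4, 4));
        /* THP in guest and host drops one level on each side. */
        printf("2M guest / 2M host: %d\n", nested_walk_accesses(3, 3));
        /* Bare metal for comparison: a 4-level walk plus the data. */
        printf("bare metal 4K:      %d\n", 4 + 1);
        return 0;
    }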
And if you enable THP, both in the guest and in the host, you are going to drop it to something like 17. Using transparent huge pages basically doesn't require any change in your application: you just allocate a piece of memory bigger than two megabytes, and you're going to get transparent huge pages on it. So it does the same thing you would have been doing earlier with hugetlbfs, but without having to use hugetlbfs. It's entirely transparent to userland.

So, let's get to the interesting part. You will see some applications, and I don't think we have time here to go into why, that run slower with transparent huge pages. One example is Redis, and there is sort of a very good reason why it runs slower, and they recommend disabling transparent huge pages for Redis. Well, actually, for Redis, one should be using prctl to disable transparent huge pages for that single application. Generally, Redis is really a case where it makes sense to disable transparent huge pages, but in most other cases it does not make sense. The only thing which does make sense, which is actually the new default (and I'm not sure I fully agree with the new default, I prefer always, but still), is this: if you have any slowdown, you should not disable transparent huge pages themselves, which practically never slow down performance; you should disable direct compaction, because the only thing which is costly is the generation of the huge page. Actually, clearing the page is pretty fast; two megabytes can even fit in the CPU cache, so it's not so bad. So the idea is: if you have regressions, before you go from always to madvise in the main knob, /sys/kernel/mm/transparent_hugepage/enabled, which is really turning off the feature, first you should go from always to madvise in /sys/kernel/mm/transparent_hugepage/defrag. defrag only controls how aggressively, and with how much CPU time, you are going to try to create a huge page.

And it depends on the workload which setting is better. If the allocation is long-lived, it's always better to use always: even if it takes time to generate the huge page, you're going to use it for a long time, and it's going to run much faster. If the allocation is very short-lived, it might be faster to use 4K pages and not spend the time creating a huge page, because you're going to free it immediately; and unless you compute a lot on the page, you're not going to get much benefit from having dropped that layer of page tables and from the huge TLB entries.

So the last thing, and I have three minutes for this one, is memory externalization. Memory externalization is a concept where you effectively put the memory you are computing on in a different computer. So it's not swapping; it's literally giving the memory away to somebody else while the program is running. And when the program is running and it accesses memory you dropped from your computer, it does what we call a userfault, and the userfault brings the memory back into your local memory so you can continue the computation. Postcopy live migration is a subset of this idea: postcopy live migration effectively allows running a virtual machine on the destination node while the memory still sits on the source. So I implemented this userfaultfd syscall, which is used in QEMU's current postcopy live migration implementation, and that is already in QEMU upstream. If you try the upstream git of QEMU, it's already available.
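Here is a minimal, hedged sketch of the userfaultfd flow that postcopy builds on (Linux 4.3+; compile with -pthread; error handling omitted; this is the bare syscall pattern, not QEMU's actual implementation): a region is registered for missing-page faults, and a handler thread resolves each fault with UFFDIO_COPY, the way postcopy fills pages arriving from the source node.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int uffd;
    static long page_size;
    /* Page content we pretend to fetch from the remote source node.
     * UFFDIO_COPY wants a page-aligned source (assumes 4 KiB pages). */
    static char src_page[4096] __attribute__((aligned(4096)));

    static void *fault_handler(void *arg)
    {
        (void)arg;
        for (;;) {
            struct uffd_msg msg;
            if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                continue;
            if (msg.event != UFFD_EVENT_PAGEFAULT)
                continue;

            /* Atomically fill the faulting page and wake the faulting
             * thread, like postcopy does when the page arrives. */
            struct uffdio_copy copy = {
                .dst = msg.arg.pagefault.address & ~(page_size - 1),
                .src = (unsigned long)src_page,
                .len = page_size,
                .mode = 0,
            };
            ioctl(uffd, UFFDIO_COPY, &copy);
        }
        return NULL;
    }

    int main(void)
    {
        page_size = sysconf(_SC_PAGESIZE);
        memset(src_page, 'A', sizeof(src_page));

        uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        /* "Externalized" memory: a mapping with no pages behind it. */
        char *area = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct uffdio_register reg = {
            .range = { .start = (unsigned long)area, .len = page_size },
            .mode = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        pthread_t thr;
        pthread_create(&thr, NULL, fault_handler, NULL);

        /* This read faults; the handler thread supplies the page. */
        printf("first byte: %c\n", area[0]);   /* prints 'A' */
        return 0;
    }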
And the userfaultfd syscall is also available in RHEL 7.2, and it's possible to do postcopy live migration as an option in the latest RHEL 7. The userfault latency is similar to something you can imagine like swap, and this is a chart showing, during postcopy live migration, the time it takes for the guest to access the remote memory. And again, it's not very different from swapping; we are talking about 17 milliseconds. And this is, of course, on 10 gigabit ethernet, so it's not even the fastest possible.

And you might be asking yourself why sometimes it takes so little. It's because there is also a background transfer; it's not like we just wait to hit memory which is missing. In the meantime, while the guest computes on the destination node, the source just keeps sending all the memory in the background. And sometimes the memory is already streaming in while the userfault happens, and you don't have to transfer it; you find it already there. So it's like a false-positive fault, and it gets completed immediately, in less than one millisecond. This is also frequent because we do readaround: when we get a userfault, we tell the background transfer to keep sending from that address, from that piece of memory, because it's very likely that the moment the destination guest wakes up again, when the vCPU on the destination resumes running, it will touch a piece of memory which is very close to the one which triggered the first fault. So we have all kinds of optimizations to maximize these fast cases; of course, we like it when the fault is fast.

This is a chart showing a comparison between precopy and postcopy; it's a database performance benchmark. There are three different phases: before the live migration, during it, and after the live migration. You can see precopy has a regression in performance for the whole duration of the precopy, and this here is postcopy live migration. And this is going to be the last slide, because I'm running out of time, but you can see postcopy live migration brings the performance back very fast, and above all, it can finish the live migration, where the precopy case didn't.

So that's all. For questions, I think we ran out of time, so we're going to do that at the oVirt booth, I think. Thank you very much. Thank you.