Hello everyone. My name is Joao Martins and I'm here to talk about memory efficiency and ways we could improve it. I'll start with a short motivation on what got us here.

Linux keeps a mapping of all physical memory in the kernel page tables, which is called the direct map. This address space is modified in a lot of ways throughout the different stages of the OS life cycle: at boot, when initializing the memory map and all the system RAM, when hotplugging new memory, or when you try to map an I/O address. Portions of this address space may be backed by various kinds of metadata, which track different things at different granularities. We have memblock, which describes blocks of memory. You have the underlying memory model, so the vmemmap, which looks like a single contiguous region in which each individual address points to struct pages, while at the lower end, as you can see, you have the struct pages themselves. What goes into the actual page tables, though, may be managed in bigger chunks; it's often the case that 2 MB pages are used for the direct map. So a process may think it has mapped only what it set up for a guest with its memory slots, or any private VMM data, but when you enter the kernel and switch to the kernel page tables, all guest memory is readily available alongside other kernel allocations or anonymous memory.

With respect to this direct map, I would like to highlight one particular metadata structure, struct page, which is going to be the subject of most of this talk. The design of this data structure is largely driven by the needs of the page cache and anonymous memory, and it's also the structure used by most kernel services as the most granular way of tracking memory. The purpose of the data structure is to track references to a PFN, alongside file mappings and other subsystem-specific data. You get one of those struct pages when you use the buddy allocator with alloc_page() or get_free_pages(), or when you grab a reference to an existing page with get_user_pages(), where you pin memory short or long term. The data structure has some overhead: the structure is about 64 bytes in size, and it usually tracks 4K, although certain architectures allow this to be tracked at a bigger granularity; arm64, for example, lets you play with the underlying page size, say 64K. On top of that structure you have other overheads, like spending 8 bytes per EPT entry, and in the process page tables you spend about 8 bytes as well for each PTE. These page table costs can be amortized to a great extent when you use huge pages.

When you put it all together, we are talking about 1.5% to 1.75% of all physical memory. At first glance that does not look like much, but let's revisit what that actually means in practical terms. If we extrapolate that to, say, 2 terabytes of memory, we are spending about 32 to 36 gigabytes; for short, that's roughly 16 gigs per terabyte. And if you work with a slightly bigger machine, like an 8 terabyte machine, you spend about 128 to 160 gigabytes of memory. These are not crazy numbers; they are the numbers on machines we actually have, where a lot of the memory being spent could hopefully be used to boot more guests instead. So take into consideration where this is going, and the fact that DIMMs are getting denser.
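To make that arithmetic concrete, here is a small back-of-the-envelope sketch of the struct page overhead described above (64 bytes of metadata per 4 KiB base page, i.e. about 1.56% before page table overheads); the exact percentages quoted in the talk also fold in the page table costs, so treat these numbers as approximate.

```c
#include <stdio.h>

int main(void)
{
	const double struct_page_bytes = 64;	/* sizeof(struct page) */
	const double base_page_bytes = 4096;	/* 4 KiB base pages */

	/* 64 / 4096 = 1.5625% of RAM spent on struct page alone */
	double fraction = struct_page_bytes / base_page_bytes;

	double tib = 1024.0 * 1024 * 1024 * 1024;
	double sizes_tib[] = { 1, 2, 8, 64 };

	for (int i = 0; i < 4; i++) {
		double overhead = sizes_tib[i] * tib * fraction;
		printf("%5.0f TiB of RAM -> ~%6.1f GiB of struct page\n",
		       sizes_tib[i], overhead / (1024.0 * 1024 * 1024));
	}

	/* Host page tables and EPT add roughly 8 bytes per 4K entry each
	 * on top of this, but huge pages amortize most of that cost. */
	return 0;
}
```

This prints roughly 16 GiB per TiB, 32 GiB at 2 TiB, 128 GiB at 8 TiB and about 1 TiB at 64 TiB, in line with the figures above.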
To put this overhead in perspective, if you take a 64 terabyte machine, you would be spending about 1.2 terabytes on struct page.

Then take into account the recent speculative execution vulnerabilities in CPU hardware, where speculation can be abused through code gadgets, or where you could potentially leak all the memory mapped by the kernel into user space. These are exploited through various CPU resources, like the L1 data cache or microarchitectural buffers. They all have one thing in common: given that the kernel maps everything, everything is leakable. The one I would like to give special emphasis to is Spectre v1, which is hard to mitigate, in that you have just so many code gadgets you need to hunt, and every new merge window potentially adds new code gadgets that could be exploited.

So the main premise I'm trying to raise here is: can we do better for hypervisors? The problem, you see, is that struct page does not really reflect what goes into the page tables. And if you look at more modern hypervisors, we won't be needing the majority of the kernel services, say if you're just doing CPU, memory, and PCI device assignment. In those circumstances, we are essentially losing a lot of efficiency to what represents a largely idle infrastructure throughout the guest's or host's lifetime, while potentially unnecessarily mapping all customer data when we probably don't need to.

The first step towards fixing some of this led us to ask what happens if you try to remove struct page. First, let me describe one way you can already do some of this today, and that's through /dev/mem and mem=X. Essentially, what you do as a user is specify mem= and some amount, and that's the amount of memory the kernel is limited to; then you have the special device called /dev/mem, where you can map any memory on the system. There are a couple of problems with this that I would like to enumerate. First and foremost, when you specify mem=X, you have no way to characterize where exactly you want to take that amount from, so you potentially either restrict it to the first node or straddle all nodes to fulfill that parameter. You do have a mechanism to not have struct pages for that memory, but you're limited to a single contiguous chunk. Also, you can only memory-map that memory in 4K page sizes; you have no 2 MB or 1 GB huge pages, which leads to my next point: you get great fragmentation across hundreds or thousands of guest creations, and you end up with a fragmented system. So, in addition to not having huge pages, which would amortize some of that, you want the ability to pick free holes within already allocated chunks to accommodate a given allocation. To do that with /dev/mem, you need to mmap /dev/mem several times at different page offsets (a minimal sketch of this follows below). But you still need to give a given VMM access to /dev/mem, which then breaks the model a little, in that the VMM runs in a deprivileged environment and therefore should not be able to map any memory on the system. And finally, you don't have a way to give some of that memory back to the kernel, say to rescue the host from an out-of-memory situation.
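As a rough illustration of the /dev/mem approach (not something to recommend), the sketch below maps two separate chunks of the memory left out by mem=X by mmap()ing /dev/mem at different physical offsets. The physical addresses are made up for the example, and things like CONFIG_STRICT_DEVMEM have to be out of the way for this to work at all.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical physical addresses above the mem=X limit given to the kernel. */
#define CHUNK0_PHYS	0x100000000UL	/* 4 GiB */
#define CHUNK1_PHYS	0x180000000UL	/* 6 GiB */
#define CHUNK_SIZE	(1UL << 30)	/* 1 GiB each */

int main(void)
{
	int fd = open("/dev/mem", O_RDWR);
	if (fd < 0) {
		perror("open /dev/mem");
		return 1;
	}

	/* Each chunk needs its own mmap() call, with the physical address
	 * passed as the file offset; only 4K mappings, no huge pages. */
	void *g0 = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
			MAP_SHARED, fd, CHUNK0_PHYS);
	void *g1 = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
			MAP_SHARED, fd, CHUNK1_PHYS);

	if (g0 == MAP_FAILED || g1 == MAP_FAILED) {
		perror("mmap /dev/mem");
		return 1;
	}
	printf("chunk0 at %p, chunk1 at %p\n", g0, g1);
	close(fd);
	return 0;
}
```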
All these problems led us to look at DAX, which is the other mechanism whose main purpose is to give you direct access to memory. Its biggest consumer is pmem, and the interface behind device DAX is very simple: it's a character device which you instrument with mmap(), and at device creation the kernel memory map for the given chunk of memory, i.e. all this metadata, gets created. The application has control over how the memory is mapped, like in 4K pages, 2 MB or 1 GB pages. For any exception you get, like MCEs on x86, or a fault caused by the application, you get a signal. And finally, you have mechanisms to return that memory back to the kernel, say with the DAX kmem driver. You can emulate some of this with the memmap= option, although the problem with that option is that it has a lot of power to mess up your memory map, so users really need deep knowledge of what the hardware memory map looks like to be able to pick an actual RAM range.

What I want to clarify here is that people usually lump DAX together as one thing, but there are different kinds of DAX: there's device DAX, which is this very simple device; there's pmem, which is a block device; and there's filesystem DAX, whose purpose is to bypass the page cache. These are three different things, and the one I'm emphasizing here is device DAX.

I think there are a couple of problems with device DAX, largely derived from its biggest consumer, which is pmem. Within persistent memory, namespaces do not support discontiguous regions; a namespace can only be one contiguous chunk. And because you need to initialize all those many struct pages, you have a long initialization time when bringing your DAX device online. While you can represent huge pages in the page tables, the way these struct pages look is not the same as, say, transparent huge pages or hugetlbfs, where you would have a head page and a set of tail pages representing a 2 MB or 1 GB page; here you have to look at the page tables to understand whether a given struct page belongs to a huge page or not. And finally, you need architecture support for devmap, which is the kernel's way of saying that this particular page, basically a PFN, belongs to a device, and therefore to ZONE_DEVICE, a special zone in the kernel.

So the main question we got to was: given that device DAX already provides almost everything we need, how could we repurpose some of this, while making some improvements, for volatile memory? With that, let us look at dax_hmem, which is the driver that can be used for performance-differentiated RAM. We can drop the memmap= option and instead use EFI: we extended handling of the EFI memory map such that RAM ranges marked with the EFI specific-purpose attribute are honoured. So we only need to care about RAM ranges, and you don't need to understand how exactly firmware exposes everything else that is not RAM. We essentially give firmware the ability to dedicate memory to user space: when memory is marked with the specific-purpose attribute, the kernel is going to create one DAX device and hand it to user space as a memory-mappable DAX device.
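Mapping such a device from user space looks roughly like the sketch below: open the character device and mmap() it with an offset and length that respect the device's alignment (2 MB in this assumed example); the device path and sizes are placeholders.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define ALIGN_2M	(2UL << 20)

int main(void)
{
	/* Placeholder device DAX instance, as created by firmware/daxctl. */
	int fd = open("/dev/dax0.0", O_RDWR);
	if (fd < 0) {
		perror("open /dev/dax0.0");
		return 1;
	}

	size_t len = 512 * ALIGN_2M;	/* 1 GiB, a multiple of the device alignment */

	/* Device DAX enforces that offset and length respect the alignment,
	 * so the kernel can install 2M (or 1G) mappings in the page tables. */
	void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	printf("guest memory backing at %p (%zu bytes)\n", mem, len);
	munmap(mem, len);
	close(fd);
	return 0;
}
```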
So essentially all we had to fix was to support discontiguous regions, where we try to pick free ranges to accommodate a given allocation as opposed to requiring one contiguous chunk, and that helps tremendously with fragmentation. Then, for the way you allocate this, you can either give the application control over which ranges to pick, or you can defer to the DAX subsystem, where a simple range allocator reuses freed ranges or allocates new ones to fulfill the allocation requested by the user. The fact that you can specify the mapping is especially useful for use cases like pmem live restart, QEMU live update or KVM live update, where you want to preserve the exact same ranges while not scrubbing that memory the next time you map it again. I would like to defer to Jason's and Steve's presentations, which cover a lot of what I refer to here.

The next step was then, obviously, to remove struct page from device DAX. A lot of the bigger infrastructure work was the repurposing of DAX and fixing this contiguity limitation; all that was left was to have a page-less memory map while keeping the same properties of DAX, namely a static PFN mapping for a given VA range. You still know, at device creation, what VA is going to be mapped to a particular PFN. Essentially, the VMA type is going to be VM_PFNMAP, which for the kernel means there are no struct pages behind it. We leverage a lot of the work KVM already has, which provides an alternative guest mapping path when memory is not backed by struct pages; we had no changes specific to DAX in KVM, it was mostly bug fixes general to the usage of PFN maps. We had to support huge pages for page special (page special is how the kernel says this memory does not have a struct page). We had to fix memory failure, as the kernel bails out early on an MCE hitting memory it does not track. And we had to reflect what the actual cacheability property for RAM is, so we can map it as write-back as opposed to uncacheable; that's not really different from the situation with /dev/mem. Again, there was no logic specific to DAX to make this work.

Going back to the previous diagram I explained earlier, where we had all the memory typed into the direct map: what we're essentially doing here by removing struct page is gaining that memory efficiency back and removing guest memory from the direct map, so it is less subject to leakage.

In practice, what you do is specify this efi_fake_mem option. The option, as you can tell from its syntax, is not really intuitive, so there is still work to make it slightly more user friendly. But what we are essentially describing here is that my hypervisor is going to have 16 gigabytes per node available for kernel-managed allocations, and the rest is assigned to DAX; so DAX has 368 gigs per node, and you essentially bring up two regions, one per NUMA node. In /proc/iomem this appears as Soft Reserved, and you are then supposed to use the daxctl tools to instrument this region, or you can come up with your own tools which use the sysfs API for that purpose. You can then create various devices, say for a 30 gigabyte guest with a given huge page size, and you select which region you want.
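Once such a device has been carved out and mapped, wiring it into a guest is the usual KVM memslot dance. A minimal, hedged sketch (error handling trimmed, device path and sizes are placeholders):

```c
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	int vm = ioctl(kvm, KVM_CREATE_VM, 0);

	/* Placeholder device DAX instance carved out of the Soft Reserved region. */
	int dax = open("/dev/dax0.0", O_RDWR);
	size_t len = 30UL << 30;	/* the 30 GiB guest from the example above */
	void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, dax, 0);

	/* Register the DAX-backed VMA as guest RAM; from the VMM's point of
	 * view this is just a userspace address range, struct page or not. */
	struct kvm_userspace_memory_region region = {
		.slot = 0,
		.guest_phys_addr = 0,
		.memory_size = len,
		.userspace_addr = (unsigned long)mem,
	};
	ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

	/* ... create vCPUs and run the guest as usual ... */
	close(dax);
	close(vm);
	close(kvm);
	return 0;
}
```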
And then, optionally, what we are trying to add is a way to say that for this device's memory you want no metadata, so no struct pages get created for these devices, which also tremendously speeds up bringing the device up. You then use this like any other regular file-backed memory; there's nothing really different there, and it's the same as hugetlbfs or any other shared memory mechanism.

One use case I see for this: you could let KVM bind to these devices, similar to what we do with DAX kmem where we give memory back to the kernel, but here KVM would use it to back some of the data structures used when doing work on behalf of the guest, such as where we would be hiding the vCPU registers. There are a couple of call sites in KVM one could pick; this is just for example purposes, and it does not need to be page-less so long as it's not part of the direct map. I have sketched this in some form, and this would be one way to implement a poor man's version of process-local memory, which was a patch set proposed by some of the AWS folks. Another use case could be to use this memory for any other VMM allocations; it could well serve as a memory pool as opposed to resorting to anonymous allocations.

To recap some of the advantages: by removing struct page you sort of kill two birds with one stone. You get a ton of memory back that was being lost to struct page, and, fundamentally, because the kernel does not have that memory mapped, that customer data is less prone to leakage to other guests. Use cases that preserve memory across hypervisor or VMM live update are more easily done, fundamentally given how DAX works and the control it gives to the application. And hunting down Spectre-and-the-like gadgets, especially those in the context of guest memory, gets a lot easier to mitigate.

But there are pitfalls to this approach as well. Once you remove struct page, you're on your own: subsystems don't really work well without it, and you're largely losing certain kinds of services, given that you don't have get_user_pages() and so on and so forth. An easy pick is that zero-copy networking doesn't work. For example, if you do MSG_ZEROCOPY, or if you use O_DIRECT, then, given that get_user_pages() does not return any actual struct pages, you will fail the I/O. This does work for certain specialized cases such as KVM, or if you do basic PCI assignment, but even there, there are some issues, in that in addition to using follow_pfn you're expected to track page table entry updates in your secondary mapping; you usually need to register some form of MMU notifier in addition to using follow_pfn. KVM does it right, but other subsystems would need to as well. So if you're giving a device to VFIO, it does work today, but if you invalidate a given range, you may want to reflect that into the IOMMU. And vhost-net I/O does work, but again, it is limited to the copy-based path, which is also the default in vhost-net, because the mm owner is the same as the VMM. We also had to come up with a little trick on the host storage side, where you allocate staging buffers to drive some of that I/O (a sketch of that idea follows below). The big drawback is losing kernel services.
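The staging-buffer idea amounts to a bounce buffer: since get_user_pages() cannot pin the page-less guest memory, direct I/O targets an ordinary anonymous buffer, and the data is then copied into the DAX-backed mapping with a plain CPU copy. A minimal sketch, with the disk path, offsets and sizes as stand-ins:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Read one chunk from a raw disk with O_DIRECT into guest memory that is
 * backed by a page-less (VM_PFNMAP) device DAX mapping. Direct I/O into
 * 'guest_mem' itself would fail GUP, so bounce through a staging buffer.
 */
static int read_into_guest(const char *disk, void *guest_mem,
			   off_t disk_off, size_t len)
{
	void *staging;
	int fd, ret = -1;

	fd = open(disk, O_RDONLY | O_DIRECT);
	if (fd < 0)
		return -1;

	/* O_DIRECT wants an aligned buffer; anonymous memory has struct
	 * pages, so the kernel can pin it for the DMA. */
	if (posix_memalign(&staging, 4096, len))
		goto out_close;

	if (pread(fd, staging, len, disk_off) == (ssize_t)len) {
		memcpy(guest_mem, staging, len);	/* plain CPU copy into the PFNMAP VMA */
		ret = 0;
	}

	free(staging);
out_close:
	close(fd);
	return ret;
}
```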
So what are the directions we are looking at here? There are two approaches. The other, longer-term approach we are looking at is ASI, address space isolation, which takes the safer route of securing a greater portion of what KVM is handling, versus this approach of removing struct page, which is the opposite, where you're trying to protect certain portions of memory. But I believe the two could work in concert: you could use this mechanism to, say, protect customer data, and use ASI to protect VMM and kernel private data, which is a better catch-all for the number of allocations done on behalf of the guest. I think it could also serve as a performance improvement: if you're not exiting to user space, could you elide some of these mitigations, say the MDS flush, and could that serve as a performance optimization?

Also, the larger problem at hand here is that huge pages need to work better. Struct pages need to better reflect the underlying size in the page tables, or the alternative is to have subsystems work without struct pages. One good example is the large pages in the page cache work, where you only look at the head pages to compute a particular address; you don't need to use all those tail pages, and they sort of become an implementation detail from the subsystem or user perspective. get_user_pages(), for example, would just return you head pages and no tail pages, to give just one easy example.

There's also an interesting approach, which I thought I'd mention, that came up a month or so ago, and that is for allocators like hugetlbfs and DAX, where DAX could also use PMDs to map the vmemmap; this is not only applicable here, it also covers persistent memory. The interesting question that was raised is: what happens if portions of the vmemmap reuse the same tail pages? If all the tail pages could share the same backing memory, provided that a subset of unique struct pages carries all the information you need for a 2 MB or 1 GB page, what happens when the remaining ones are not needed and they all point to the same memory? That would mean one thing: you need less memory to back those struct pages. You still have the struct pages, and it still looks like you have one unique to every 4K chunk, but the others point to the same memory. If such a mechanism is possible, it would be applicable to hugetlbfs and DAX, which preallocate or get assigned chunks at boot. And if DAX had support for these proper compound pages, it could also fix other problems we have with persistent memory, where we could pin faster or initialize some of these namespaces quicker. These are also things we are looking at at the moment, and hopefully we can have an update in a few weeks.
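To give a feel for what such vmemmap sharing could save, here is a small back-of-the-envelope calculation, again assuming 64-byte struct pages over 4 KiB base pages; how many vmemmap pages actually have to stay unique depends on the final design, so the "kept" figure below is only an assumption for illustration.

```c
#include <stdio.h>

int main(void)
{
	const unsigned long page = 4096, sp = 64;	/* base page and struct page sizes */
	const unsigned long huge_sizes[] = { 2UL << 20, 1UL << 30 };	/* 2M and 1G */

	for (int i = 0; i < 2; i++) {
		unsigned long nr_base = huge_sizes[i] / page;	/* 4K chunks per huge page */
		unsigned long vmemmap_bytes = nr_base * sp;	/* metadata per huge page today */
		unsigned long vmemmap_pages = vmemmap_bytes / page;

		/* Assumption: keep one unique vmemmap page (covering the head
		 * and first tail pages) and let the rest alias that memory. */
		unsigned long kept = 1, freed = vmemmap_pages - kept;

		printf("%4lu MB huge page: %lu struct pages = %lu KiB of vmemmap; "
		       "could free ~%lu of %lu vmemmap pages\n",
		       huge_sizes[i] >> 20, nr_base, vmemmap_bytes >> 10,
		       freed, vmemmap_pages);
	}
	return 0;
}
```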
So, to conclude: 5.10 is going to have a lot of this repurposing of DAX for volatile memory, and it provides a way to carve out struct page, which fits a given use case where your hypervisor does not need to provide that many kernel services. Proper DAX huge page support is going to continue, and we are looking at alternatives so that you don't have such a big compromise in giving away so many kernel services. What I was trying to propose here is to have sort of a harder boundary between what's hypervisor and what's guest or customer data. At least the lesson for me was that it is possible to give away such a core structure as struct page; it was interesting to find that not much is needed when your hypervisor doesn't need to provide that many services. And with that, thank you for listening; here are some links to some of the work I've been talking about. Thank you to Matthew, Mike Kravetz, and everyone else who was part of this work. Thank you.