Okay, can everybody hear me okay? Perfect. Sorry that you all have to stare at a mask, but it is what it is. For those of you who don't know me, my name is David. I'm with Red Hat, and today I'm going to talk about page table reclaim. I'm not necessarily going to present a real proposal, because it's a complicated matter. I want to raise awareness, and maybe we can have a discussion about what might work, what might not work, and what the obstacles are.

First, I'm going to discuss what we're dealing with and where it especially applies. Then I'm going to share some fun with page tables, meaning how you can easily have processes allocate a whole bunch of page tables, even in setups where memory overcommit is supposed to be disabled and there should be no out-of-memory killer zapping your processes. Then we can have a discussion. I have some points on the slide: what do we actually want to reclaim in which scenarios? When do we want to reclaim? How do we reclaim? And of course, how can we actually win against malicious processes that want to hurt us? Kill it. Kill it with fire.

Right, let's talk about reclaimable page tables first. Just a quick recap; I assume most of you here know what I'm talking about. For user space processes, we have these nice page tables. Their layout and behavior vary depending on the architecture and configuration: how many levels we have, whether we support gigantic huge pages, whether we have transparent huge page support, but of course also what the page table layout is, for example whether a page table looks the same on each level or differs. And of course we have different base page sizes. On top of that, we have various corner cases like contiguous-PTE huge pages, which is weird.
PTE-mapped THP, and a lot more weird stuff that complicates things, like page directories for huge pages, which are not page tables but something in between. But for this talk, I'll be focusing on the x86 layout, which is a base page size of four kilobytes, a page table size of four kilobytes and a page table entry size of eight bytes, meaning that each page table in our hierarchy has 512 page table entries. This makes calculations a bit easier.

Now, the issue with page tables is that they are unswappable and unmovable. Everybody who knows me knows that I deal a lot with memory hot-unplug and that I hate unmovable and unswappable memory, because it makes things very weird when you want to hot-unplug memory. Whenever we allocate a user space page table, it's charged as a kernel allocation. In cgroup v1, we actually had a mechanism to limit the amount of kernel memory that a cgroup could allocate, which was nicely described here as a way to avoid denial of service on your system. In cgroup v2, it no longer exists, as far as I understood. I'm not saying that's a bad thing, because usually, if you want to set a limit on the amount of kernel allocations a process can have, you can only come up with magic numbers, so that's no good. So maybe we should much rather try to find out why we have so many page tables and reclaim them, instead of using the hammer on the nail and just saying we're going to limit the amount of kernel memory a process or a cgroup can consume.

If we take a look at the page table overhead, meaning how much overhead you have in page tables for an mmap in user space, a VMA, then in the scenario I gave, for each page you map you need eight bytes of page table entry. So that's roughly 0.2% of your mmap size.
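The overhead arithmetic can be checked with a few lines of Python. This is only a sketch under the assumptions stated in the talk (4 KiB base pages, 4 KiB page tables, 8-byte entries); the function name `table_bytes` and the 2 TiB example size are illustrative, and the result is the worst case for a fully populated, aligned mapping.

```python
PAGE_SIZE = 4096                   # base page size assumed in the talk (x86-64)
PTE_SIZE = 8                       # bytes per page table entry
ENTRIES = PAGE_SIZE // PTE_SIZE    # 512 entries per page table


def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)


def table_bytes(mapping_bytes):
    """Worst-case bytes of PTE/PMD/PUD tables for a fully populated mapping."""
    pages = ceil_div(mapping_bytes, PAGE_SIZE)
    pte_tables = ceil_div(pages, ENTRIES)        # one PTE table per 2 MiB
    pmd_tables = ceil_div(pte_tables, ENTRIES)   # one PMD table per 1 GiB
    pud_tables = ceil_div(pmd_tables, ENTRIES)   # one PUD table per 512 GiB
    return {
        "pte": pte_tables * PAGE_SIZE,
        "pmd": pmd_tables * PAGE_SIZE,
        "pud": pud_tables * PAGE_SIZE,
    }


if __name__ == "__main__":
    two_tib = 2 * 1024**4
    for level, size in table_bytes(two_tib).items():
        print(f"{level}: {size / 1024**2:.3f} MiB")
    print(f"PTE overhead ratio: {PTE_SIZE / PAGE_SIZE:.4%}")
```

Each level up divides the cost by 512, which is why the PTE level dominates and the PUD level barely matters.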
When you take a look at the PMDs, it's even less, and once we are at the PUD level, it gets fairly negligible. But if we take a look at, for example, a single two terabyte mapping, we can have four gigabytes of PTE tables, which is a lot, and eight megabytes of PMD tables, which can still be problematic, especially assuming we have even bigger mappings and user space tries to do nasty stuff with that. On the PUD level, we usually don't care anymore. One interesting thing about anonymous transparent huge pages is that we always allocate the PTE table as well, because we have to: in pretty much any context, we have to be able to insert a PTE table to split the PMD mapping. That means you cannot get rid of the 0.2% just by using transparent huge pages and being lucky. It doesn't work that way.

We already have page table reclaim to some degree in upstream Linux: whenever we change our VMA layout, for example when we munmap or mmap with MAP_FIXED or something like that, we remove all now-empty page tables, because they're outside of our VMAs. So we can remove quite a lot of page tables by doing expensive VMA changes. But there are use cases that really don't want to do that, which are essentially sparse memory mappings, meaning you have a very large VMA space and you want to dynamically allocate and deallocate memory while avoiding, for example, running out of VMAs. There are benign user space applications that make use of that: for example memory ballooning and friends, like virtio-balloon and virtio-mem, where, when exposing memory to virtual machines, the hypervisor would have a very large sparse mapping; and memory allocators that do the same, which really want to avoid expensive VMA changes. And I learned that there are fancy matrix algorithms that rely on sparsity in the layout.
That's why you have a very large sparse mmap area. And of course, if we consider a large file that you map into memory, not all pages might be mapped into the page tables at a single point in time, because they might just have been written out to storage and evicted. In that case, there is essentially also a sparse memory mapping at that point.

Unfortunately, malicious user space can make use of that as well. There are ways to trigger allocation of a lot of page tables for a process, and you can actually calculate how much harm can be done just by taking a look at the maximum VMA space a single process can consume. What we can do then is break memory overcommit, break swapping, break compaction, CMA, ZONE_MOVABLE, so it's all a big mess, I would say, if we try to poke it with a stick.

For sparse memory mappings, there are two cases I think can happen. On the one hand, we can have a lot of empty page tables that we can just remove. That can happen with madvise with MADV_DONTNEED semantics, fallocate with punch hole, or just ordinary reclaim, for example of a file mapping. But there are also other page tables that we might be able to reclaim because they are reconstructable, and reconstructable here means that if we throw away the page table, we can just recreate the content, because we know what we're doing. For example, on certain mappings we might map the shared zero page on read access, and we can do the same again if we were to throw away the page table. The same holds, I assume, for page cache pages, because we can just refault them from the page cache and we're fine. So even though we are primarily looking at empty page tables and reclaiming them, I want to raise awareness that there is actually much more harm that can be done, for example with the shared zero page.
So what we have upstream right now is a proposal by [unclear] as a way to reclaim at least the PTE tables, meaning the lowest-level page tables, as soon as there are no entries left. It works by having an additional counter inside of the struct page, and that counter keeps track, on the one hand, of entries inside the page table that are not empty and, on the other hand, of current page table walkers. That means as soon as there are no page table walkers anymore, so nobody references the table, and there are no entries anymore, we can remove the empty page table, whenever you unmap or zap, as I call it. So if you do an madvise MADV_DONTNEED on a single page and it's the last remaining page, we can actually remove the page table, even page tables that span VMAs or page tables that don't even lie inside of VMAs anymore.

Now we come to the fun part with the shared zero page. I have a couple of examples. This is the simplest example of how you can trigger allocation of a lot of empty page tables. You have a very large mapping. I start with [unclear], but usually on my laptop it's 64 terabytes; I think it varies with processor support. You map it readable and writable, because we want to write to it, and because it's so big and we might have setups that restrict memory overcommit, we might have to pass the MAP_NORESERVE flag here. On most systems, but not all, we can actually create such a large mapping. Then we write to one page inside each two megabyte fragment and remove it again immediately. Each write populates exactly one PTE table, plus the hierarchy above it, and maps just a single entry in there. You can consume quite a lot of memory with that.
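The slide code is not part of this transcript, so the following is only a minimal Python sketch of the pattern just described: map a large anonymous area with MAP_NORESERVE, write one page per 2 MiB span, and drop the page again with MADV_DONTNEED. The names `populate_sparse` and `read_vm_pte_kib` are made up for illustration, and the `0x4000` fallback for `MAP_NORESERVE` is an assumption that holds on Linux x86-64 when the Python version does not export the constant.

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE
PTE_TABLE_SPAN = 512 * PAGE_SIZE  # memory covered by one PTE table (2 MiB with 4 KiB pages)
# Older Python versions do not export MAP_NORESERVE; 0x4000 is its value on
# Linux x86-64 (assumption, other platforms may differ).
MAP_NORESERVE = getattr(mmap, "MAP_NORESERVE", 0x4000)


def read_vm_pte_kib():
    """Return the VmPTE value of this process in KiB, or None if unavailable."""
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmPTE:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None


def populate_sparse(mapping_size):
    """Write one page per PTE-table span, then zap it with MADV_DONTNEED."""
    m = mmap.mmap(-1, mapping_size,
                  flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | MAP_NORESERVE,
                  prot=mmap.PROT_READ | mmap.PROT_WRITE)
    touched = 0
    for offset in range(0, mapping_size, PTE_TABLE_SPAN):
        m[offset] = 1                                      # write fault allocates a PTE table
        m.madvise(mmap.MADV_DONTNEED, offset, PAGE_SIZE)   # page is freed again right away
        touched += 1
    return m, touched


if __name__ == "__main__":
    before = read_vm_pte_kib()
    mapping, count = populate_sparse(64 * PTE_TABLE_SPAN)  # deliberately small: 128 MiB
    print(f"touched {count} spans, VmPTE {before} -> {read_vm_pte_kib()} KiB")
    mapping.close()
```

The demo size is kept small on purpose. Scaling `mapping_size` to terabytes is exactly the behavior the talk warns about and can pin gigabytes of unswappable memory.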
But the thing is, if you have a system where memory overcommit is completely disabled, even MAP_NORESERVE won't help you. In that case you cannot allocate, for example, 64 terabytes on that notebook. It's not going to work with memory overcommit disabled. But that's where the fun begins, because what if we do a read-only mapping? Read-only mappings, the kernel allows for that. We can have as much as we want even if memory overcommit is disabled, because what could possibly go wrong? We don't need MAP_NORESERVE anymore. It will just work. What we can do then is read a single page, and what the kernel will do is populate the shared zero page. So we have a lot of page tables just filled with the shared zero page. And yeah, this works quite fine. We can consume a lot of page tables that way. We could argue that these are reclaimable page tables, because we can just throw away a page table that maps the shared zero page, and whenever there is an access again, we simply refault the shared zero page.

But there's also an example that I think is valuable, which is really unreclaimable page tables. It's essentially the same mechanism. There is a lot more code, because we have userfaultfd involved. And usually when you hear userfaultfd, you have to be careful, because it relies, for example, on stuff that has already been faulted into a page table not triggering a fault again. What we do here is even more fun, because we have a PROT_NONE mapping. So we don't even need read semantics anymore. We can have as much of that as we want; 64 terabytes works on my notebook. And then we just place a zero page into that PROT_NONE mapping using userfaultfd, and we can trigger the exact same thing.

I have a simple example here that throttles the allocations a bit. Up here we can see that right now we have 28 megabytes of page tables.
And if I, for example, trigger the reclaimable page tables part here, you can see that it goes crazy fairly easily. We're hitting 200 megabytes, and I'm throttling the allocations right now. On the other hand, you can see that this process has 65 terabytes of virtual memory and consumes essentially no other memory. Just looking at that, it looks a little bit weird. We've already hit one gigabyte, so I'm going to stop, because otherwise I'm going to run into issues here. But you can see that it's fairly easy to trigger this.

When I realized what kind of magic you can do with that, it got even more fun. Does anybody here know an interface that you can use in user space to observe whether something is mapped or not at a certain virtual address? It's /proc pagemap, if that clarifies it. With the pagemap, you can identify whether the shared zero page is mapped at a virtual address or not. And you can control whether the shared zero page is mapped there or not. So what you can essentially do is write a memory allocator that stores information in the page tables. That means you store one bit per PTE, which is an overhead of eight bytes for one bit, which is bad. But you can do it. This is just the example I have: you can allocate, for example, two megabytes, and what it will do under the hood is reserve an area that is two megabytes times the size of a base page. Then you just poke it with a stick, and you can actually store and retrieve data in there. And it's a PROT_NONE mapping. I think that's the weirdest part: you can store data in there.

What I find a little bit concerning is that you could hide stuff in there. You could hide something from an antivirus scanner. You could hide secrets.
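To make the "observe via pagemap" step concrete, here is a small Python sketch that decodes a single pagemap entry. The bit positions follow the kernel's pagemap documentation (bit 63 present, bit 62 swapped, bit 55 soft-dirty, bits 0-54 PFN); the helper names are made up for illustration, and this is not the allocator shown in the talk. Without extra privileges the kernel reports the PFN as zero, but the present bit is still visible, which is the one bit such an allocator would read back.

```python
import struct

PAGEMAP_ENTRY_BYTES = 8  # one 64-bit entry per virtual page


def decode_pagemap_entry(entry):
    """Split a raw 64-bit pagemap entry into the fields used here."""
    present = bool((entry >> 63) & 1)
    return {
        "present": present,                         # something is mapped in the PTE
        "swapped": bool((entry >> 62) & 1),
        "soft_dirty": bool((entry >> 55) & 1),
        # PFN field; reads as 0 for unprivileged callers
        "pfn": entry & ((1 << 55) - 1) if present else None,
    }


def read_pagemap_entry(vaddr, page_size=4096, path="/proc/self/pagemap"):
    """Read the raw pagemap entry covering virtual address vaddr."""
    with open(path, "rb") as f:
        f.seek((vaddr // page_size) * PAGEMAP_ENTRY_BYTES)
        data = f.read(PAGEMAP_ENTRY_BYTES)
    if len(data) != PAGEMAP_ENTRY_BYTES:
        raise OSError("short read from pagemap")
    return struct.unpack("=Q", data)[0]  # native byte order, unsigned 64-bit
```

A bit store along the lines of the talk would set a bit by read-faulting a page (mapping the shared zero page), clear it with MADV_DONTNEED, and test it through the `present` field.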
I think this is even a better way to store secrets than, for example, using mlock or something like that, because there's no way somebody else can understand what's going on here unless they're able to inspect the page tables using the pagemap. So yeah, this is fun, I would say.

Now, obviously, the question is what we can do about that. I started to write up some stuff. I hope that maybe somebody here has an idea of what we can do. But a question we have to ask ourselves is: what do we actually want to reclaim? On the one hand, we have empty page tables. That's the low-hanging fruit, I would say, for example using the approach that we have upstream right now. Reclaimable page tables are a little bit more involved, for example when we map the shared zero page. And then we have unreclaimable page tables. Here the problem is, for example, that you have a page table that maps the shared zero page but you have userfaultfd enabled. How would you know that you can actually throw away that page table, including the information that's valuable for userfaultfd? The same applies to soft-dirty tracking, where we store additional information inside the page table, for example that a certain page has been soft-dirtied. So we have to be very careful with that.

Another question we might want to ask: do we only want to concentrate on leaf page tables, meaning PTE tables, as in that upstream proposal, or do we care about higher-level page tables? I think we at least care about PMD tables, as my calculation showed. But maybe, if we already have to tackle two levels, we just want to do it completely and reclaim all of them where possible.
The other thing is whether we would want to care about hot page tables. For example, if you have a page table that is reclaimable, meaning it maps something right now, would you want to throw that away and then refault stuff from the page cache, or would you want some kind of awareness of which page tables have recently been accessed? We don't have something like an LRU for that, but it would certainly be interesting. I guess this would be one thing to look into once we have actually decided that we care about such scenarios and have figured out when we want to reclaim.

Which brings us to the next question: when do we want to reclaim? The easiest case is when a page table is empty, as the upstream proposal showed, because you can have some smart logic that detects that the page table is empty, for example using that reference counter, and removes it. But what about the shared zero page, as we've seen? How would you want to do that? You would actually have to scan the page tables somehow, and scanning a process's page tables is expensive and might need the mmap lock in read mode. So one idea I had is that maybe you would want to start reclaiming as soon as there is a suspicious amount of zero page usage. The question is then: what is suspicious usage of the shared zero page? In virtual machines, for example, we often have cases where a lot of the shared zero page is mapped, which is a benign use case and not a malicious one. I actually have no idea how you would want to tackle this, except by special-casing, for example, the shared zero page. Another option would be to reclaim on magic conditions. We saw on my notebook that the process was consuming one gigabyte of page tables and had essentially no resident memory and nothing in there.
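A condition of the kind just described, lots of page tables but hardly any resident memory, could be expressed from user space roughly as follows. This is a hedged sketch, not a proposed kernel heuristic: it parses the `VmPTE` and `VmRSS` fields of `/proc/<pid>/status`, and both thresholds (`min_pte_kib`, `max_rss_per_pte`) as well as the function names are arbitrary placeholders.

```python
def parse_status_kib(status_text, key):
    """Return the KiB value of a 'Key:   123 kB' line from /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith(key + ":"):
            return int(line.split()[1])
    return 0


def looks_suspicious(status_text, min_pte_kib=1024 * 1024, max_rss_per_pte=1.0):
    """Flag a process whose page tables are large and outweigh its resident memory.

    min_pte_kib:     only consider processes with at least this much VmPTE
                     (placeholder default: 1 GiB, the figure from the demo).
    max_rss_per_pte: flag if VmRSS <= VmPTE * this factor (placeholder).
    """
    pte_kib = parse_status_kib(status_text, "VmPTE")
    rss_kib = parse_status_kib(status_text, "VmRSS")
    return pte_kib >= min_pte_kib and rss_kib <= pte_kib * max_rss_per_pte


if __name__ == "__main__":
    with open("/proc/self/status") as f:
        print("suspicious:", looks_suspicious(f.read()))
```

For comparison, a fully populated mapping has roughly 512 times more resident memory than PTE tables, so a ratio near or below 1 is far outside the normal range.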
So maybe that could be an indication that you would at least want to give it a shot and try to reclaim. Once you find out that you actually weren't able to reclaim that much, you might just flag the process as doing something special, for example that it really is a virtual machine that just results in these weird scenarios. But I guess it's also more a matter of finding heuristics you could apply.

And last but not least, one idea I had: instead of scanning processes, what if your page tables were actually movable or migratable, or at least to some degree, so that you would detect from memory migration context or memory compaction context that the page you're looking at right now is either an empty page table or one that just maps the shared zero page, and then remove the page table from that PFN walker? So I think there are two approaches. On the one hand, we would scan the process page tables to find shared zero pages. On the other hand, we could scan the PFNs, for example during memory compaction, and once we stumble over an empty page table, just remove it. As compaction might be triggered once we're low on memory, that might be worthwhile to look into.

Michael, you have a question or a comment?

Comment, ideally. I think it's working. Okay, it's working. So the first question I would have is: would it make sense to categorize those different page table pages, to separate those that are clearly reclaimable, those that are really easily disposable, and put them on an LRU like the page cache and age them like the page cache? You do not have a reference bit for them, but you can judge by how many page tables there actually are, or how many entries there are in that page. I think even on x86 there is an accessed flag for page tables, at least in the upper-level entries, but it would be harder to get access to.
But essentially, put those pages onto LRUs and use the general scanning algorithms that we have, because that was something we were playing with as an idea in the past. I don't think we ever had code for that, but it seems that for those cases where we are unmapping without really calling munmap, it would probably be needed anyway. So just put those pages on LRUs and see whether they can be reclaimed. At least that would work for page cache and for page table pages that are pointing to zero pages and cases like that. Have you considered that or played with it?

I haven't considered that yet. I mentioned here that we don't have something like an LRU, but we could make it work eventually, and then it would make sense to only put, as you said, the obviously reclaimable page tables in there and nothing else. It's especially an issue once you have VMAs that, for example, have userfaultfd enabled. It could work, but for me one issue is still the example I have with userfaultfd: it relies on the fact that you're not allowed to remove the shared zero page, and I wonder how you could make that work with userfaultfd. Even if we fix the other stuff, malicious user space could still go ahead and use userfaultfd to trigger the same thing. Then we might want to think about, for example, disabling the placing of a shared zero page for certain applications, to make it available only to privileged processes, so to say. So I completely agree with the part about reclaimable or reconstructable page tables, to enlist them and to scan them. The interesting thing is then how to actually reclaim page tables, which brings us to the next question: how to reclaim.
The proposal that we have upstream makes use of the fact that each page table walker actually references the page table using the PTE ref, and your entries reference it as well. If you want to remove the page table, you have to be the last one to drop the reference count, and then you remove it. That is quite neat, because you don't really need the mmap lock or something like that to make it work. You don't need any page table locks except the one of your parent, I think, and you can look up the parent using the struct page. So that could work.

The thing is, as soon as you want to zap something in a page table from the context of just the page table, from a PFN walker, I think it could be quite challenging to, for example, notify MMU notifiers about that, because you don't really have the mmap lock held. So that might be quite interesting to dig into further. I also mentioned here that scanning process page tables needs the mmap lock. Once you scan, you can actually zap entries, for example the shared zero page, page cache pages, whatever you want, and you can properly synchronize with MMU notifiers and do proper TLB flushes, because you're in the context of the process. But I guess maybe it's just similar to ordinary page reclaim, where you're also outside of the process context: you have your rmap walk, and your rmap locks protect you while doing stuff. So maybe that could work.

The other idea is when we don't look at the example of the shared zero page, which is really nasty, but just at page tables that are mapping page cache pages, so just a file mapping. When we are under memory pressure, we already start reclaiming, and during reclaim we essentially produce empty page tables when doing it right.
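The reference-counting scheme described above can be illustrated with a toy model. This is a simplification written for this transcript and not the code of the upstream proposal: a single counter tracks both populated entries and active walkers, and whoever drops it to zero frees the table. The class and method names are made up.

```python
class PteTable:
    """Toy model of a PTE table that is freed by its last reference holder."""

    def __init__(self):
        self.entries = {}     # index -> mapped value (only non-empty entries)
        self.refcount = 0     # populated entries + active walkers
        self.freed = False

    def _put(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True  # last reference gone: table can be removed

    def walker_get(self):
        """A page table walker pins the table before looking at it."""
        if self.freed:
            raise RuntimeError("table already freed, walker must retry from the parent")
        self.refcount += 1

    def walker_put(self):
        """The walker is done with the table."""
        self._put()

    def set_entry(self, index, value):
        """Populate an entry; a newly non-empty entry holds one reference."""
        if index not in self.entries:
            self.refcount += 1
        self.entries[index] = value

    def clear_entry(self, index):
        """Zap an entry; dropping the last reference frees the table."""
        if index in self.entries:
            del self.entries[index]
            self._put()
```

In this model, zapping the last entry while a walker still holds its reference does not free the table; the free happens when that walker lets go, which mirrors the "be the last one to drop the reference" rule from the talk.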
We unmap the page, we zap individual page table entries, and if we could teach vmscan to eventually also consider locality inside of a page table to some degree, then you could maybe make vmscan produce empty page tables, and the empty page tables could be removed using a different mechanism. Of course, this won't work for the shared zero page, and there are some other corner cases, I think: once you have mlocked memory, vmscan will not zap memory in other processes just because one process has a certain page mlocked. But these are corner cases.

Then again, the other thing is reclaiming during memory compaction or migration, and the big question for me is how to reclaim from a PFN walker. It's fairly complicated. Recently I took a look at page table migration, which is completely insane, but you essentially have the same problem: when you hot-unplug memory, you might have to migrate memory away, and out of that PFN walk you would have to move a page table, which is also quite complicated, especially when it comes to locking and making sure that all page table walkers are properly excluded and synchronized.

This brings us to the other point: removing a page table is actually, I think, the most difficult part, and the upstream proposal made it quite nice with the PTE ref and synchronizing against all of these. The question for me would be how you could make that work, for example, for PMD tables, because it's no longer that easy. The further you move up in your page table hierarchy, for example if you want to reclaim a PUD table, the more likely it is that there is some page table walker scanning over the page table and holding a reference for all eternity. I think one prime example is when you tear down a really big process: it just takes a long time to tear down that whole mapping while holding the mmap lock, and it can, I think, take up to 30 minutes.
So during these 30 minutes, somebody could just be holding a reference to this higher-level page table, blocking you from reclaiming it. For me that would mean that once we start reclaiming page tables that are higher in the hierarchy, you would actually have to teach page table walkers to temporarily let go of page table references and look them up again. Otherwise you might end up with scenarios where you cannot reclaim a page table for 30 minutes simply because somebody else is holding the mmap lock and doing a big pass over all of the entries in there, for example using our page table walk infrastructure.

Last but not least: how to win against malicious user space. Obviously, if you start reclaiming out of vmscan context or something like that, and user space just goes crazy populating shared zero pages as in my examples, then if you reclaim in a naive way, only once you run into memory pressure, a malicious user space could still do quite some harm to your overall machine, I would say. Especially once you think about running multiple containers and somebody going crazy on unmovable memory, it could still be problematic. One idea I had was this: throttling might be difficult, but what you might want to think about is flagging a certain process as a candidate for page table reclaim because it's doing something suspicious, like consuming one gigabyte of page tables. Once you detect that, you could flag that process to really trigger reclaim before going back to user space, and while it is stuck in that mode, it would not be able to allocate another set of page tables. But of course it gets more difficult if you have multiple processes involved.

Yeah? Sorry, I can repeat: are there any known workloads, like real workloads, which have a high ratio of page
table size to physical memory size?

The examples I have here are, for example, memory ballooning, which we have sometimes, or even virtio-mem. You have, for example, a two terabyte mapping, and the guest was using part of that memory. Then you reboot, and you remove all of the memory using madvise MADV_DONTNEED and might not want to give it back to the guest. It could then happen, if you have two terabytes of that, that you actually have four gigabytes of page tables left. So it looks like "oh, my VM is not actually consuming memory via this mapping", and it isn't, but there are still so many page tables lying around. That would be a benign use case; you could say "oh yeah, we have to reclaim that memory somehow", and that's okay.

But the thing is that, especially with memory ballooning, you could also end up in a scenario where you have populated the shared zero page everywhere. Once your guest operating system has inflated the balloon, the memory is handed back and we MADV_DONTNEED it. But especially older Linux systems and, I think, Windows: when you trigger a memory dump inside your guest operating system, they just don't care whether a page has been inflated or not, so they end up dumping all of the inflated memory. That results in read accesses to all of the VMA, so you have populated the shared zero page everywhere. That is also a sane use case in that sense, although it's really a niche corner case. And again, the thing here is that if you have something malicious running in your virtual machine, it could affect the hypervisor just by reading, for example, balloon-inflated memory, so you could end up with more or less the same scenario.

But coming back to your question: I think if you have a large mmapped file that is written back to disk or evicted, you can easily have such cases. So the bigger your
file mapping, I assume, the easier it is to get to that point.

Yeah?

To say something about the timing of when to reclaim and scan the tables: I think the multi-generational LRU may be a good opportunity. I think it does scan the page tables linearly, so that may be a good time to scan the tables and find the reclaimable ones, assuming we are going to merge the multi-generational LRU.

That's a good question. So what you're saying is that if you applied the multi-generational LRU, you could put the different levels of reclaimable page tables into different categories?

No, I mean the multi-generational LRU does scan the process page tables linearly to find the cold pages. So since you scan the tables anyway, I think you can find the reclaimable tables at the same time.

I mean, that's true, you could do that. I was actually implementing a naive way of doing that: I was scanning processes and zapping the shared zero page to see what I could do. As you say, the multi-generational LRU also scans the page tables, I think to extract the accessed bit, if I'm not wrong, and the dirty bit. What I ended up doing is that I didn't hold the mmap lock in read mode the whole time; I was scanning similarly to what GUP-fast does. I was disabling interrupts and scanning one page table, and if it was a candidate, I would go ahead, take the mmap lock in write mode and remove the page table. But as soon as you have to take the mmap lock in write mode, you're essentially doomed against anybody who is malicious, because they can just end up blocking you and doing crazy kinds of allocations. But yeah, I agree.

But anyway, I think we can find the reclaimable tables then, and as for when to remove the tables, I think that could maybe be delayed to some time later.

Yeah, the issue is... oh yeah, I'm going to answer that, and then we're done with the talk. So we could defer that, but even if you defer, you always have the issue that, if you do it in a naive way, removing
page tables is a very expensive operation. If you take a look at khugepaged, it has to take the mmap lock in write mode and it has to take the rmap locks in write mode, and as soon as you have a page table that's spanned by multiple VMAs, you're going crazy, because you have to take so many write locks. So we have to come up with a better way to remove a page table than the naive way. Using this PTE ref, I think it works for the PTE tables. I'm not so sure it works, for example, for PMD tables, because once you remove a PTE table because it was the last one, it would have to cascade, and the cascading is problematic with the way it does locking: taking the parent lock of the page table, making sure that the entry can't change, stuff like that. So I think that's going to be the biggest issue. Finding page tables to reclaim might be easy; removing them might be hard.

But to conclude: I think once we solve that problem to some degree, we might be able to reuse it for khugepaged to remove a page table maybe more cleanly or more easily compared to this whole mmap-lock-in-write-mode dance. So that concludes my presentation. Thanks a lot for your time, and we can talk in the hallway if there are any questions. Thanks a lot.