Right, so I'm going to talk about direct map fragmentation and how we can reduce it by caching large pages at some level in the core mm.

First of all, this is pretty much x86-centric, because other architectures either don't care or just cannot fragment their direct map: ARM can't implement the set_memory primitives when it maps the physical memory with large pages, and PowerPC implements something entirely different which, I think, works only at boot time. So for now it's pretty much x86 that cares about large pages in the direct map.

The cases that do fragment the direct map, as of now, are anything that allocates code (modules, BPF programs, ftrace, kprobes), and there is secretmem. Vlastimil said that there is a potential use case for these things, for reducing direct map fragmentation, for SNP and TDX; I can't say I understand how, but maybe it is the case. And as we get more and more secure, there will be additional protection mechanisms, I presume, that will use set_memory operations to change permissions on some range of the direct map, and that will make the situation worse. For example, Rick Edgecombe posted a patch about using PKS for page table protection, and that also required 4K pages, and so on.

There were several ideas. The first thing that came up when I posted secretmem was: "oh, you're fragmenting our direct map". So the idea was to cache: allocate a 2M page every time, and hand out the 4K pages from it to the users that need different permissions at 4K granularity. There were several different proposals for caching them. The initial secretmem caching did pretty much what I just said, inside secretmem. Then Rick proposed to use shrinkers close to the user, so that whenever there is memory pressure, those shrinkers release the 4K pages from the fragmented parts of the direct map. And the last one, which is what I tried to do, is to implement a GFP_UNMAPPED flag and a new migrate type for unmapped pages.

The idea is that once we have a request for GFP_UNMAPPED, we use the existing page allocator mechanisms to populate MIGRATE_UNMAPPED: the entire pageblock is marked as unmapped, and all the free pages in that pageblock are removed from the direct map. This causes a single split of a 2M page, and we get up to 2M worth of free memory that can be allocated to those who need protections at 4K granularity. Then, whenever such a page is freed, it remains unmapped on the free list, and it is the caller's responsibility to map it somewhere; for instance, secretmem would map it into the user page tables.

I did an experiment with vmalloc, with module_alloc on x86, which allocates the memory for code pages. A vmalloc that gets GFP_UNMAPPED will essentially do the mapping inside the vmalloc area, and it kind of works. I'm not a huge expert on the page allocator, so I could have missed something really important, and I still run into issues with TLB flushing here and there. But what do you say, does this look sane?
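A minimal sketch of the conversion step just described. __GFP_UNMAPPED and MIGRATE_UNMAPPED are the names from the talk's RFC; the helper below and its placement are illustrative assumptions, not the actual patch:

    /*
     * Sketch: the first __GFP_UNMAPPED allocation from a pageblock
     * converts the whole block, so the 2M direct map entry is split
     * exactly once and later 4K allocations reuse the same block.
     */
    static void pageblock_to_unmapped(struct page *page)
    {
            unsigned long addr = (unsigned long)page_address(page);
            unsigned long i;

            set_pageblock_migratetype(page, MIGRATE_UNMAPPED);
            /* drop every 4K page of the block from the direct map... */
            for (i = 0; i < pageblock_nr_pages; i++)
                    set_direct_map_invalid_noflush(page + i);
            /* ...then flush the stale large-page translation once */
            flush_tlb_kernel_range(addr,
                                   addr + (pageblock_nr_pages << PAGE_SHIFT));
    }

    /* a user such as secretmem would then just do: */
    static struct page *secretmem_alloc_page(gfp_t gfp)
    {
            return alloc_page(gfp | __GFP_UNMAPPED);
    }

Freeing would leave the page on the MIGRATE_UNMAPPED free list without touching the direct map, which is where the savings come from.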
Nobody knows? Michal, can you give the mic there?

Yeah, so my experience is that smashing everything into the page allocator is usually not the best way to do it, because essentially any change like that, made for what are mostly outliers, adds overhead for the cases the allocator is heavily optimized for. And I'm not sure whether we have Mel online; he probably has a lot to say about that.

I'm over here. Can I be heard? Yes, I can hear myself coming back, thanks. I think to wrap something like that strictly into the page allocator would be serious overkill. There would be the storage requirements in the zones and whatnot, but aside from that, MIGRATE_UNMAPPED cannot fall back to any other migrate type, nor vice versa, because of the nature of what it is. Meaning you're just carving out part of the available allocator for one user, and for one user it would be more appropriate to use a mempool, one that only gets refilled in pieces of two-megabyte pages.

It isn't carved out: MIGRATE_UNMAPPED falls back to unmovable and movable, and the normal migrate types can fall back to MIGRATE_UNMAPPED.

Then you have pages within it that have a different direct map state. What's the point at all, then? Why do you need MIGRATE_UNMAPPED when you can just unmap any page?

The page is unmapped only when it is on a MIGRATE_UNMAPPED free list, or when it is used by something that maps it elsewhere. When there is a fallback from any other migrate type to MIGRATE_UNMAPPED, it restores the direct map mapping to present. It wouldn't restore the 2M mapping, but the page will still be present in the direct map.

It seems that it only shuffles the problem around, because the most important aspect is the timing. Granted, it's a migrate type, so the pages are kept all together and get freed in bulk or allocated in bulk, but once other migrate types are able to fall back to it, it eventually gets polluted and you lose all the advantage. So, basically, given a long enough length of time, the direct map gets fragmented anyway. It just feels that...

But I don't think we have any solution that won't fragment the direct map in the end, on a long enough running machine.

I'd still be reluctant to do it for something that is eventually going to fragment anyway. I don't see why we'd carry the complexity inside the page allocator when something simpler would do the same job.

Another possibility is to carry this complexity next to the page allocator. Or to do it for every user: BPF has its own cache, secretmem has its own cache, page tables with PKS have their own cache. I don't really think that's so great. So if we are to avoid fragmentation of the direct map, or at least reduce it, we need to have the caching somewhere, right?

Yeah, and I would say a mempool, because migrate types are not free: they need the additional free lists for each migrate type that's added, and it increases the size of the pageblock bitmap as well. So you're adding all of these small amounts of overhead, and while that's not an absolute killer for the concept, I'm just not a fan, on the basis that we're increasing the complexity cost and the storage cost of the page allocator. It could just have been done with a slab cache, one that only ever allocates order-9 pages and splits them into 4K chunks, or, again, a memory pool that only refills or depletes in two-megabyte units. I had an RFC a while ago that was a dedicated cache which allocates 2M chunks and splits them up, and the feedback I got from several people was to probably just move it into the page allocator.
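A rough sketch of that dedicated-pool alternative: a pool that is only ever refilled in 2M units, unmaps each chunk once, and hands out 4K pages. Everything here (names, locking, refill policy) is an illustrative assumption:

    #include <linux/gfp.h>
    #include <linux/list.h>
    #include <linux/set_memory.h>
    #include <linux/spinlock.h>

    static LIST_HEAD(unmapped_pool);
    static DEFINE_SPINLOCK(unmapped_pool_lock);

    /* refill only in 2M units, so the direct map splits at most once per chunk */
    static int unmapped_pool_refill(gfp_t gfp)
    {
            unsigned int order = PMD_SHIFT - PAGE_SHIFT;    /* 2M on x86 */
            struct page *page = alloc_pages(gfp, order);
            int i;

            if (!page)
                    return -ENOMEM;

            /* unmap the whole chunk in one go (x86-only primitive) */
            set_memory_np((unsigned long)page_address(page), 1 << order);
            split_page(page, order);        /* turn it into 512 4K pages */

            spin_lock(&unmapped_pool_lock);
            for (i = 0; i < (1 << order); i++)
                    list_add(&page[i].lru, &unmapped_pool);
            spin_unlock(&unmapped_pool_lock);
            return 0;
    }

Allocation would just pop a page off the list, and freeing would push it back, with no further direct map changes in the common case.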
I'm sure you didn't look at that.

No, I didn't. I can try to find it.

I think I'd be better off finding it after the call, since we're using up time. I can send it somewhere.

So that idea was to have a cache next to the page allocator, and it still had some hooks into the page allocator. Probably slab would be a better idea, I don't know. I know that the BPF folks got a strong pushback for using huge vmalloc for this, and I also think vmalloc doesn't really help them, at least in their case: they didn't have a fallback to 4K pages, so eventually they would run out of 2M pages and that would stop everything.

Okay, so for the GFP_UNMAPPED part, if I understand you correctly: we are also experimenting with a similar idea, but for a different purpose. Basically, we want to improve compaction and to reduce external fragmentation. Our idea is to group pageblocks by page type. The idea has been around for a long time: we want to group allocations into different pageblocks by lifetime, in addition to movable versus unmovable. So the general idea is that anon pages go into one group of pageblocks, file pages go into another, a different group of pageblocks, and we also distinguish mapped and unmapped pages, because migrating a pageblock containing only unmapped pages would be faster: it doesn't require a TLB flush on unmapping, right?

There is a TLB flush when you unmap, and there is an additional TLB flush because we need to map it back, or unmap it back, in prep_new_page, right?

Right, because if one pageblock contains only unmapped pages and the other pageblock contains mapped pages...

Are you guys talking about two different kinds of "unmapped"?

I think it is different. I'm talking about... they're talking about a different case. Yes, yes.

Okay, so for the noise: Vlastimil, go ahead. Yeah, you're on mute if you're talking. No, we can't hear you. So we'll go on to the next question; you can type it in the chat, or see if you can figure out the mic.

I just have a general comment, and everybody who knows me knows that I detest unmovable memory. This memory isn't movable, it is unmovable, and as we discussed, there might be ways for secretmem to turn it conditionally movable. So my question would be: if you turn it into a migrate type, then once you have something like movable unmapped memory, you would need yet another migrate type to reflect that, I assume. For me that might be an indication that you might actually want to have external caches that resemble these details. For example, secretmem could support migration of unmapped pages in some scenarios, and TDX might not support it, but some other user might. So you would have two different caches that cache the different semantics. TDX might support it; I don't know how much effort it would be for KVM itself. But just in general, maybe in ten years, maybe in twenty, secretmem could support movable pages. PKS for page tables? I don't think so. And the code pages, say for BPF and modules, would also be unmovable, right? So just in general, if you punch this into a migrate type, you'd most probably need yet another migrate type once we turn some of it movable. At least to me, these are things that indicate it might be better to keep it somewhere else.

So my idea was to fall back first on unmovable anyway, then on reclaimable, and only as the last resort on movable. So when I take a pageblock from another migrate type, I try to avoid movable; it's unmovable anyway, and it's kept unmovable, right? But on long-running systems everything can happen, right? And an external cache: it still will be quite complex to make the pages in an external cache movable, unless you do a cache per user, and then you duplicate pretty much everything.
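The fallback order described here, sketched in the style of the fallbacks[] table in mm/page_alloc.c. MIGRATE_UNMAPPED is the hypothetical migrate type from the RFC, and the exact entries are an assumption:

    static int fallbacks[MIGRATE_TYPES][4] = {
            /* unmapped steals unmovable first, movable only as a last resort */
            [MIGRATE_UNMAPPED]    = { MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE,
                                      MIGRATE_MOVABLE, MIGRATE_TYPES },
            /* regular types may steal an unmapped block, re-mapping it at 4K */
            [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,
                                      MIGRATE_UNMAPPED, MIGRATE_TYPES },
            [MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE,
                                      MIGRATE_UNMAPPED, MIGRATE_TYPES },
            [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE,
                                      MIGRATE_UNMAPPED, MIGRATE_TYPES },
    };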
Okay, I would have another reason why the page allocator is probably not the best place: if you really have to fall back, or to reuse existing unmapped memory, it would do all the flushing, and essentially what you get is unpredictable and hard-to-debug performance spikes, which can then be seen in many places. I suspect this will generate more problems than it actually solves. So I guess I'm with Mel on the idea that you should probably build on top of the page allocator, make all those use cases use that, and then you get another slab cache or so.

I'll read Vlastimil's comment. He says: "If I recall correctly, the idea of using a new migrate type was to avoid the situation where we fragment memory with two-megabyte blocks for unmapped memory that are then sparsely used." He also goes on to say: "The fallbacks then allow using the memory, with the cost of fragmenting the direct map." You're unmuted right now on our side, Vlastimil, if you want to try to jump in again, but we weren't able to hear you last time.

Can you repeat the last sentence?

He said: "The fallbacks then allow using the memory, with the cost of fragmenting the direct map, and potential performance spikes on the allocation and free paths."

Although the TLB flushing can be made local if we do preempt_disable in prep_new_page. So the unmap and map in prep_new_page, and on the free path, would only need to flush the local TLB; it's essentially one page flushed from the local TLB.
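What that could look like, as a hedged sketch: the page is unmapped while preemption is off and only the current CPU's TLB entry for it is flushed. The helper name and placement are assumptions; flush_tlb_one_kernel() is x86-specific, and leaving stale entries on other CPUs is exactly the contentious part:

    /* called from prep_new_page()/the free path for MIGRATE_UNMAPPED blocks */
    static void unmap_page_local(struct page *page)
    {
            unsigned long addr = (unsigned long)page_address(page);

            preempt_disable();
            set_direct_map_invalid_noflush(page);
            flush_tlb_one_kernel(addr);     /* one entry, this CPU only */
            preempt_enable();
    }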
But we are talking about users that count every single cycle, and I mean every CPU cycle, when they are allocating. This would be in the range of unacceptable, so I'm not really sure this will fly.

Another way is to keep them on a...

Well, anyway, you still have to unmap and then flush the TLB, yes.

Yeah. So, just to wrap up: I would prefer to have less memory that the allocator is seeing, rather than having unpredictable performance spikes just because we try to squeeze out every single byte.

It's not exactly squeezing every single byte. It's more about trying to avoid fragmenting in a central place, because we have to do something anyway: BPF, for example, fragments the direct map for code, and the pressure on the iTLB is much more severe.

They're trying to fix that separately, though, with their custom allocator, where they just get two-megabyte pages and then pack and reuse them.

So we need like five custom allocators right now, right?

But you can build one.

And they got a huge pushback on that one, if you didn't see it. So I'll probably talk with Vlastimil about doing something slab-like, a slab flag maybe, so that every user would be able to do kmem_cache_create with it, and that kmem_cache would be backed by 2M pages, probably.

Can I be heard now?

Yes, yes.

Oh, great; I had to restart it. I think I wanted to note that, if I read correctly, the BPF stuff has another constraint: the programs are of different sizes, so they cannot use a slab cache, because that is designed for objects of the same size.

Yeah. They already have, even before the move to 2M pages, an allocator that shares memory between different programs.

No, no. Before, if it was just a couple of bytes, they would just module_alloc and that's all.

Yes. And then they would update the direct map and split it.

Yeah, I've read the LWN coverage of this. They are trying to push a consolidated allocator, which had to be disabled in the end.

And they need it by 5.19.

Yes. So we can probably forget about that use case; otherwise we can try the slab approach.

But I thought one of the use cases was also that memory that was initially allocated as normal would then be converted to unmapped on the fly, and that wouldn't be possible if it had to be pre-allocated from the special cache.

With the migrate type it can be done, at the cost of not being the most efficient. For the moment, if we do something like a slab that backs a cache with 2M pages and hands out 4K pages, it could work.

So there is indeed a limitation on object size, but such a cache would cover all the use cases I was talking about: they are all about getting a 4K page.

Okay, so a generic slab with... I don't know how much memory. Anyway, you can't have objects with different protections in the same page, right?

Yeah; I don't know if there are such users at all.

Okay, I guess you can try this approach.

Yeah, so I'm the one here on the BPF side. From my point of view, I think it makes sense to have some cache, to make sure we do not go all the way down to 4K pages. Think about a one-gigabyte page: that's one entry in the page table. You split it into two-megabyte pages: that's 512 entries. If we then split just one of those 512 pages into 4K pages, that's another 512, about 1K entries in the page table. So it feels like we just split a little bit more, but it actually doubles the overhead in the number of page table entries. In our experience, we have been using all sorts of techniques to reduce the TLB overhead, especially for the instruction TLB. So I feel it makes sense to have a special cache for all the executable memory: kernel modules, trampolines, BPF programs. It could be just a handful of two-megabyte pages, just for executable code. From our experience that's a huge performance benefit: we have huge pages for application text, and that's like a five to ten percent performance benefit; multiplied by the size of our fleet, that's a huge number. We also see popular BPF programs on the list of instruction TLB misses, and that's why we started thinking about how we can do this. So from my point of view, if we have caches that only use two-megabyte pages, and we do not go all the way down to four kilobytes, the benefit will be significant.
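Mike's slab-flag idea from earlier in the discussion, as a minimal sketch. SLAB_UNMAPPED is a hypothetical flag that does not exist today, and per Vlastimil's point this only works for fixed-size objects, here whole 4K pages:

    #include <linux/slab.h>

    static struct kmem_cache *unmapped_page_cache;

    static int __init unmapped_cache_init(void)
    {
            /* a cache of 4K objects whose 2M backing pages would stay
             * out of the direct map (SLAB_UNMAPPED is hypothetical) */
            unmapped_page_cache = kmem_cache_create("unmapped-4k",
                                                    PAGE_SIZE, PAGE_SIZE,
                                                    SLAB_UNMAPPED, NULL);
            return unmapped_page_cache ? 0 : -ENOMEM;
    }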
Well, I believe it would be possible; I don't see why not. Vlastimil probably knows, but I don't. I think it's time, right?

Yeah, I think that's the time. Thank you very much. So we at least have an agreement that we need the cache; we just don't know how to do it.