All right, let's get started. Yeah, you want to come here? Okay. So today Yang and I are going to talk about flexible orders for anonymous pages, or anon folios. At the moment we only support two orders for anon folios: order 0 and the PMD order. But recent CPUs support transparent TLB coalescing — basically order 3 for AMD and order 4 for ARM64. And... I guess that's it, right? All right, so I can go ahead.

Hi, sorry, this is Ryan Roberts. Just a very minor point: ARM64 actually supports two separate things. The hardware page-aggregation feature is order 2, and then it supports order 4 for its contiguous bit, which is an explicit hint in the page-table entry.

The first one is not transparent, and that one is order 4, right?

Let me fix that: the one that's completely transparent to software, called hardware page aggregation, is order 2, and the contiguous bit, which is not transparent to software, is order 4.

Yes, yes, thank you. I just fixed this — so order 2 for HPA, which is transparent.

The motivation here: order 0 is not scalable, so we have a number of problems. The first one is page faults — Matthew just talked about scalable page faults, so I guess that one is obvious to everyone. The second one is the length of the LRU lists: basically, on a phone, if you have four or eight gigabytes of memory, you get one or two million 4K pages on the LRU lists. And the third problem is TLB flush cost: on x86 we have to do IPIs, and that means for each 4K page we have to send one IPI to flush the remote CPUs' TLBs. And of course larger base page sizes would have better performance, at least for some workloads, right?
The problem is that, first, those larger base pages can have high internal fragmentation, because user-space allocators might not be aware of the base page size — glibc's malloc, say, might just assume that the base page size is 4K. Also, larger base page sizes usually require user-space changes: linkers need to align code sections. For 64K, the linkers would need to make sure that different code sections are 64K-aligned, and they don't want to do this with 4K pages because that would waste space. So that means we would generally have to recompile all user-space applications, and on the kernel side we need a separate kernel binary. A larger base page size is usually an entirely separate distro, including the user space.

So mid-size folios can be a sweet spot for anon folios, because anon folios can handle internal fragmentation — we have one access bit for each base page, which is 4K. Also, it's transparent to user space, and it's easier to allocate than the PMD order when there is high external fragmentation. Mid-size folios can also leverage hardware TLB coalescing: if you have orders that match or are larger than the TLB-coalescing orders, then the hardware will transparently use fewer TLB entries.

So we have a to-do list here. Yang, do you want to take over from here?

Okay, sure. The first item is the allocation policy — let me see how I do with time. We want to find suitable VMAs so that we can decide, before or after, what kind of order we want to use for those VMAs. It's basically a chicken-and-egg problem: if you have a predetermined order, then you can say, okay, those VMAs are not suitable, right?
But if you look at the VMAs first and then try to decide the order, that would probably be more flexible, but there would also be more complexity there. Also, do we want to use a single large order for each VMA, or do we want to use different orders for different VMAs? With different orders for each VMA, we're going to have an NP-hard problem — this in general is a bin-packing problem, right, which is NP-hard. And then fallback orders: if we cannot allocate the ideal order, how do we handle fallback? We could probably start with the PMD order, then try the hardware TLB-coalescing order, and then eventually fall back to order 0.

It doesn't need to be a bin-packing problem. I mean, in the general case, yes, it would be the NP-hard bin-packing problem, but certainly for the page-cache version of this it is not. I make the simplifying assumption that a folio must be aligned to its size, logically as well as physically and virtually. So if you are allocating a folio of order 4, it can't start at 15, 14, or 13 — it's got to start at 0, 16, 32. With that sufficiently simplifying assumption, it is no longer a bin-packing problem.

So do we go from order 0 and then ladder up, or do we go from a large order and then fall back to smaller orders?

For anon, I would defer to somebody who understands anon memory. For files, we actually start out small and increase as readahead indicates that we're doing a good job with the sizes we're currently asking for — we keep going up. I don't know that you need to do that. Where I do handle the part that you're talking about here is when there's already a folio in the page cache that overlaps at least part of what I'm trying to fill in: I allocate the largest possible folio that doesn't overlap any existing folios. And again, that's not a bin-packing problem.
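The alignment rule described above — a folio of order n must start at an index that is a multiple of 2^n — means the largest order usable at a given index is bounded by the index's trailing zero bits. A minimal sketch of that computation (the function name is illustrative, not a kernel helper):

```c
#include <assert.h>

/*
 * Sketch of the "aligned to its size" rule: a folio of order n must
 * start at an index that is a multiple of 1 << n, so the largest order
 * usable at a given index is limited by the index's trailing zero bits.
 * Illustrative only -- not the kernel's actual helper.
 */
static int largest_aligned_order(unsigned long index, int max_order)
{
	int order = 0;

	if (index == 0)
		return max_order;	/* index 0 is aligned to any order */
	while (order < max_order && !(index & (1UL << order)))
		order++;
	return order;
}
```

So index 16 can host up to an order-4 folio, while index 13 can only take an order-0 page — which is why the search becomes linear or binary rather than bin packing.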
That's a binary-search problem, or even a linear search, which is what I end up doing.

This is Ryan Roberts from Arm again. I just wanted to make a couple of points. Firstly, a general point: I've also been working on this area, I guess completely independently, and have an RFC out on the list which implements a bunch of this. So it'd be great if we could maybe collaborate and work together going forward on some of it. And then the more specific point: I think what Matthew just described is exactly the way I've looked at doing it so far, and it's looked to be relatively low-overhead and efficient for making these decisions.

So is the heuristic adaptive? What kind of behavior would we expect under memory pressure? Because if a system is not under memory pressure, I guess it's not a problem — we can always allocate larger orders. But under memory pressure, how would this heuristic behave? That's my central question here for this policy.

I was going to say: I think if we assume that if we're not allocating PMD-sized chunks, then we're doing PTE mappings, and probably we're at a fairly low order to begin with — maybe order 4, something like that. I would suggest the way to do it is to determine the highest order that you think you can allocate, then attempt to allocate that, and if the allocation fails because of memory pressure, start reducing your order back towards zero, until you end up with just an order-0 page — the old path, as it were.

On memory pressure, the other story is how to reclaim from the larger folios. If reclaiming requires some allocation of metadata, that can also be a problem, because we are trying to free some memory, but then by freeing memory we actually have to allocate more memory, right? So, yeah, we're going to touch on this later — okay, hopefully it'll be clear later.

Yeah, one more question about the fallback policy. We could start with the largest order — for example, maybe the PMD order — but if the allocation fails, we have to fall back to a lower order. So shall we try, for example, from order 9 all the way down to order 0, or skip some orders? I think it's a trade-off between the performance and the complexity.

I mean, assuming you're using the right kind of GFP flags — you know, "don't try terribly hard to start with, don't go into direct reclaim, just give me memory if you have it" — it's pretty quick to go into the page allocator and ask. But I actually think there's an ask we haven't made to the page-allocator people, which is: "give me memory up to this order in size", and it returns the highest-order page that it can, without us needing to go through this loop. I don't quite know what the API for that should look like, but I think it's something we can reasonably ask for.

Apparently there's another GFP flag for that. Or you could simply have a new function that does that for you, so you don't have to repeat the pattern all over again. The DMA-BUF system heap actually does this, where it tries to match the order it allocates with the IOMMU's order sizes. So that's one place you can look where our code does this today.

Nice, thank you. Let's move on. We also need to handle overlaps: if two threads are concurrently trying to fault in pages in adjacent areas, they might both try to allocate large folios, and those two large folios might just overlap. We need to handle this case — it should be relatively simple, but it is new.

Okay, so you handle that in the page-fault handler, right?
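Going back to the fallback ladder discussed a moment ago — try the highest order with lightweight GFP flags that avoid direct reclaim, then step down to order 0 — a minimal sketch of the loop. `try_alloc_folio()` and the candidate order list are hypothetical stand-ins, not kernel API; the simulated allocator succeeds only at orders the system can currently satisfy:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cheap, no-direct-reclaim allocation
 * attempt at one order; succeeds only up to lowest_free_order. */
static void *try_alloc_folio(int order, int lowest_free_order)
{
	static char page;
	return order <= lowest_free_order ? &page : NULL;
}

/*
 * Sketch of the fallback ladder: PMD order (9), then a hardware
 * TLB-coalescing order (4, as for the arm64 contiguous bit), then 0.
 * Returns the order actually obtained, or -1 if even order 0 failed.
 */
static int alloc_with_fallback(int lowest_free_order)
{
	static const int orders[] = { 9, 4, 0 };

	for (size_t i = 0; i < sizeof(orders) / sizeof(orders[0]); i++) {
		if (try_alloc_folio(orders[i], lowest_free_order))
			return orders[i];
	}
	return -1;
}
```

A "give me memory up to this order" allocator API, as suggested, would replace this loop with a single call.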
And I guess the simplest way would be that the thread that fails to map its folio just goes back, frees its folio, and tries to allocate a smaller one. It might not be the most efficient way to solve this problem, but I guess it's quite simple, right?

I have one question here. Normally, today, we have the allocation and then the fallback kind of policy — for example, go for the huge page and fall back to the small page — but then there is the background khugepaged, or something else, doing the compaction and replacing. So with these multiple orders, do we expect, or do we want, that fallback in the page-fault path for very fast failures, but then the background doing more work in replacing — not just the huge page but also the higher-order or mid-order pages as well?

You mean a collapse capability? Yeah, I think we're going to talk about that later, on the next slide about collapsing. Is that your question? Yeah, oh, okay. Oh, nice. Thank you.

Actually, concretely: when one thread loses the race with the other thread, then after the other thread finishes the page fault, this thread takes the page-table lock and has to check all the PTEs in that area. If it finds that at least one PTE is not none — is present — it has to bail out there: release the PTL and just bail out. I think it should be handled like that.

I mean, it's not different from today, when two threads fault on the same page in the same page table — it's just that the range might be bigger. Yeah, but in the current fault code you just check one PTE; for this you check a range of PTEs. That's good, thank you.

So I guess we can all agree that this is a trade-off between performance and simplicity, right? Bottom line. Okay.

So, refcount and mapcount. We can probably — I think we should — reuse the PMD-mapped THP scheme, unless, you know, we want to invent something entirely new for this work.
I guess, you know — no. That won't work; we need a new scheme. I've sent several emails to the mailing list over the last year or so outlining various schemes. I don't want to take up too much of this. Okay, yeah, we probably need a new scheme. All right, sounds good to me.

Also, the compound mapcount — I guess that should always be minus one? Yeah, for now. Okay. Let's say this needs to be reviewed — TBD, this one.

So, vmstats. Currently we have AnonHugePages. Do we need a new counter, or multiple new counters, or could we just reuse AnonHugePages? I guess it's up to how we define "huge".

Yeah — user space expects that to mean PMD-sized, and I maintained that when I was doing the file huge-pages equivalent. So you should do the same: there's folio_test_pmd_mappable(), and that's the test you should use.

Okay. So a folio that is 64K — do I count it sixteen times, as sixteen single pages? You don't count it at all in the huge count.

Without a new statistics counter, how will users know that there is some optimization in place with the order-2 or order-4 pages? Because the program runs faster. Yes, I understand, but how are they going to debug why their program is *not* running faster? Somebody suggested smaps, and another suggestion in the room is tracepoints.

Maybe the other way is that we could add one more counter for each order? No — yeah, I would say no, too. Okay.

A PMD-mapped folio is special; the others count as base pages. Because if we support, say, ten more orders, then we get ten more counters for anon, then probably double that for file, and if we've got a new page type, we triple that. And then that is the question — I just wonder whether we need this. Okay.

Reclaim. So basically, how do we detect internal fragmentation? Each PTE has an access bit, and based on the aggregation of all the access bits from a large folio, we can determine its utilization — not just, you know, used or not used in terms of hotness and coldness. Then, based on this, we need another heuristic to decide when to split a large folio — when it's partially cold, or underutilized.

With the hardware TLB coalescing, do you still get the separate PTE bits? Yes — okay, yeah, that's the beauty of it.

So what about architectures that don't have this hardware page coalescing — what would be the order they look at? I mean, there is a benefit from the reduced number of pages on the LRU. How do we figure out the order that helps for them? We know about order 0 and the PMD order; in between, we don't know.

The answer is that we simply don't. We shouldn't be changing how we manage memory based on what the architecture can do, because the larger a chunk of memory we can manage, the shorter our LRU lists are. It is up to the architectures to accommodate us on this; it's not up to us to accommodate them. Thank you.

So I kind of have an open question, or a point I want to make. Part of what I'm trying to do is enable the contiguous-PTE option on arm64, which is essentially an order-4 mapping — so sixteen pages occupy one TLB entry. But with that, you only get one access bit and one dirty bit for the whole 64K chunk. I would personally like to see if we can get to a solution where we can get by with that — one access and one dirty bit for the whole 64K chunk — rather than needing access and dirty bits for each 4K subpage. I don't know whether that's going to cause lots of issues with our ability to determine reclaim, and therefore add to memory pressure.
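The utilization heuristic described above — aggregate the per-4K access bits of a large folio and split when the folio is underutilized — might look roughly like this. The bitmap representation and the 50% threshold are illustrative assumptions, not kernel policy, and the sketch assumes per-subpage bits (the very thing the contiguous-bit question puts in doubt):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch: bit i of 'accessed' stands in for the access bit of subpage i
 * of a folio with nr_pages subpages.  Utilization is the fraction of
 * accessed subpages; split when it falls below a threshold.  The 50%
 * threshold is an illustrative assumption, not a kernel policy.
 */
static int folio_utilization_pct(unsigned long accessed, int nr_pages)
{
	int hot = 0;

	for (int i = 0; i < nr_pages; i++)
		hot += (accessed >> i) & 1;
	return hot * 100 / nr_pages;
}

static bool should_split(unsigned long accessed, int nr_pages)
{
	return folio_utilization_pct(accessed, nr_pages) < 50;
}
```

With only one access bit per 64K chunk, this degenerates to all-or-nothing, which is exactly the reclaim-granularity concern raised above.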
I'd be interested to hear thoughts from the room. Sounds good, thank you. Nobody in the room has any suggestions at this point.

All right, so: swapping large folios. Yes. ETA: when someone does the work. All right.

Moving on: compaction. We need to support compound folios smaller than the pageblock order. I think this should be easy to fix. Basically, right now, if compaction hits a compound folio, it just skips it. That worked fine previously, because compound folios were usually larger than pageblocks. But now we have compound folios that are smaller than pageblocks, so we cannot just assume we can skip this compound folio and move on to the next — we should look at the actual size of the compound folio to determine whether to skip it.

One question regarding compaction: shouldn't this already be a problem with file-backed folios? Does page migration work there, or is it also broken? You place them in ZONE_MOVABLE and it's broken — meaning we should disable it in any distribution kernel? So what's the deal? I didn't give you a way to disable it. We can open a bug, and somebody... Yes. Yeah, yeah.

So, collapsing. First of all we need to find suitable VMAs, and then we need to determine large folio sizes, so that khugepaged won't work against page reclaim. Under memory pressure, on one side page reclaim tries to split large folios and reclaim them, while on the other side khugepaged tries to merge them — to collapse them. That would be a bad situation. We might also be able to do in-place collapsing, which means we don't have to migrate pages. It's possible, but I guess it'd be a low-priority task here.

Split: Zi Yan from NVIDIA has posted a patchset for splitting into arbitrary orders. We might take advantage of that patchset as well, right?
We can, you know, split from order 8 to order 4, probably, and we don't have to stick with just the PMD order and order 0.

Also, copy-on-write... If we have one PTE for each 4K page, why would we split? Sorry, I didn't get the question. If there is one PTE entry for each page — I understand that if there is a PMD entry, splitting is needed, a PMD-level split. But for, say, an order-2 page, we would have one PTE entry for each page in the higher-order folio. When would we split? Why would we split?

Actually, there are two kinds of split. The first one is splitting the PMD, which converts a PMD into a PTE table; the other one is splitting the huge page itself. We are talking about splitting a huge page. So when you do a partial munmap of a contiguous range, for that case you have to split the THP, I think. And for reclaim, if we detect internal fragmentation, we want to split the THP into smaller orders. So we have to split when a process partially unmaps a large folio — it doesn't matter whether it's PMD-mapped or mapped by multiple PTEs. Likewise madvise — maybe MADV_DONTNEED — or mlocking part of the range: we have to split, otherwise it doesn't make sense. And memory failure — yeah, in a lot of cases you have to split. Okay.

So I guess I may have missed some other places that assume the PMD order; we need to fix those places as well. And yeah, we are running out of time. Any questions? Three minutes. Oh, nice. Thank you. More questions?

I was just going to make the point that there are some places in madvise that make assumptions about PMD size as well, which would currently be broken — MADV_COLLAPSE, for instance — and if you're doing a DONTNEED, I think at the moment a non-PMD-sized large folio would end up getting skipped and not freed, or whatever DONTNEED is actually supposed to do. Okay.
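The split-triggering cases above (partial munmap, MADV_DONTNEED over part of a range, partial mlock) share one underlying test: does the operation cover the whole folio, or only part of it? A minimal sketch of that decision, with offsets counted in base pages within the folio (names are illustrative, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch: an operation on [start, start + len) base pages of a folio
 * with folio_pages subpages forces a split only when it covers part of
 * the folio rather than all of it.  Illustrative only.
 */
static bool range_needs_split(int folio_pages, int start, int len)
{
	return !(start == 0 && len == folio_pages);
}
```

A full-folio unmap can just free the folio whole; anything smaller is what Zi Yan's arbitrary-order split would serve.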
I think we'll just look into the code to see where we need to fix things. Nice. More questions? Ryan Roberts from Arm has an RFC patch set on this work, so please take a look and comment on that patch set. Right. All right, thank you.