We wanted to talk about various scalability issues related to the mmap lock. First, I wanted to point out that there are really many different things people mean when they mention mmap lock performance. The first level is at the CPU level: cache line contention when you actually have to take the mmap lock. That's what you see, for example, if you have a lot of page faults taking the mmap lock for read, and they each have to contend on the cache line that holds the mmap lock. That's one aspect of performance people see with the mmap lock. The second level of performance issue people see is completely different. It's when they have a process with many threads, and these threads do different operations within that process: big mmaps, big munmaps, page faults that actually hit disk, things like that. These operations by different threads in the same process end up blocking on each other. Usually these are actually false conflicts: things that really could proceed in parallel, with each thread working on its own little part of the address space in that process, but the kernel doesn't know that and serializes everything on the mmap lock. You end up with threads potentially blocking for long periods of time if you have big mmaps, big munmaps, or slow disks that you hit with page faults. The third type of issue people see is with things that access an external process's address space, usually through procfs; usually it's system monitoring software. Maybe you just want to run ps and see the command lines of processes, and a command line is stored in the memory map of its process.
You can also end up blocking on the mmap lock that way, and your monitoring process might be running at a fairly low priority, and while it holds the mmap lock it might be blocking the very software it's supposed to monitor. Those are the sorts of issues that have been seen both at Google and at Facebook; everyone's hitting this and has various sorts of dirty workarounds for it. So there are really different aspects of mmap lock performance that different people see and care about, and there's no real unified view of what it means to talk about mmap lock performance. We had a big discussion about that three years ago; I think there was a really big one in Puerto Rico in 2019, and at that time the main approach being considered was SPF, Laurent's speculative page fault patchset. After that, I started working on range locking, trying to address the issue of false conflicts within a single process. I'm not going to go very deep into that; I just want to mention the broad lines. I split the mmap lock into two levels: one low-level lock just protecting data structures like the rbtree, a few per-process counters, and a few things like that, which would only be held for really short periods of time. And then anything that actually needs to block, anything that might be held for a longer period of time, would have to be range-aware and know exactly which range of memory it is reserving for read or write. I was able to get that working in a way that lets you convert one code path at a time, incrementally, but it was experimental; I only did mmap, munmap and page faults.
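The range-aware locking just described can be pictured with a toy reservation table. This is a minimal sketch: all names, the fixed-size table, and the try/fail behavior (a real implementation would block on conflict) are invented for illustration, not Michel's actual patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy sketch of range-aware locking: lock requests carry the
 * [start, end) range of address space they cover, non-overlapping
 * requests are granted in parallel, and overlapping ones conflict. */

#define MAX_RANGES 16

struct range { unsigned long start, end; bool in_use; };

static struct range table[MAX_RANGES];

static bool overlaps(unsigned long s1, unsigned long e1,
		     unsigned long s2, unsigned long e2)
{
	return s1 < e2 && s2 < e1;
}

/* Try to reserve [start, end); true if granted, false on conflict. */
static bool range_lock_try(unsigned long start, unsigned long end)
{
	for (int i = 0; i < MAX_RANGES; i++)
		if (table[i].in_use &&
		    overlaps(start, end, table[i].start, table[i].end))
			return false;
	for (int i = 0; i < MAX_RANGES; i++) {
		if (!table[i].in_use) {
			table[i] = (struct range){ start, end, true };
			return true;
		}
	}
	return false;	/* table full */
}

static void range_unlock(unsigned long start, unsigned long end)
{
	for (int i = 0; i < MAX_RANGES; i++)
		if (table[i].in_use &&
		    table[i].start == start && table[i].end == end)
			table[i].in_use = false;
}
```

Two threads faulting on disjoint ranges would both be granted here; only a genuine overlap conflicts, which is exactly the false-conflict elimination the talk is after.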
The issue with that was the cost in the page fault path: every page fault paid the cost of updating the shared data structure that says which ranges you currently hold locked for read, and that data structure update was expensive. I kind of left it at that, and for now I haven't pursued it any further. So after that I started looking at SPF again, which I will talk about later. SPF is one of the approaches that might be interesting because it helps with the first two levels of performance issue that people care about. It can help with bouncing the cache line that holds the mmap lock, because if you don't take the lock at all for page faults then you won't bounce that cache line, and it can help avoid blocking in some cases, if the blocking was caused by page faults that held the lock for a long time. Yes? I like the idea of not having to deal with the reader cost of the cache line, because you end up dirtying it; as you're saying, with RCU that doesn't happen anymore. I'm just wondering, is that really such a big problem? Because from my angle, I've always seen the biggest problem with the mmap lock being multi-threaded applications. And I wonder whether there aren't other things in the fault path dirtying cache lines anyway, so you get rid of one problem but still have the same problem because other data is still being written. I have not seen that. I have seen that I was able to get some performance improvements just doing a bunch of page faults on a system with multiple sockets with SPF. But generally I will agree with you that the more interesting case is multi-threaded processes doing many things within the same process and actually blocking.
But what I will say is that when you do something like that and try to benchmark it, a lot of the benchmarks we have in mmtests really measure how much cache line contention we have much more than the thing we're actually interested in. All right, that was the general introduction I wanted to make; after that I'm going to leave it to you, Liam. Okay, is this on? Yeah? So a parallel approach that was taken was looking at a specific data structure to handle the VMAs on its own. Matthew and I have been working on this for a while now, and it's pretty close to merge: it's the maple tree. Most people have probably heard of it, or at least seen it on the mailing lists. It's a B-tree variant. The biggest difference from other implementations is that it splits top-down rather than bottom-up, and inserts are more involved: when you store one thing, you could be inserting just one entry, or you could be overwriting a portion of an existing entry, so a single store can produce one to three entries. So things get a bit complicated internally. What's nice about it is that it takes the complications out of the MM and into the data structure, and the data structure is useful in other places; we already have someone else trying to use it and finding it very interesting, Dave Howells. Maybe in this room, maybe not. Oh, he's in the file system track. We're trying to make this data structure a simple interface: users just want to store their data. We find a lot of places in the kernel where people use either the rbtree with things bolted on to speed it up, which is what the MM does, or the interval tree for non-overlapping ranges, or linked lists, and they use those because there really isn't a better alternative. So let's not do that, right? Or both, yes.
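The one-to-three-entries behavior of a range store can be illustrated with a toy helper. This is an invented sketch of the semantics only, not the maple tree's real code: overwriting the middle of one entry leaves a left remainder, the new range, and a right remainder.

```c
#include <assert.h>

/* Toy illustration of storing a range over an existing entry. */

struct ent { unsigned long start, end; int val; };

/* Store [s, e) = v over the existing entry `old`; the resulting
 * entries go into out[], and the count (1 to 3) is returned. */
static int range_store(struct ent old, unsigned long s, unsigned long e,
		       int v, struct ent out[3])
{
	int n = 0;

	if (old.start < s)			/* left remainder survives */
		out[n++] = (struct ent){ old.start, s, old.val };
	out[n++] = (struct ent){ s, e, v };	/* the new range itself */
	if (e < old.end)			/* right remainder survives */
		out[n++] = (struct ent){ e, old.end, old.val };
	return n;
}
```

An mprotect() on the middle of a VMA is exactly this worst case: one store turning into three entries, which is why the internal node handling gets complicated.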
So the maple tree kills the rbtree usage for VMAs, the linked list usage, and the VMA cache. We're currently at pretty much the same performance across the board. Where we want to go is to use the RCU part of this tree so we can do lockless lookups. Basically, the idea is that you look up without a lock, and if anyone is pending a change to what you're looking at, you restart from the top. We're trying to figure out exactly what that looks like for all the different locks and all the ways things get complicated down at the lower levels: pre-allocations because of fs reclaim, and all the wonderful stuff that came up in previous talks. It's interesting that everyone's hitting the same problems and no one really has a great answer. I think the best answer is to look for better data structures, to be honest. So this is our data structure, and I think Matthew has a few slides about the locking problems and how messy they get that he can share for this RCU future. If you want to come up. Miracle, it works. All right, so just to elaborate on what Liam's saying: what we're proposing for the next merge window, what Andrew is working hard to get merged and keep merged, is the maple tree for storing VMAs, but without using any of the RCU functionality. Everything is still protected by the mmap lock. Everything we're talking about in terms of RCU is future work, future development; we haven't even finished writing the code yet. I mean, we've had a couple of goes at it, but there are some problems. And it's a big win... well, it's not a big win; it's a win in terms of code complexity, because we're moving complexity out of the MM: we're getting rid of the VMA cache, we're getting rid of the doubly linked list that connects all the VMAs together in a nice long chain, and we're getting rid of the rbtree, or at least its usage there.
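The restart-from-the-top lookup just described can be sketched as a seqcount-style retry loop. This is a single-word, invented stand-in for the tree, assuming a writer bumps a counter to odd while modifying and back to even when done; it is not the maple tree's actual RCU code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Seqcount-style retry: readers proceed without a lock and restart
 * if a writer was active (odd count) or the count changed. */

static atomic_uint seq;
static unsigned long tree_data;	/* stand-in for the tree contents */

static void writer_update(unsigned long v)
{
	atomic_fetch_add(&seq, 1);	/* odd: modification in progress */
	tree_data = v;
	atomic_fetch_add(&seq, 1);	/* even: stable again */
}

static unsigned long lockless_lookup(void)
{
	unsigned int begin;
	unsigned long v;

	do {
		do {			/* wait out an active writer */
			begin = atomic_load(&seq);
		} while (begin & 1);
		v = tree_data;
	} while (atomic_load(&seq) != begin);	/* changed? retry */
	return v;
}
```

The reader never dirties the shared cache line, which is the whole point: only writers pay, readers just retry on the rare conflict.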
So what I wanted to talk about is the future: how do we see this going forward? What I drew up quickly during the last talk is how a read fault works today, and I took x86 as the example, because why not. In the x86 do_user_addr_fault() function, we take the mmap read lock and then call find_vma(). The whole way after this, we expect the VMA to stay stable because we're holding the mmap read lock; if you're going to change a VMA you have to take the mmap write lock, so we're guaranteed it stays stable. We pass the VMA down, and the problematic bit is when we get into __handle_mm_fault() and call p4d_alloc(), pud_alloc() and pmd_alloc(). And I'm really glad David went first, because he did a fantastic job with those slides explaining what all these acronyms mean. The problem from an RCU point of view is that those use GFP_KERNEL allocations. So we might end up doing page reclaim, we might end up sleeping, waiting for page writeback to happen and memory to become available, and it's all a giant mess, because obviously you can't sleep while holding the RCU read lock. It's actually okay after that; the only bit really causing me heartburn right now is these three GFP_KERNEL allocations, and to add insult to injury, most of the time you don't even do them, because the page tables already exist. But of course there's the possibility that you've never touched anything in this P4D before and you've got to allocate all three levels. So, Paul, I'm so glad you're in the audience. I think there's another mic floating around. I've proposed a couple of ideas to Paul and there were some trenchant questions I didn't quite get around to answering. One question is: if you had something like SRCU that did not have the read-side full memory barriers, would that make things easier? Oh, I wish Laurent were here... Somebody can find him. Oh, sorry.
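The read-fault path just described can be modeled with counters. This toy is not kernel code; it just shows the shape of the problem: the three page-table levels are allocated lazily with sleeping GFP_KERNEL-style allocations, and most faults find them already present.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: take the mmap read lock, find the VMA, then lazily
 * allocate the p4d/pud/pmd levels.  Counters stand in for the work. */

static int mmap_read_lock_count;	/* times the read lock was taken */
static int sleeping_allocs;		/* GFP_KERNEL-style allocations  */

static bool p4d_present, pud_present, pmd_present;

static void alloc_level(bool *present)
{
	if (!*present) {		/* usually already there */
		sleeping_allocs++;	/* may reclaim/sleep: bad under RCU */
		*present = true;
	}
}

static void handle_read_fault(void)
{
	mmap_read_lock_count++;		/* mmap_read_lock() */
	/* find_vma(): VMA stays stable while the read lock is held */
	alloc_level(&p4d_present);
	alloc_level(&pud_present);
	alloc_level(&pmd_present);
	/* ... install the PTE, then mmap_read_unlock() */
}
```

The second fault does no sleeping allocation at all, which is why the sleeping case is so frustrating: it poisons an RCU-only fast path that would almost always succeed.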
It's only, you know, 10 or 11 o'clock there, right? Well, it's nine hours' difference; it's like 11 o'clock, right? Yeah, he'll be away. Okay, all right. Well, anyway, SRCU has been tried for this before, for SPF: there was an SRCU variant of it, and there were performance problems. Now, this was a few years ago, so maybe those performance problems are fixed now. The read-side performance problems would still be there. The difference is that I might have a way of letting people choose between having the read side be slow, which is the current choice you get, period, and having the write side, the grace periods, be a little more contorted; but I would want a user before I did that. So my question is: if you had something like SRCU, where you still have the srcu_struct, but done so that, I mean, you're going to have a preempt_enable in there in the read side, so you're going to have some tests, maybe in both of them, maybe not, I don't know, I'd have to go through it, but there would not be a full memory barrier. If you had that, would that get you to the point where you could just throw the critical section around the whole mess and not worry about it? And that's actually to both of you, Matthew and Michel, for that matter. I don't know; I think more thought is probably required. My instincts say that it should actually be fine, but recent bugs have indicated that I do not understand memory barriers in any functional manner. Nobody does. Nobody does. I mean, I think there are cases right now where you would hold the mmap read lock for a long time. So in this case it might be fine, but the question is, how long could the grace period be then? Well, that's up to you; it's your srcu_struct. No, seriously, that's the reason SRCU has multiple domains.
With normal RCU, if anybody anywhere in the kernel decides they want to hang out in the read side for 100 milliseconds, that affects everybody, all right? The thing about SRCU is, I mean, you could even have one per process if you wanted to, okay? Well, that's interesting. Yeah, we could embed it in the mm_struct. Yeah, the mm_struct needs some extra size, right? I mean, it's too lightweight; we need to get it over two kilobytes or something. But, you know, you could also just have a global one for all of them, depending on what the readers are doing, right? I mean, if you're having trouble allocating one, you're probably having trouble allocating the other. It's not clear having separate ones is useful, but I don't know the code; who knows? Most faults will be very fast, but there's always going to be one once in a while that hits the allocation cases or hits a slow disk. So most faults take microseconds, and then once in a while there's one that takes 100 milliseconds, and you don't know at the start of the fault which one it's going to be. Right, but is it the case that processes have different backing stores? In other words, if one process is having trouble allocating, are all the other processes too, or is there something with cgroups and namespaces that means some might be fast and some might be slow? Michal's nodding his head, but I'm not sure which way he's answering. Yeah, so essentially you can end up in both memcg reclaim and global reclaim, so no luck there. So we would want per-process ones, right? Well, per mm_struct. Well, and with fs reclaim, wouldn't we have a variable time based on which filesystem we're reclaiming into? Say again? I was just too busy not screaming. So fs reclaim is really a wild card in itself, right?
Like, we don't know how long that will take, so if we end up in fs reclaim, all bets are off that we're going to make the deadline. So actually, I mean, I've got a room full of MM people here who can shout at me that I'm stupid and wrong. My inclination is actually to make GFP_KERNEL explicit rather than implicit in all these p4d_alloc, pud_alloc and pmd_alloc cases. And in this path, at least the first time through, we make them GFP_NOWAIT and actually handle the failure to allocate memory immediately. So the intent is that the quick path just does it under RCU, and the quick path is: we already have these levels of the page table in place, or we can allocate them immediately, and the pages are in the page cache. What I haven't shown here, because the slide is full, is that filemap_map_pages() can fail. It can say, well, the page you've asked for is not in the page cache. Or, the page you've asked for is in the page cache, but we're going to have to run readahead to fetch the next batch of pages; so yeah, you can have that page, but you're still going to have to drop the lock and do I/O. And so it will return VM_FAULT_RETRY, or whatever it is, all the way back up. And in that case, we would take the mmap read lock again, right? We can fall into the slow path. That's fine. As long as, in the 99% case, we do the whole thing under the RCU read lock and never touch the mmap lock, we're already winning, right? So who's with me on actually adding a GFP flag to these three levels of page table allocation? That would be a lot of code to change. Less than you'd think; we really don't allocate page tables in that many places. Oh, and if you do this, that makes vmalloc able to operate with flags other than GFP_KERNEL. Yeah, I mean, this hasn't been a technical problem; I believe it was mostly that you've got those pXd_alloc functions for all the architectures, so you would have to touch a lot of architecture code.
Have you seen how big my folio patch sets have been? Yeah, that's what I'm looking at. You know, you're not scaring me, eh, Michal? I'm not trying to scare you, because you are hard to scare, but think of all the people who would have to look at the code and be aware of all the subtleties that might be there, like, I don't know, those contiguous page tables on some architectures and whatnot. So I don't think it's a huge technical problem so much as a lot of work, because for some reason this would have been really, really helpful to have from the very beginning, but that's not the case. Do you need a GFP flag, or can you just go to the slow path whenever there's not a PMD already there? Because I worry about the case where you have a big virtual mapping and you start faulting everything in sequentially: with GFP_NOWAIT you never go into reclaim, right? So you could pretty quickly deplete all available memory without being forced into reclaim. But the PMD is mostly going to be there, right? It's just the first hit that has to allocate it, and then all the subsequent ones don't have to fall back. Yeah, so if you're doing GFP_NOWAIT and you would have to reclaim, you just fail, and the response to that is to go back to do_user_addr_fault(), acquire the mmap read lock, and try again with GFP_KERNEL instead of GFP_NOWAIT. Right, so that's going to force you into the reclaim path unless somebody else did the work for you first. No, then you wouldn't have to add a GFP flag, right? If you just go to the slow path when there's no PMD yet. Don't even try to allocate it in the fast path, not even with NOWAIT, just don't. If it's there, you do the fast path; if not, you fall back. And mostly it's going to be there. That's less code to touch, but... That's really interesting. We should try that. Okay. All right, thanks. Yeah, so where we're going from here...
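Johannes's suggestion above can be sketched as a two-path fault handler. This is a toy with invented names, assuming the variant where the fast path never allocates at all: if a level is missing, the fault falls back to the slow path under the mmap read lock, where GFP_KERNEL (and reclaim) is allowed.

```c
#include <assert.h>
#include <stdbool.h>

/* Fast path only if the page table level already exists; otherwise
 * fall back to the locked slow path, which may reclaim. */

enum path { FAST_RCU, SLOW_LOCKED };

static bool pmd_present_flag;
static int fast_faults, slow_faults;

static enum path fault_once(void)
{
	if (pmd_present_flag) {		/* level in place: stay lockless */
		fast_faults++;
		return FAST_RCU;
	}
	/* missing level: retry under mmap_read_lock(), GFP_KERNEL ok */
	pmd_present_flag = true;
	slow_faults++;
	return SLOW_LOCKED;
}
```

Only the first fault in a region pays the locked path; every subsequent fault finds the PMD present and stays lockless, which is why "mostly it's going to be there" makes this attractive.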
If I can ask Paul a question, actually. If we have that whole thing under RCU, what could happen if reclaim, or any part of that path, depends on RCU in some really awkward way? Because we simply don't know; those reclaim paths are out of MM's hands, so you actually cannot make any assumptions. Is that possible, even remotely? So first off, I'll end up giving you three answers, because we have three different things we're talking about. The first one was a modified SRCU that's fast enough. In that case, it's a different thing off on the side, and so there's no interaction unless you make one. It's your own RCU you've used, so if you put an interaction there, well, okay, you shot yourself in the foot, right? Which happens; I do that a lot myself, so welcome to the club. The next one is if we used GFP_NOWAIT or whatever it was. In that case, you go off into reclaim, and it probably has RCU readers; if it does a call_rcu(), that's fine, it just goes off and that's great. If it does a synchronize_rcu() or a synchronize_rcu_expedited(), that would be a problem. But if that were to happen, lockdep would yell at you really quickly. So if you were taking that approach, my advice would be to do something to force reclaim on that path manually; I'm pretty sure you can do that. You're looking as if you can't, and maybe you can't, but if nothing else, just have something else allocate a whole pile of memory and that'll force it, all right? It can be done. And then if you had a kernel with lockdep enabled and it tried to do a synchronize_rcu() inside an RCU critical section, it would yell at you. Okay, but assuming... and your next thing as well: maybe it happens only sometimes, and yeah, I can only help you so much there, right?
In the other case, where you aren't doing the allocations and you're doing what Johannes was suggesting, then clearly, hopefully, that doesn't force reclaim if you don't allocate, but what do I know? Did that answer things? Yeah, actually yes, because the answer is that this could be really dangerous if any of the callbacks that live outside of the page fault path, so not under the direct control of that code, actually do something that synchronizes RCU, which can be really hard to find. So for example, just to give one: in rcutorture, if it were doing a callback flood that made reclaim happen, and it gets the callback from the OOM handler and then says, I'm going to do an rcu_barrier(), you think that might cause a problem? Yeah. Okay, well, there we have an example for you. You can worry about other... I don't want to stop you from worrying; go ahead and worry about other possibilities while you're at it, is that fair? Yeah. Okay, so we do have an example where it could be difficult. In the sleepable RCU, it's a full barrier, right? In SRCU? It is, but where's the chalkboard when you need one, you know? Okay, so in SRCU as it exists now, yes, there's a full barrier. That has not always been the case; there was a time, long ago, when an SRCU grace period was just three RCU grace periods, which caused trouble, all right? It's possible now to make something in between, where we don't have the memory barriers. Basically, it would be possible to have a thing where you say init_srcu_struct() and make it a fast-reader one. And there will be some kind of cost, I don't know exactly what right now, on the update side: if you do a grace period, something is going to take longer. And I like that, because intuitively that's exactly what RCU should be: most users will assume that readers are cheap and writers are expensive.
So the way you're proposing to optimize this flavor of SRCU makes sense for any user, because, hey, readers are going to be cheap, at the cost of writers. I agree, but "any" is a strong word. All right, so I guess the other part of this is the VMA handling. In that case, we were thinking that for this to work, once you've looked up a VMA under RCU, it can't change in a way that means the address you were interested in is no longer in that VMA. And that basically means that instead of resizing VMAs, vma_adjust and split would essentially use new VMAs; the VMAs would be made RCU-safe by being RCU-freed. So that's the other part of this; I don't know if that's a problem for anyone. If the VMA has changed like that, and the old VMA you looked up at the start of the fault is not current anymore, you still need to detect that at the end of the fault, probably before you commit any new mappings to the address space. Yeah, so we were thinking about a flag for that in the VMA flags: inactive. So if you hit a VMA that has the inactive flag, you know something is happening to that VMA, and you keep looping until it's gone. Yeah, basically you abort the page fault and try again, either the same way or with the lock. And that flag on the VMA, I think, would be synchronized by the page table lock, right? So once you've got the page table lock, you know you can check that flag and it's going to be valid for the duration... okay, there's potentially a lot of... no: once you have the page table lock, you can check the flag, because the flag can't change under you while you're holding the lock. The person who set it will then take the page table lock and tear down all the mappings. That's a bit similar to SPF. I mean, I don't know if I want to talk about it right now, but there are a lot of similarities. That's good, because that seems like the sensible way to do it to me, so I'm glad it's also the sensible way to do it to you and to Laurent. So that's good.
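The per-VMA inactive flag scheme can be sketched single-threaded. This toy assumes the ordering described above: a writer sets the flag and then takes the page table lock to tear mappings down, so a reader holding the page table lock sees a stable flag. All names are invented stand-ins, not the actual patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Speculative commit: check the VMA's inactive flag under the page
 * table lock; if set, abort and retry (possibly with the mmap lock). */

struct toy_vma { bool inactive; };

static bool commit_fault(struct toy_vma *vma)
{
	bool ok;

	/* spin_lock(ptl): the flag cannot flip while we hold it,
	 * because the invalidator takes this same lock to tear down */
	ok = !vma->inactive;
	/* ok ? install the PTE : abort the fault and loop */
	/* spin_unlock(ptl) */
	return ok;
}
```

The key design point is that the page table lock doubles as the synchronization for the flag, so no extra lock is needed on the fault's commit path.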
Yeah, I mean, should we compare and contrast our approaches here, or do you want to? Yeah, we could do that. So the way that I see Michel's code is that yours is perhaps separated in time and ours is separated in space. The SPF version of this is: instead of taking the mmap read lock at the top here, you take a sequence count on the mm_struct. And so any modification to any VMA while you're doing the rest of this will be checked for right before you do the insertion. If the sequence count has changed, then you know somebody has changed something somewhere in the mm_struct, and that might be the VMA we have a handle on; so we abort, go back to the top, take the read lock, and do the whole thing again, actually protected by a lock. Whereas what Liam and I are doing is separated in the process address space: the inactive flag is per VMA, and there is no sequence lock on the mm_struct. It's simply done by checking the VMA you were looking at to see whether or not it's being killed by an munmap operation or an mprotect operation or something. So you're still going to see false retries with our approach, right? Because if somebody's called mprotect on a giant VMA, that would have caused the VMA to be split, and now you have to replace it with two, maybe three, new VMAs. So there are going to be unnecessary retries with both approaches, but hopefully fewer with ours. Yeah, I would think so; being per VMA just means the modification has to hit that one area, right? Yeah, it's going to be less. Yeah, I think there are a few places you have to be careful, not just at the commit at the end.
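The separated-in-time versus separated-in-space contrast can be made concrete with a toy. This is an invented, single-threaded illustration only: an mm-wide sequence count retries every concurrent fault after any modification, while a per-VMA flag retries only faults on the VMA actually touched.

```c
#include <assert.h>
#include <stdbool.h>

/* Two invalidation granularities for speculative faults. */

static unsigned int mm_seq;			/* SPF-style counter */
static bool vma_a_inactive, vma_b_inactive;	/* per-VMA flags     */

static void modify_vma_a(void)
{
	mm_seq++;		/* invalidates every speculative fault */
	vma_a_inactive = true;	/* invalidates only faults on VMA a    */
}

/* Would a fault that started at start_seq have to retry? */
static bool spf_retry(unsigned int start_seq)
{
	return mm_seq != start_seq;
}

static bool per_vma_retry(const bool *inactive)
{
	return *inactive;
}
```

A fault on VMA b sails through under the per-VMA scheme but is forced to retry under the mm-wide count, which is exactly the "fewer false retries" claim.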
So in handle_mm_fault, when you walk through the existing page tables, you have to be careful that the page tables aren't yanked out from under you, which can be done with RCU, but today is done by disabling interrupts so that the TLB shootdown can't complete, depending on the architecture. So you have to be careful there. Then there's the place where we take the page table lock on the page table that we found: same thing, you have to make sure it's still the right page table at the instant you take that lock, that it hasn't been yanked out from under you. And then all the way at the end, when you're going to commit your page, when you already have the page table lock, you have to make sure your VMA is still the right VMA for what you wanted to do. I think that will be similar between how we do it and SPF. So in SPF we have roughly the same approach, except that each of these three places I mentioned has its own small RCU-protected section, and we don't care about the whole thing being one big RCU block. But I think that's an implementation detail, whether you have one big RCU block or three or four along the page fault path; it's a similar idea. Yeah, thanks, Michel. You're absolutely right; I forgot to mention that one thing we're definitely going to need with this approach is that page tables get freed under RCU protection. That's already the case for some architectures; it's even the case for some x86 configurations. I forget the details, because I looked at it once and ran away screaming, but I'm going to have to get less scared of that. Okay, Michal, it turns out there are some things I'm scared of, and RCU page table freeing is one of them. So I'm kind of hoping somebody else does that, but I'll do it if I have to. And David wants... okay, I was going to give David a mic, but we've got one, thanks. So I mean, I had a look at that whole mess, and page tables are just horrible.
The thing here is, and I think I mentioned this on the SPF series: we don't only need freeing of the page tables under RCU, we have to make sure that any auxiliary data that is glued to the page table also gets freed using RCU. For page tables that is, for example, the page table lock: on some architectures it's embedded in the struct page, on others it's not. And I think we'll get more into that problem domain once we, for example, use dynamic allocation of struct page parts, as you can imagine. Yeah, that one is easier; I mean, maybe it's not so deeply buried down some call chain, but yeah. We actually have that unresolved issue in the current SPF patchset, and it's configuration dependent: it's if you have split page table locks and you have to allocate the spinlock instead of just having it in the struct page. And I think the only legitimate configuration that triggers that is if you have CONFIG_PREEMPT_RT, which causes your spinlocks to be bigger, and then you hit that issue. I think also on 32-bit architectures it might, but I'm not completely sure. Yeah, I'm not entirely sure either, or how we want to handle it. We could definitely write code to also defer the freeing of the split page table locks. Yeah, and that brings me to the point I was trying to get to, which is that the way we currently free page tables is a mess. There is something called a destructor for page tables, and I think we should find a way to defer that whole destruction to the actual freeing. I have no idea how we would do that, but maybe that goes in the same direction as what you proposed. I've worked on that code before; I have thought about doing that, and I just didn't see a particular need to do it that way. You have a use case? Let's do it. I can whip that patch up for you in 15 minutes. Let's do it. Okay, thanks. Easy.
I thought we were going to fight more. Yeah. Sorry. What would be a nice conclusion? Well, first of all, the maple tree stuff I have out now doesn't conflict with either path forward, so that's great. If fs reclaim went away, that would be a great conclusion, but I don't think that's going to happen. So we're going to have to figure out allocations outside the lock, the mmap lock, for certain things. This is still step two, right? And then there are other things that can be done to go further in our grand scheme of the beautiful, sunny, rose-tinted future. Matthew, do you want to...? I mean, there are some interesting problems we've been having around slab pre-allocation. And, oh yeah, we have one of the slab maintainers in the room, this is fantastic. The basic problem is the usual one: I'm holding a spinlock and I need to allocate memory, right? And so you don't want to go into reclaim, et cetera. So what we've been trying to do is pre-allocate at the top, then take the spinlock and go through. The code path, yeah, this is updating the maple tree. Perhaps the worst case is an mprotect in the middle of a VMA, so we need to allocate three new VMAs, and three times the height of the maple tree plus one nodes. So we need quite a lot of memory pre-allocated to be sure we won't need to allocate when we get all the way down to the bottom of the tree and find out we're in the worst case. But of course that is the worst case, right? And generally, we're not going to need it. So what we really want is a very efficient way to have the slab allocator say: from this slab, I want 28 objects, and then a short while later, here are 26 of them back. Mempool? Really, is it? Oh, yeah, that's the classic hack. Yeah, I hate mempool. It's perhaps irrational of me, and perhaps we should just be using mempool.
But I mean, I've gone outside my boundaries and looked into the slab allocator. It's like, why not just give us a detached free list, and then we just pop a couple of things off the top of it, and then hand it back and say, here's your detached free list back. Maybe? He's not saying no. We'll see, okay. We could look into this, all right, yeah, okay. We'll do it. He's not here, he's probably watching on the screen. Oh, there you go. He's yelling at us on IRC right now. Yeah, so I really expected a lot more on this. I don't really have anything else. Did you have anything else? Did you want to talk SPF now? Well, I think we're going to start with SPF soon. Yeah? Yeah, we've done it. I mean, I want to say in general I think we agree on the big direction, that we want to localize the locking, but it's the details; first we keep fighting on the details, but also, whenever we try it, there are always a few things we didn't see coming. I mean, I think it's time we get started actually getting stuff in, because we've been talking about it for a long time. I might have a question regarding that. What scared me a bit, and scared is the wrong word, about the SPF series was that it introduced quite some subtle lockless-versus-locked semantics into a lot of page fault handling code. That scared me a bit. It made the code, in my opinion, significantly harder to read and understand. With the approach that you're proposing, would that also be true, that we would have similarly complicated page fault handlers, or would it feel much more natural, so that the delta for people who don't know their way around the page fault handlers would be smaller? I think the delta would be smaller. I mean, so this is, oh, sorry, I'm not supposed to move away from the podium. What I have up on the slide is the state of today, right?
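The "detached free list" idea mentioned above can be made concrete with a small sketch. This is not an existing slab API; the `slab_detach`/`slab_reattach` names and the fixed pool are hypothetical, invented to illustrate popping a couple of objects lock-free and returning the remainder in one bulk operation.

```c
#include <stddef.h>

/* Hypothetical "detached free list": the slab hands the caller a
 * private list of objects; the caller pops what it needs with no
 * locking, then returns the remainder in a single operation. */

struct obj { struct obj *next; };

struct detached_list { struct obj *head; int count; };

static struct obj pool[32];   /* stands in for a slab page */

static struct detached_list slab_detach(int n)
{
    struct detached_list dl = { NULL, 0 };
    for (int i = 0; i < n; i++) {
        pool[i].next = dl.head;
        dl.head = &pool[i];
        dl.count++;
    }
    return dl;
}

static struct obj *dl_pop(struct detached_list *dl)
{
    struct obj *o = dl->head;
    if (o) {
        dl->head = o->next;
        dl->count--;
    }
    return o;
}

static int slab_reattach(struct detached_list *dl)  /* one bulk return */
{
    int returned = dl->count;
    dl->head = NULL;
    dl->count = 0;
    return returned;
}
```

The attraction over mempool is that detach and reattach are each one interaction with the allocator, however many objects the worst case required.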
What I would change from here is that the mmap read lock would not be taken the first time round. Once you return with FAULT_FLAG_RETRY, we would in fact take the mmap read lock. So it's going to be: if first time round, take the RCU read lock, else take the mmap read lock. So it's not going to be a huge semantic change there. There are a few extra lines of code, and depending exactly how we solve the p4d_alloc, pud_alloc, pmd_alloc thing, a tiny little bit of extra code there. Still, while you walk the page table tree, do you check that it doesn't go away from under you? No, because it's all under the same RCU section. You have to care about that if you have different RCU sections, but I've got one big RCU section. So I can do all of this stuff speculatively and then check that the VMA hasn't changed at the end. RCU won't, you won't have to disable interrupts for that. I mean, it's not true today, but sure. Yeah, but once the page tables are properly being freed via RCU, we won't need to do that stupid interrupt-disabling dance. That might also be true for when we acquire the page table lock, that sort of thing. I mean, we have the same issues in two or three places, and right now I kind of do the little dance every time to make it safe, but sure. But so there's going to be a bit of extra code that's not on this slide where we do the actual insert into the page table, and we'll check the VMA there just to make sure it's not dead. But I see very little change in the file-backed path. And I think about file-backed stuff because fundamentally I'm a file system guy. I don't think about anonymous memory because I'm not really an mm person. Don't lie. I'm not, I'm a file system person at heart. I don't understand these unnamed pages; they make me uncomfortable. Anon is a lot of the same, but it's a lot more likely that you will have to allocate a page. And then at the end, when you have your page, you kind of have to check.
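The "if first time round, take the RCU read lock, else take the mmap read lock" control flow can be sketched as below. This is a hedged simulation, not the actual fault handler: the function names and the flags that toggle the two paths are invented, and the globals only exist so the sketch can show when each lock is held.

```c
#include <stdbool.h>

/* Sketch of the retry scheme described above: attempt the fault under
 * the RCU read lock only; if the speculative path cannot complete,
 * retry under the mmap read lock as today.  Names are hypothetical. */

enum fault_result { DONE, RETRY };

static int rcu_held, mmap_lock_held;   /* demo markers only */

static enum fault_result speculative_fault(bool can_complete)
{
    rcu_held = 1;                  /* rcu_read_lock() */
    enum fault_result r = can_complete ? DONE : RETRY;
    rcu_held = 0;                  /* rcu_read_unlock() */
    return r;
}

static enum fault_result locked_fault(void)
{
    mmap_lock_held = 1;            /* mmap_read_lock() */
    /* ... ordinary fault handling ... */
    mmap_lock_held = 0;            /* mmap_read_unlock() */
    return DONE;
}

/* Returns how many attempts the fault took. */
static int handle_fault(bool speculation_succeeds)
{
    int attempts = 1;
    if (speculative_fault(speculation_succeeds) == RETRY) {
        attempts++;
        locked_fault();            /* second time round: mmap lock */
    }
    return attempts;
}
```

The appeal is that the slow path is exactly today's code, so the delta for readers of the fault handler stays small.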
You still have the right VMA. But that's one of the things where it might not be convenient to have a single RCU section, because you may have to allocate a page, or at least a lot more often than in the file-backed case. I guess something that's going to change a lot of this, for both file-backed and anonymous, is using larger pages. Once we start deciding to allocate even order-4 pages for both files and anonymous, we're going to see pmd_alloc be needed many times more often. And I think that's going to change the whole cost-benefit analysis. Or if it doesn't, it wasn't worth doing. Whether we have one single RCU section or several, I don't think it's such a big deal. We could always terminate the RCU section, allocate pages, whatever, start the next RCU section, and check that the VMA has still not expired, whatever the expired bit is called. You can, because you've got the sequence count and you know the mmap hasn't changed. We can't, because the VMA may have gone away. So if we drop the RCU lock, we have to re-call find_vma. Now it's in a Maple Tree rather than an rbtree, so it's going to be quicker to find, and that may not be a huge performance penalty. But it does mean that I do want to see us at least try to allocate a PMD page before we give up and say, oh, we'll just drop the lock and try again. Yeah, that works with... So the way I do that in SPF is that I actually make a copy of the VMA originally, when I get the VMA. And then I can do my check using the sequence counter; I have a sequence counter that's updated by any write. But that won't work with you looking at this expired-VMA bit. That means you're preallocating too, right? You're making a VMA copy. Oh, okay. Make sure it doesn't get too big. Also, VMAs have pointers, I guess you don't check those, VMAs have things allocated hanging off the VMA itself, right? So if there's anything you need to check in there, don't check it.
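The SPF-style validation just described, copy the VMA up front and re-check a sequence counter bumped by every mmap writer, can be sketched as follows. This is a simplified userspace model; the struct layout and helper names are invented, and real kernel code would use proper seqcount primitives and memory barriers.

```c
#include <stdbool.h>

/* Sketch of the SPF check: snapshot the VMA and the mm's sequence
 * count, do the speculative work, then verify no writer ran in
 * between.  Names are hypothetical; barriers are omitted. */

struct mm { unsigned int seq; };   /* bumped by every mmap writer */
struct vma { unsigned long start, end; struct mm *mm; };

struct vma_snapshot {
    struct vma copy;               /* private copy, safe to read */
    unsigned int seq;              /* sequence count at snapshot time */
};

static struct vma_snapshot vma_snapshot(struct vma *v)
{
    struct vma_snapshot s = { *v, v->mm->seq };
    return s;
}

static bool vma_still_valid(struct vma_snapshot *s)
{
    /* No concurrent write happened since the snapshot was taken. */
    return s->copy.mm->seq == s->seq;
}

static void mmap_write(struct mm *mm)  /* e.g. munmap/mprotect */
{
    mm->seq++;
}
```

The copy is what lets the fault keep working after dropping RCU, at the cost noted above: it's a form of preallocation, and anything allocated off the real VMA is not covered by the copy.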
There's a piece of the VMA, what is the name of it, I don't remember the part that gets cloned. When you clone a VMA, there are certain things that get allocated besides... Oh, no, no, no, I just copy the VMA, just the start and end. I don't know, the reference structure. Yeah, the reference structure. Okay, okay. So don't use them. One thing that I would like to ask, and we discussed that two years ago and probably more in the past: with the Maple Tree, do you think it's still worth considering the range-locking path, or is moving straight to RCU essentially the only reasonable choice? Because, as I read it, given the guarantees the Maple Tree gives you, maybe just getting the lookup to be RCU-aware and doing the rest with range locking tied to the VMA, that's already... We have less data structure to look at, and that might help a lot without too much subtlety. So you want to do the range locking with the Maple Tree as a half-step to RCU lookups. Right, because that can turn out to... That might already show good performance improvements, because you rarely do page faults from different threads on the same VMA. And I mean, the rbtree was terrible for that kind of thing, because you have to do all the rotations when you manipulate stuff, but this should make it so much easier. So have you considered that, or...? Yeah, so I was looking at that. It's a range tree; you could potentially have a lock per node, but it just takes up too much space in the node to do that. And then we started looking at just locking on ranges, and it's a lot of complexity only to turn around and throw it out. So I'm not sure you buy much by just doing the half-step. That's my opinion anyway. So one of the approaches that we've explored, but not written code for, is that we could put, essentially, a read-write semaphore into the VMA.
And then each, I've forgotten all the details of this, because I thought about it a year ago and then went off to work on folios. So when we look up the VMA, we're using the entire VMA as the range lock, essentially. Yeah, that's what I have in mind. And so you would still have contention on the VMA as you acquire the read-write semaphore for read. But then you can drop the RCU read lock at various points, because you've got the VMA for read. Right, right. And you know it's not dead. So that actually solves a bunch of problems, but it does then create contention on that one VMA. What I've been describing is the write-less path. Or at least we're not writing to the VMA struct. We're writing to the page tables, sure, but that's kind of the point of a page fault, to write to the page tables. What I've been describing is a write-less path, and yeah, there is definitely a version of this which is lockless until you get to the VMA. And we could absolutely do that. And I'm perfectly happy for us to iterate towards an end goal, if the community at large is willing to go through all these locking changes over and over again. And maybe we would never get to the write-less stage that I've been describing, because RCU would just be good enough. But I think there are applications that have these giant VMAs, like terabyte-sized VMAs, that are going to say, well, thanks, but you haven't solved my problem. Yeah, that might be a good push for later work. I was just going for the 10-out-of-10 gold-star solution. Yeah, I mean, I'm happy to go for the 8-out-of-10 solution first if people want that, just recognizing that it will be more disruptive eventually, over the long term. I don't know if there's anything else. Contention on the VMA semaphore would have to mean that a parallel operation has taken place.
That VMA could be going away, in which case the fault is racing with the thing just disappearing. So contention on a VMA semaphore wouldn't be as severe as it is on the mmap_sem. Hey, Mel, great to hear from you. Thanks for calling in. Yeah, I mean, you're right, it's not going to be nearly as bad. I just think that for some workloads there are going to be some applications that say, you haven't helped me. The gamble would be that someone creating a very large VMA is likely managing it themselves, and has done it for the express purpose of avoiding mmap_sem. So while there would be applications with terabyte-sized VMAs, chances are they're managing their own memory quite explicitly for the express purpose of avoiding any parallel operations, meaning they're also less likely to see any contention. There will be some cacheline bouncing acquiring it for read, but the level of contention you'd have for a threaded application that is allocating and faulting in its own address space is completely different to what it is on just pinning the VMA itself. Okay, I mean, if people would rather we take that step forward and only later go to this, I'm perfectly happy to work on that. How about you, Liam? Yeah, I mean, when I was looking at the range locking, I was looking at locking each individual layer of the tree as we walk down, but if we're just going to RCU, read-lock, and then lock the particular VMA, then yeah, totally. Thanks, everyone. We already started talking about SPF and there was an introduction, so I'll have to cut it short a little bit. So I just want to present the current state of things.