All right, so we're talking about two problems. The first is high contention on the mmap lock for multithreaded tasks, as every thread that takes a page fault takes the mmap_sem. The other is priority inversion between tasks: a low-priority task is monitoring a high-priority task, and then it blocks and everything goes to hell because the rwsem is fair, which it needs to be, but it does mean that we don't do priority inheritance, and so everything is awful.
So last year, we presented three options to look at. Michel presented speculative page faults. Suren presented RCU lookup and the VMA lock, which is what we actually ended up doing, and I presented full RCU page fault handling, and everyone said, oh my God, that is far too scary, we're not going to do that. So here I am talking about what Suren has done.
Since last year, we replaced the rbtree with the maple tree, and that went into 6.1. Thank you very much, Liam. We are still dealing with the fallout, but when you're doing something big for the first time, it's tricky, okay. And now in 6.4, the per-VMA lock work has actually been merged. If you're running 6.4-rc1 on your laptop, you are running Suren's code. Jon put up this very helpful write-up of exactly how it works, so I am not going to start talking about how it works. I'm just going to assume that you did the reading.
All right, so what is coming up? We have patches posted to the mailing list for handling faults in file-backed VMAs with up-to-date pages in the page cache, done under the same RCU protection that we currently use for anonymous pages. We also have patches for handling faults on pages which happen to be in the swap cache. So again, that's anonymous, so it's just extending the current patches that exist. Right now, if we hit one of these two cases, we fall back to taking the mmap lock.
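As a rough illustration of the fallback described here, the following Python sketch models which lock a fault would be handled under. The fault categories, the function name `lock_for_fault` and the `with_posted_patches` flag are all invented for this sketch; the real decision is spread through the kernel's fault path and is not a table lookup.

```python
from enum import Enum, auto


class Fault(Enum):
    """Hypothetical fault categories, named after the cases in the talk."""
    ANON = auto()             # plain anonymous page
    ANON_SWAP_CACHE = auto()  # anonymous page found in the swap cache
    FILE_UPTODATE = auto()    # file-backed, page cache page is up to date
    USERFAULTFD = auto()      # VMA registered with userfaultfd
    DRIVER = auto()           # VMA owned by a device driver


PER_VMA = "per-VMA lock"
MMAP = "mmap lock"


def lock_for_fault(kind, with_posted_patches=False):
    """Return which lock this toy model handles the fault under."""
    handled = {Fault.ANON}
    if with_posted_patches:
        # The two cases the posted patches are said to add.
        handled |= {Fault.ANON_SWAP_CACHE, Fault.FILE_UPTODATE}
    # Everything else falls back to the mmap lock.
    return PER_VMA if kind in handled else MMAP
```

In this model, `lock_for_fault(Fault.FILE_UPTODATE)` gives the mmap lock, and the same call with `with_posted_patches=True` gives the per-VMA lock, while userfaultfd and driver VMAs fall back either way.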
And if you add these patches, then it expands the number of cases which we handle entirely under RCU plus the per-VMA semaphore protection. But that's just patch review and debugging. What we want to talk about is: should we wait for I/O under the protection of the VMA sem, or should we handle it similarly to how we handle the mmap lock today, where we make sure to drop the mmap lock before we wait for I/O? And a similar question for starting I/O. We would like to talk about reading various files in /proc without the mmap lock. Handling faults in VMAs which use userfaultfd: right now, if you use userfaultfd, we just always fall back to the mmap lock. And then VMAs which are owned by device drivers. We want to talk about removing the mmap lock entirely. And we want to talk about handling faults without the VMA lock. So the thing you all shouted at me about last year, I'm bringing it back. And I have a slide for each of those.
If you look in filemap.c in the page fault handler, you will see what happens if we would need to sleep, that is, we find a page in the page cache and we know it is not up to date. In order to prevent contention on the mmap_sem, we drop the mmap_sem. So we take a reference on the file, drop the mmap_sem, and then we sleep in the page fault handler waiting for the page to come up to date. And then we restart the page fault handling from the top. Should we keep this model?
I think actually the swap and the file cases are not the same here, because in the file case, you can make a good argument that the contention that could happen here would be if someone unmaps the file while they're accessing it. That means they have two threads that do conflicting operations, and it's kind of expected that one will have to wait on the other somehow. So I think you could get away with waiting while you hold the VMA lock in that case, without having to release the VMA lock while you do that.
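The drop-the-lock-and-restart model described for the file case can be sketched in a few lines of Python. This is only a toy: a plain `threading.Lock` stands in for the read side of the lock, taking the file reference is elided, and `Page`, `handle_fault` and `wait_for_io` are names made up for the illustration.

```python
import threading


class Page:
    """Toy page cache page; a real page carries far more state."""

    def __init__(self):
        self.uptodate = False


def handle_fault(lock, page, wait_for_io):
    """Map `page`, returning how many times the fault was restarted."""
    restarts = 0
    while True:
        lock.acquire()            # stands in for taking the lock for read
        if page.uptodate:
            # The page is ready: this is where it would be mapped.
            lock.release()
            return restarts
        # We would have to sleep. The real handler pins the file first
        # (elided here), then drops the lock so that waiting for I/O
        # does not hold up anyone else.
        lock.release()
        wait_for_io(page)         # sleeps with no lock held
        restarts += 1             # restart the fault from the top
```

A caller passes in a `wait_for_io` callable that returns once the page has been marked up to date. The open question from the talk is whether the per-VMA lock needs the same treatment or could simply be held across the wait.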
I think in the anon case, that argument doesn't really work, because VMAs don't really map onto process-visible things in the anon case. And so I see them as really different cases for VMAs. I would like us to have something more granular than per-VMA locks, or to find a way to prevent that sort of false conflict from occurring in the anon case.
So I'm going to restate what you said, because I'm pretty stupid about anon memory, and I want to check that I really understood it. What I think you're saying is that a process can call malloc twice, and the allocations are big, so each one ends up with glibc calling mmap. And those two happen to be combined into the same VMA because we're trying to optimize and save space and memory and time and so on. And therefore, there are conflicts that the application could not reasonably be aware would happen.
Yes.
Okay, good. I appreciate you letting me have this time to learn.
Have you done any tracing, like keeping the mm lock held for this process and running something like lockstat, and looking at the contention that may arise? So instead of just this hand-waving about what you believe, actually run some tests, maybe even in a production environment. I don't know if you want to try it in production, but if you're seriously doing this, there is lock contention tracing that you could enable to see how much contention this causes, to get some real numbers.
Look, we have regular bug reports from an unnamed large database product that tends to care about mmap_sem, because whenever you start aggressive monitoring, like /proc/<pid>/whatever, that really requires mmap_sem, and that can be visible really heavily.
Yeah. Get to the slide where you get rid of it.
That's the last slide. All right.
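The false conflict restated above (two large allocations ending up in one VMA) can be shown with a toy address-space model in Python. The `VMA` class and the `mmap_anon` and `find_vma` helpers are simplified stand-ins written for this sketch. Real merging also depends on flags, permissions and other state, and can join a new range to neighbours on both sides at once, which this model does not attempt.

```python
from dataclasses import dataclass


@dataclass
class VMA:
    start: int
    end: int  # exclusive


def mmap_anon(vmas, start, length):
    """Add an anonymous mapping, extending an adjacent VMA when possible."""
    end = start + length
    for vma in vmas:
        if vma.end == start:    # new range directly follows this VMA
            vma.end = end
            return vma
        if vma.start == end:    # new range directly precedes this VMA
            vma.start = start
            return vma
    vma = VMA(start, end)
    vmas.append(vma)
    return vma


def find_vma(vmas, addr):
    """Return the VMA containing addr, or None."""
    for vma in vmas:
        if vma.start <= addr < vma.end:
            return vma
    return None
```

Two back-to-back calls such as `mmap_anon(vmas, 0x100000, 0x40000)` and `mmap_anon(vmas, 0x140000, 0x40000)` return the same object, so faults in either allocation would contend on the same per-VMA lock even though the application sees two unrelated buffers.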
Should we be willing to start I/O without the mmap lock?
So I think the answer is yes. And it's certainly based on everything that I've just heard about anon being different from file memory. It always sounds like a good idea to start it, because we're not saying you can't sleep during the page fault path while holding the various locks. Of course you can sleep; you can sleep to allocate memory if you absolutely have to. So for starting I/O, I think we're still saying you can call down into device drivers with mm locks held. We're not trying to get rid of those paths, but it's actually causing us problems. We should do our best to drop all the mmap locks before we start calling into device drivers.
Okay, good, moving on. Monitoring without the mmap lock. Michal's problem. Or maybe Facebook's problem, and of course the unnamed database's problem. So this actually is a simple matter of programming, at least for some of them, right? The maps file, we can just do this all under the RCU read lock today, and it is simply a matter of programming. I mean, since the maple tree went in, we've actually been able to do it for like three kernel releases now. It's just that there's not been time to write that code. But for the smaps interface, this gets a bit more complicated, because we actually need to prevent page tables from being freed. So the ways that you prevent page tables from being freed today: you can take the mmap lock. On x86, you have to disable interrupts. On other architectures, it's sufficient to hold the RCU read lock, but on x86 you actually have to disable interrupts, and this just feels like stupid legacy. We should get rid of it and make x86 the same as everybody else and RCU-free the page tables.
So it's not even fully true that disabling interrupts helps, because you do that to block the inter-processor interrupts that flush the TLBs, and that effectively blocks the removal of page tables.
But there are the paravirtualized variants that don't use the IPIs, and those have to use the RCU freeing. The argument that's been made is that disabling interrupts is effectively the same as being in an RCU critical section. So it effectively does both, and it also allows you to make this weird x86 inference that you're blocking IPIs and thus you're blocking the non-RCU page table freeing path, which is completely nuts. I've been hoping, Matthew, that your struct page rework will let us put RCU heads in all of the page tables' struct pages, so we can RCU-free everything everywhere for every architecture, but I don't know if I'm just being wildly optimistic.
So there's already an RCU head in struct page. So I think we can already do this on every architecture. I'm being told it's overloaded by some architectures, and yes, it is, but not at the point where they're being freed, I believe. I think at the point where they're being freed, they will never be written to or read. No, no, I know the fields are overloaded, but I don't think those overloaded fields are used at the point where we're trying to free the page table.
Michel is saying that it's done differently based on different architectures and different config options. Apparently, it actually allocates memory. It doesn't actually use the RCU head that's currently in the struct page, which I didn't know, so I've just learned that as well.
And to add to this, it will also depend on the config option for split PTL locks. There is also a range of cases there. In debugging configurations, that option is enabled, which is done by one of the debug options for slab, I believe. It will also add some issues there, because the PTL is not part of the entry anymore.
Mike wants the mic.
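To make the RCU-freeing idea from this part of the discussion concrete, here is a deliberately crude Python model of deferred freeing. `ToyRCU` and its methods are invented for the sketch and are not the kernel API. In particular, a real grace period only waits for readers that were already running when the free was requested, whereas this model waits until there are no readers at all.

```python
class ToyRCU:
    """Single-threaded toy model of RCU-deferred freeing."""

    def __init__(self):
        self.readers = 0    # readers currently inside a read-side section
        self.pending = []   # objects queued for freeing
        self.freed = []     # objects actually freed, in order

    def read_lock(self):
        self.readers += 1

    def read_unlock(self):
        self.readers -= 1
        self._reclaim()

    def call_rcu(self, obj):
        """Queue obj for freeing once no reader can still be using it."""
        self.pending.append(obj)
        self._reclaim()

    def _reclaim(self):
        # Cruder than a real grace period: wait for zero readers.
        if self.readers == 0:
            self.freed.extend(self.pending)
            self.pending.clear()
```

A page table walker such as an smaps reader would sit between `read_lock()` and `read_unlock()`. A page table queued with `call_rcu()` during that window stays in `pending` and only moves to `freed` once the reader has left, which is the guarantee the walker needs.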
I think it's feasible to do RCU freeing of page tables, and we can move the rcu_head, or create a special rcu_head in the page table type union, and then use that explicitly to do RCU freeing of the page tables.
Jason, over here.
I think it's a great idea. But there's still this confusing comment in GUP that I was looking into last week that says you still can't use RCU here, for some reason that doesn't make sense. I tried to deduce what that reason was, and I don't think it makes sense, or at least it doesn't make sense today, but maybe somebody knows better. I don't know.
He said that Jann Horn explained it to him, but now he doesn't remember the explanation. He might be a good person to ask.
All right, so the general consensus is we want to do it. We just think that there might be demons.
I think it would be extremely useful to write that up on the mailing list. There are people looking for projects, and especially, as you say, /proc maps seems to be really easy to do under RCU, so it's low-hanging fruit for somebody to look into. Most people in this room are really busy, so somebody else might be looking for a nice little project. Without having that described anywhere, I'm regularly asked to share my to-do list.
Same here. So yeah, this is a good one for somebody to take on, because it should be easy, but it's going to take testing. Suren.
So yeah, I looked quickly at userfaultfd, and it looks very similar to the swap case. What we do right now is drop the mmap lock before notifying user space that, hey, you need to handle this page fault, and then we retry. So I think the same approach will work with per-VMA locks, unless somebody can tell me why it wouldn't work, but yeah, maybe Peter can.
I would like to see the patch. I think what you said is correct, as per our private discussion.
So I assume what userfaultfd does is quite simple: it yields and waits for a response that resolves the fault. So I think it's simpler than swap.
Okay, so once we are done with swap, I think I can apply the same pattern.
Fantastic. All right, so one final thing that we probably want to start talking about is handling faults in device driver VMAs under this improved locking scheme. But device drivers may use the mmap lock to exclude faults. So they may be relying on having taken the mmap lock somewhere, and then they know that they can't have any faults happen. I don't intend to go through and audit every device driver that supports a fault handler. One really easy way for device drivers to make use of this is actually to use, slash abuse, filemap_fault. There's absolutely nothing stopping a device driver from putting the pages that it owns into the XArray that is attached to the inode's i_mapping, and using all of the stuff we currently have available for file systems. I'm seeing a pained expression, and I'm not surprised. There may be very good reasons not to do that, but it is one possibility. The device driver is also going to need to attest that it does not drop the mmap lock in its fault handler. I don't believe there are any that do that today, but it's something that they might be doing. Yeah, David.
You mentioned something very interesting, that sometimes the mmap lock is used to block page faults. What would currently happen if you do a fork concurrently with a page fault? Would the fork, which takes the mmap write lock, block any kind of page fault? Or would we go ahead and lock each and every VMA during fork in order to... So you could get concurrent page faults with...
Uh-oh. Oh, I suspect that there is something broken. Be ready for surprises.
Yeah, so a page fault will not take the mmap lock. It will take the per-VMA lock.
So unless you are affecting the same VMAs, it wouldn't be affected by the fork.
In the general case, device drivers have to assume their VMAs can be forked. Most simple device drivers have some very simple thing that they're doing in mmap. And if I fork or I mremap or something, I get multiple VMAs. I can't really rely on the mmap lock for serialization. Or if I am, I'm already broken with fork, right?
I think that's good logic, David. Makes sense to me.
After the per-VMA lock work, page faults can happen concurrently, right? And before that, they couldn't.
Before that, they could happen concurrently in multiple processes if you forked.
Why not? Before that, the fork takes the write lock and every fault handler takes the read lock, which will block on the write lock.
No, no, no, not during fork, after fork.
Oh, after fork.
Right, I have two VMAs. Two VMAs, two processes, two mm_structs, two mmap locks. There's no serialization there, right?
Right now, the per-VMA locking is only done for anon VMAs. So I think you don't hit that issue with drivers yet. And I think when we want to generalize that, it will have to be done with some sort of a whitelist.
Why not just say, hey, we can do it everywhere?
I don't think that's going to work right away, because there are drivers that expect to be serialized on the mmap lock. It's not just whether they're going to release it; there are some that expect that serialization to happen right now. I have not looked in detail. I know that perf does some really weird things. Like, you can mmap some pages that will hold a log of events. I don't know exactly what they do, but they do weird things with the mmap lock. And if you just try to have them run concurrently, I don't think that would work. You would have to understand what you're doing.
So the last point on the slide is that we may want to be advising device driver writers to implement the map_pages function pointer as well as the fault function pointer. map_pages actually runs protected by the RCU read lock now, as of, I think, 6.3, certainly 6.4. So they definitely can't sleep during it, but it is going to be the most efficient way to get your pages mapped into user space.
So Christoph made a nice series a while ago to try and simplify the creation of anon inodes in device drivers. If you want to go down this path, it would probably be essential to get that work completed and merged. I think there was some discussion with Al Viro or something, and it just kind of got dropped. But currently, if you want to get an anon inode that's suitable to hang mapped pages off, it's a pain.
Okay, but you can implement your own map_pages, because you do already have your own fault handler. In that same struct, you have a map_pages, and your map_pages can do anything. You don't have to use filemap_map_pages. You can do anything.
Okay, but you were talking about putting it in the XArray, that's essentially...
Well, I was suggesting that that is one thing a device driver could choose to do. It doesn't have to implement its own fault handler. It could instead put pages into the XArray and pretend to be a file system and do far less work all by itself.
I don't know any device driver that does that today. It's always, I'm a device driver, I'm not a file system. They usually don't have struct pages in a lot of cases.
Well, it depends on your device driver. I've seen it both ways.
Yeah, yeah, I know, both ways. Yeah, several cases, okay.
All right, well, I'm conscious we're into the coffee break. So, getting to my penultimate slide, removing the mmap lock entirely. So I've done everything.
I think Al is trying to make a point. You may want to unmute.
I'm sorry?
I saw a little raised hand.
I don't have any control over what's going on online.
Okay, my question was pretty much dealt with. It was which inode to use, because for block devices we do have a unique inode for all device nodes referring to the same device. For character devices, we end up with the inodes of device nodes in /dev or wherever they are, and you can bring them to hell and back. So you would need some mechanism that would give you a per-device inode. And we definitely don't want that for every, I don't know, pseudo-TTY or whatever. It's not like block devices, where we would do it uniformly for everyone. Some variation of anon inodes would be used that way, I guess. I'll need to look through Christoph's patch set again to give anybody details about that. That pretty much deals with the question I have.
Great, thanks for that, Al. I withdraw the suggestion of pretending your device driver is actually a file system.
All right, so that's also a variant. So something that we would like to get to is just to remove all use of the mmap lock when handling faults. And I think this is a multi-year project. We're clearly capturing the biggest wins first. And at some point, it's going to be kind of like the big kernel lock, right? We've got these tail-end things that are holding us back from getting rid of it, and eventually we will get rid of it properly. Maybe this is not an analogy that makes too much sense to people. I know there's a number of people in the room who are like, what's the big kernel lock? We got rid of it like 15 years ago. But we would like to remove use of the mmap lock to protect the VMA tree. The VMA tree doesn't really need to be guarded by a semaphore. We can actually guard it with a spinlock of its own. So this is one way that we can go. We can start splitting out the various different things that use the mmap_sem into their own locks.
It's not entirely clear to me if and how we can do that.
There's a bit of a tie-in between the reverse mapping locks, or rather between the mmap lock and the rmap locks. There was a previous attempt to make the rmap locks non-blocking, but that was reverted, because it didn't really work out for some use cases. And so I think because of the tie-in we have between the two, where we have to update the two kind of at the same time, it might be hard to make the mmap lock non-blocking.
Yeah, no, understood. I have put very little thought into all the details. So yeah, absolutely.
So, last slide: handling faults without the VMA lock. We are at the point where there is no lock contention. But we may have got to a point where there is cache line contention. If we have a very large VMA, particularly as Michel was saying for anonymous memory, and we have a lot of page faults being handled in that VMA, then all the threads are bouncing the cache line that contains what is effectively the refcount. I mean, it's implemented as an rwsem, but we're using it as a refcount. We're bouncing that refcount around between all the different threads as they start and finish their page faults. And this is where we're going to need to use perf. We're going to need to study where the slowdowns are coming from and where the contention is. If that's where the contention is, we do have a path, and it's the one that I suggested last year. I have gone through in detail and figured out whether we can do it, and we can. But there are complexities. Like, we have to allocate page tables while we're holding the RCU read lock, or we have to drop the RCU read lock, allocate the page tables and then come back. And so I've been talking with Paul McKenney about how we might be able to do that in a more efficient way. It's kind of nasty; we didn't come up with any nice ways of doing this. And then we start to go through all the different things that we're doing right now under VMA lock protection. We look for our lowest-hanging fruit.
So, being able to allocate and insert new anon pages. And again, we would have to say, well, we're only protected by the RCU read lock at this point, and allocating memory is kind of hard. So do we do a GFP_NOWAIT allocation, or do we drop the RCU read lock and restart after we've done the allocation? Yes, Steven.
I just want to say real quick, I know we have coffee and everything, but are you familiar with the runtime verifier, or runtime verification?
I briefly skimmed the patches.
So basically the idea is that it hooks into the trace events and you create a model. You actually define the model that you want, and if the system ever goes outside that model, it could panic, print, warn, whatever. So if you want to make sure that everything's happening the way you assume, and you have the correct tracepoints in there, you could develop the model, put it in, and run it. It's made to run in production too. Just want to let people know that this exists.
That's cool. Thanks, Steven.
Yeah, so when we get to this handling of faults without the VMA lock, we then have three or four different contexts that we might be in when we're in a fault handler: we might be holding the mmap_sem, we might be in pure RCU mode, we might be in RCU mode but actually protected by the VMA lock. And so there are all these different fallbacks and scenarios, and it does start to get quite complicated. Last year I was informed that the complexity scares people, and I hear you. So we're only going to go this far if the performance warrants it. Any last thoughts? Sounds like coffee time. Oh, Michel.
For doing faults without the VMA lock, I would like to do that for anon, because of what I was saying before: the per-VMA locking is adequate for files, and for anon it's a bit more questionable. There can be false conflicts that are kind of unavoidable with anon.
So I think avoiding taking that lock for anon would be good because of that.
Yeah, and it's totally possible to build that from where we're at with Suren's patches. We can get there. We're just going to need to bring data to show that the complexity is worth it. Thank you.
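The allocation question raised near the end of the talk (try a GFP_NOWAIT allocation, or drop the RCU read lock, allocate, and restart) can be traced with a small Python sketch. The function `fault_under_rcu` and its two callables are hypothetical names for this illustration. The step strings merely label where the RCU read lock would be taken and released, and the sketch assumes the blocking allocation always succeeds. A real restart would also have to revalidate everything it looked up before dropping the lock.

```python
def fault_under_rcu(alloc_nowait, alloc_blocking):
    """Return the sequence of steps a toy RCU-mode fault goes through."""
    steps = []
    table = None                      # page table allocated outside RCU, if any
    while True:
        steps.append("rcu_read_lock")
        if table is None:
            table = alloc_nowait()    # must not sleep while under RCU
        if table is not None:
            steps.append("install")
            steps.append("rcu_read_unlock")
            return steps
        # The non-sleeping attempt failed: leave the RCU section,
        # allocate with sleeping allowed, then start again from the top.
        steps.append("rcu_read_unlock")
        table = alloc_blocking()      # assumed to succeed in this sketch
        steps.append("restart")
```

When `alloc_nowait` succeeds, the trace is a single pass. When it returns `None`, the trace shows one unlock, a restart, and a second pass that installs the table allocated outside the RCU section.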