Thanks. I was originally going to talk about memory error handling, but a few topics have already come up in the memory-management sessions that overlap with an issue I'm currently having with hugetlb PMD sharing. As I put up there, this technology is almost 16 years young, so it's not really something we would normally talk about at LSFMM, but it wasn't something I was actually aware of until I started looking deeply into the hugetlb code.

As a very brief overview: processes can share PMD pages if they have hugetlb shared mappings. We actually share an entire PMD page, so the sharing is PUD-sized, which is usually one gigabyte on x86. One of the reasons for doing this — it wasn't really explicitly the reason it was originally done — is memory savings. One of the examples our database group likes to give is a one-terabyte shared mapping with 10,000 processes sharing it. With 2MB huge pages, the PMD-level page tables for one terabyte come to about 4MB per process, so 10,000 private copies is roughly 40GB; do the math and PMD sharing saves about 39GB of memory.

At a very high level — this diagram only shows a four-level page table, but this is roughly what it looks like — you have process A and process B, both pointing to the same PMD page, and that is where the actual data pages get addressed. So you set up sharing like this. There is also a routine called huge_pmd_unshare(). Whenever you change the attributes of a mapping or a range of a mapping, like an mprotect() to make it read-only, we have to unshare that shared mapping. What happens is that we actually clear the PMD pointer in the PUD, so we effectively unmap a PUD-sized area. This happens, like I say, whenever you change protection, truncate a file, punch a hole in a file, or maybe a page just gets unmapped in that shared section for migration. When we call huge_pmd_unshare(), it has to hold the page table lock — that seems kind of obvious, since it's messing with the page table — and it also has to hold i_mmap_rwsem. The sharing itself is simply keyed off the refcount of the PMD page. That's a very brief overview of huge PMD sharing.

A couple of years ago I stumbled upon a very ugly race in this code. During a page fault, and in various other places, you do a page table walk to get a pointer to a PTE, or maybe, in the page fault code, you allocate a PTE and get a pointer to it. The problem is that another thread could, at the same time, be doing a huge_pmd_unshare, which means the PTE pointer you just looked up, or the PTE you just allocated, is no longer valid. It more than likely points into the page table of a process you were sharing with, or worse, it could theoretically be unallocated and point way off somewhere else. So using that PTE pointer, even just to take the page table lock — because it is what you key off of to get the page table lock — is invalid, and certainly writing to the PTE at that location is invalid. That looked really scary to me, and once I knew it was actually possible, it wasn't too hard to write user-space code to hit that race condition. That scared the heck out of me a couple of years ago, and resulted in the commit that's upstream today.
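The unsafe pattern that commit addresses looks roughly like this — a simplified sketch, not the actual kernel code; the real hugetlb function signatures differ and have changed across releases:

    /* Faulting thread: walk or allocate, and remember the PTE pointer. */
    ptep = huge_pte_alloc(mm, vma, address, huge_page_size(h));

    /*
     * Concurrently, another thread doing mprotect()/truncate/hole punch
     * calls huge_pmd_unshare(), which clears this process's PUD entry.
     * The shared PMD page now belongs only to the other sharing process,
     * or may even have been freed.
     */

    /* Back in the faulting thread, ptep is stale: */
    ptl = huge_pte_lock(h, mm, ptep);        /* lock keyed off a stale pointer  */
    set_huge_pte_at(mm, address, ptep, pte); /* writes someone else's page table */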
What that commit does is use i_mmap_rwsem for synchronization. As I mentioned earlier, you have to hold that semaphore in write mode when you're doing an unshare operation. My thought was — and this is what is in the code today — that we hold that same semaphore in read mode during fault processing, or any time we do a page table walk to look up the address of a PTE pointer and use it for some purpose, writing to it or doing something with it. The good thing about that is that faults can run in parallel; huge_pmd_unshare is blocked during faults because it requires write mode, and all is safe. The problem is that i_mmap_rwsem is held for the duration of truncate and hole punch, and maybe even more importantly, unmap operations. That can cause quite large delays in fault processing, and even more significant, I think, is that process X, which is unmapping the shared area, can cause delays in process Y's page fault handling.

I actually pointed this out on the mailing list not too long ago, and I thought, well, how bad can this really get? So I took a test system with 48 CPUs and 320GB of memory, and wrote a very simple process that, in an infinite loop, maps a 250GB file as a shared mapping, faults in all the pages, and unmaps it. I ran 48 instances of that in parallel and looked at the worst-case fault latencies. As you can see from this slide, running that for a couple of hours I end up with a worst case of over two seconds of latency during fault processing, waiting for that i_mmap_rwsem lock. Not really pretty.

I started talking about different ways to address this upstream. The only feedback I got was from David, and I think David said, well, let's just disable huge PMD sharing — which some people may not really agree with. The approach I've come up with is something that was actually discussed a little when Liam and Matthew were talking about mmap_lock scalability. What I'm doing — and I actually have some patches prepared — is to revert the patch that uses i_mmap_rwsem for the synchronization and, this may sound familiar from that talk, add a per-VMA semaphore for this PMD synchronization. This would be hugetlb-specific; I would hang it off of the vm_private_data field, so it's not anything added to vm_area_struct itself, and it would only be added to VMAs that are capable of sharing. Just like i_mmap_rwsem is used today in the upstream code, it's held in read mode for faults and write mode when calling huge_pmd_unshare (there is a rough sketch of the idea below). The good thing about this is that it limits the scope of contention in some respects. Things like truncate and hole punch, which can be done by anyone, and migration as well, still take it; but when you do an unmap, you can only block threads in the same process — the same mm_struct, the same address space.

So, some quick comparisons with that same benchmark test. If you'll notice here, our maximum fault wait time goes down from over two seconds to a little over a tenth of a second. Our overall wait time isn't significantly less, but our worst case seems to be quite a bit better. So I was just curious about thoughts on this approach.
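A minimal sketch of that per-VMA lock idea — hypothetical structure and names, a sketch of the proposal rather than the actual patches, assuming the pointer can be installed in vm_private_data at mmap time:

    /* Hypothetical per-VMA PMD-sharing lock, hugetlb-specific. */
    struct hugetlb_vma_lock {
            struct rw_semaphore rw_sema;   /* read: faults, write: unshare */
    };

    struct hugetlb_vma_lock *vl = vma->vm_private_data;   /* sharing-capable VMAs only */

    /* Fault path: faults can still run in parallel within the VMA. */
    down_read(&vl->rw_sema);
    /* ... walk or allocate the PTE, handle the fault ... */
    up_read(&vl->rw_sema);

    /* Unshare path (mprotect, truncate, hole punch, unmap, migration). */
    down_write(&vl->rw_sema);
    /* ... huge_pmd_unshare() clears the PUD entry ... */
    up_write(&vl->rw_sema);

Unmap of one process's mapping then only blocks fault handling in that same mm, rather than every process sharing the file.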
I know that during David's talk about page table reclamation, the whole issue of trying to keep page table pointers stable while you're doing something like a page fault has to be addressed as well. I haven't looked at that series, but it seems like they're using sequence counts or something like that to keep that data stable. That may be overkill here, I don't know; I just wanted to get some additional thoughts on this. No thoughts?

There's a lot of eye rolling in the room; I think you've broken a lot of people's brains here. I think Liam's hyperventilating, in fact.

There's one more kind of interesting data point here. The upstream complaint was about this maximum fault wait time — actually, they weren't complaining about a specific number, but it was another database vendor who was using hugetlb for shared data, and they just noticed that latencies went up when they went to a more modern kernel; I'm thinking it's one with that new change in locking. So I went back to see how bad, or how good, the scalability was originally. I just put this together last night, so I'm not 100% confident in the numbers, but if I just revert that and go back to the unsafe code, where we could potentially corrupt data, the interesting thing is that the maximum latency during a fault is even worse than it is today. One of the very interesting things is that we were stalled in fault processing far less often — we're talking something like 100,000 times more often in the code that is there today as opposed to the original. So I still think we had the potential for these very long fault delays in the original code; we just had less opportunity to actually wait. That can be explained by the scope of the locking: we only took it when actually doing a PMD allocation, noticing that there was not a PMD page there, and then having to kick in the locking to set up the sharing.

So that's all I had; like I said, I was hoping for a little bit of comment. When David said tear out huge PMD sharing, I said, well, okay, we'll just use mshare instead, but after this morning's session I don't think that's going in next week or anything.

What type of lock did you put in? Excuse me, what type of lock did you put in, and where did it live again? So what I did — Liam, this was sparked by the talk about mmap_lock scaling that you had earlier, maybe on Monday — I actually added a per-VMA lock. What type of lock? It's just a reader/writer semaphore. Yeah, those will spin though. We were thinking of a flag or something — not invalid, inactive. Whoa, what just happened? We're thinking of putting in some sort of flag to show that the VMA was in the middle of an operation. So if you did that it might actually... is he gone? Okay, I'm here; yeah, sorry, the screen went blank. Sorry. So we were thinking of putting in a flag that would show that there was some sort of activity on the VMA, and you would restart or retry for anything else. If you use an rw_semaphore, I think those things will spin, and they will spin if there's any activity, so if there's any activity it will in fact get worse. It might be that you're hitting that, and it's affecting your results in a negative way. You might have better luck with a different type of lock, or a different locking strategy, I don't know. I mean, it's the right lock to use.
If the rw_semaphore implementation is bad, then it needs to get fixed. Well, it's not bad; it exists for a reason, it's just not for this. Well, I don't know; I think the way he's using it is fine.

The point I wanted to make was that I cringed a little when I saw, oh yeah, we do huge PMD unshare when you do mprotect(). Because, thinking about it for a minute, the database people said to me that what they really want is for mprotect() to affect every process that is accessing the database. Yeah, that's what they want. They don't have that with huge PMD sharing. So if we had the magic flag that we talked about in the earlier session that said, no, really, I want you to share the page tables, I want you to share the mprotect() information — and I think that's a good thing, as if these things were multi-threaded, but they're not, for reasons — how much of this problem goes away? Yeah, I think most of it would. That's of course without looking at all the details, and I think the suggestion was to actually store, or somehow reference, those page tables based upon your sharing object, like the inode or something, correct? The model that we discussed yesterday, where we have a flag in each VMA and you just use that — would that work? Well, because this is transparent, we would still have to have user space opt into it, because we can't just say, oh yeah, we transparently shared the page tables and now mprotect() affects everyone who has it mapped, without everyone who has it mapped opting into that behavior. So if we keep it the way it is, and Mike's solution gets changed to use the RCU way of reading things and an inactive flag on this particular special VMA... I'm not sure, I can't picture it off the top of my head; I need to think about that for a bit longer.

Is David in the room? One of the things — like I say, I haven't looked at those patches yet — is it sequence counting that is used to keep the page table in a stable state while somebody else could be reclaiming or removing part of it? It's not really for that. When I was talking about the sequence lock, the sequence lock is only used to synchronize, for example, our fork code against our get_user_pages code; it's not for synchronizing page table activity. It's just to tell somebody who operates on the page tables to try again — to undo whatever you did and try again. There is something in between for page tables, especially with regard to the GUP-fast code. We actually use, I think, one of two variants, either via RCU or via a TLB flush, where you can synchronize against that. So what you would do is, after you perform your changes — for example, if you wanted to rip out a page table, you would rip out the page table but not free it yet — you would do a TLB flush, and then you can free it, or defer freeing it, because then you know that, for example, GUP-fast is no longer walking over it. If you need something similar, the question would be whether you can also defer, for example, some kind of freeing after that. But the sequence count could eventually work just to tell whatever was walking and doing stuff to try again because something changed. The problem of access after free is a different story, though; the sequence count is really just to tell somebody to try again because something changed. Yeah, I think I may be able to do that with some kind of a sequence count as well.
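The retry idea being discussed could look roughly like the following — a sketch only, with a hypothetical pmd_share_seq field and the writer assumed to be serialized by an existing lock; as noted above, a sequence count only says "try again" and does not by itself prevent use-after-free, so freeing of the shared PMD page would still have to be deferred (RCU, or after a TLB flush):

    /* Unshare side: bump the count around the PUD-entry clear. */
    write_seqcount_begin(&vma_priv->pmd_share_seq);
    /* ... huge_pmd_unshare() clears the PUD entry ... */
    write_seqcount_end(&vma_priv->pmd_share_seq);

    /* Walker side: retry if the count changed while the PTE pointer was in use. */
    do {
            seq  = read_seqcount_begin(&vma_priv->pmd_share_seq);
            ptep = huge_pte_offset(mm, address, huge_page_size(h));
            /* ... only tentative use of ptep here ... */
    } while (read_seqcount_retry(&vma_priv->pmd_share_seq, seq));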
What actually happens if the last user of a shared PMD goes away — say the last one calls huge_pmd_unshare — will the page tables then get freed, or how exactly does that work? Well, it's actually when the second-to-last user goes away; then it basically looks like just one process with its normal page table setup. Okay, so my gut feeling is that the GUP-fast code probably isn't synchronized with that, and we might have some weird issues already. That's my gut feeling, but we should look into it. But yeah, I'll have to look into the details; maybe reusing even the same sequence count, or something like that, could already work and fix some other unexpected cases.

Okay, well, that's all I have, but I'm probably going to send out this per-VMA locking as another RFC, just to see what people think about it. So thank you.