It looks like we've pretty much got everyone. My name is John Hubbard, and I've been working on a set of APIs called pin_user_pages(), which are variations on get_user_pages(). So I really want to walk around, but I'm not supposed to, right? Today we have a few minutes to get MM and filesystem people together in the same room at great expense, and we don't have much time, so I'm really excited that we have a moment to go through this, because it's important to get both perspectives: this is an API between MM and the filesystems, and it's had troubles. What I want to do is get to the last slides quickly. I've got a few slides on the original problem, one slide on some of the early solution approaches, a couple of slides on where we are today, a couple of pictures just to show you some of the call stacks, and then a couple of slides to discuss the actual solutions. So I'm going to try not to linger on the initial slides. What I want to do is raise awareness of the problem we've got and plan out a general direction for the solution, so that once we get past the current to-do items, we actually have a solution. Now, this has been somewhat controversial in the past, so I've tried to set this up carefully and not say certain words, like "file leasing," until I get to the end.

Here's a little background on the original problem. Jan Kara had a really clear write-up that I keep going back to. It was kind of fun, because I was sitting at lunch at LSFMM when NVIDIA tagged me with a customer bug that had the same backtrace, so I was literally talking with people about it, and they were telling me, yeah, this is what's going on. There are two variations to this problem.
This one is especially helpful because it covers direct I/O, which is less obvious, at least to me, than the other way you can hit it. The other way is by pinning pages with get_user_pages(), using a device like an accelerator, marking the pages dirty with set_page_dirty(), and then releasing them; that will lead you into this, but it's kind of obvious. Whereas this one is direct I/O: you're not involving a device — well, there's a disk, but you're not really involving a device — and you can still run into trouble. We can come back to this slide if necessary; it's got a lot of stuff on it.

Here is a summary of what that slide is actually showing you, which is that page pinning is invisible to the filesystem. If you look at the bottom box, the filesystem expects that if it does write_begin, then some I/O, then write_end, then as long as the pages aren't mapped somewhere, the pages shouldn't have any changes to them outside of those brackets — same with page_mkwrite or writepage. The filesystem has a particular view of the world, and you destroy it by calling either pin_user_pages() or get_user_pages() on that range, because all that does is raise the reference count. If we go back to the last slide, you'll see the details of why that all falls apart. I was really shocked to find out that this was not only a problem, but a long-standing problem that had just sort of slipped through the cracks, because it's pretty big.
The fact is that MM sees the world one way — we move pages around, you can pin them, they come, they go — and the filesystems see it another way — here are the rules for writing to pages — and never the twain shall meet. So, all right, we chatted about it and came up with some early approaches to how this should be solved. After some discussion, we figured: let's provide a way to identify pinned pages, and then use that information, plus maybe some other stuff we're not sure about yet, to decide what to do when page reclaim or something else finds a pinned page.

To turn that into a little more detail: we came up with pin_user_pages(), which is the same as calling get_user_pages() but passing in a FOLL_PIN flag — that's why we keep saying FOLL_PIN around here. I put a little note there about terminology: if we say FOLL_PIN, pin_user_pages, or DMA-pin, those all mean roughly the same thing. The output of that whole effort is a function called page_maybe_dma_pinned(), which makes me unhappy because it's got a "maybe" in there. I keep trying to turn it into one that is not a maybe, but then people tell me I can't have that many counter bits, and so we're stuck with maybe. Although, if you get rid of all the other fields in struct page and you're down to just a couple, then you could put a couple of counters in there.
I do think there is a path to getting you some bits in the base struct folio. I'm not going to guarantee you anything about struct page, but for struct folio — I think maybe you talked about this already — I'm planning on getting the mapcount down to just a few bits, and so the field currently used for the mapcount we can also use for a pin count. But you can't have as many bits as you might like; we might need to go to some kind of saturating scheme that says, yes, it's pinned, but we don't know exactly how many times.

Is there some way to find out how many times a page is DMA-pinned, even if it's quite expensive?

Yeah — so instead of keeping a precise count, can we keep a kind of sloppy count and go back and figure out: oh no, it actually is zero now; I know it was high, but now it's really quite low, and maybe it's actually zero.

Oh, so kind of a write-combining counter?

Something like that, yeah. Probably not, though, because it has to count up and count down as callers come and go. Maybe we can keep some kind of side data structure and figure it out. So you just look in the page and you say, yes, it definitely is pinned, or it definitely is not pinned — because a side data structure would solve everything. We haven't gone there because everything's in the page, and performance, but maybe if we're pushed hard enough and we need a non-fuzzy result, it's time to just start scribbling data structures, a bunch of them, one per page, on the side.
We don't do DMA things that often — direct I/O, that is — and direct I/O takes quite a long time, so a side data structure sounds like a good idea. It just holds one count per page. I don't know whether you've got a page flag to spare to say "this is pinned," but if it's pinned, then you can go and look in your cache of side data structures; I doubt there'd be that many of them on the system at any one time.

And yeah, we looked at page_owner, which is kind of a variation on that, but then the kernel has to be configured for it, and it's like a bad version of a struct on the side, so never mind.

But having the side data structure with the actual count in it, and just a bit in the fast path — because DMA operations are slower; they're not always a fast-path operation. I know that's not always true, but it's done less often. So you can afford the extra dereference to go into this other data structure and do your counts there, and then page reclaim and other things can just say: oh, it's pinned, I'll leave it.

That's a good point, and I don't know why that's been overlooked. I'm just thinking back about how that went through, but yeah, that could really work. You only go off and find this other counter when you need it. One thing it would break is the existing uses of this, because copy-on-write has already jumped on this, so we may end up leaving the current scheme in place and then adding a side data structure.

So I'm wondering: for pages in the page cache that you care about pinning, in which scenarios would you actually run into false positives? I think the bias right now is something like 1,024, which means you would have to get 1,024 references on a page before you get a false positive. How could that happen?
Think about a page from libc: if you're running 1,024 processes, you've got it mapped 1,024 times.

Okay. The point I'm trying to make is that for anonymous pages, I think I can get them out of there. I can special-case, for example: if a page is exclusive, I know it's never going to be pinned, and I don't have to care about it, with the new semantics that I introduced. I was wondering if you can come up with something similar — tweak, for example, the bias so that it's different for file pages and for other pages, special-case which bias is preferable based on the page type. Maybe that would give you more bits and make false positives less likely.

That sounds like a good idea.

I'm a little bit lost; I didn't quite follow. My question to Willy was: when would you ever pin a libc page?

Oh, I see. Maybe for VM migration or something. I can definitely answer that: if you're doing something like HMM — I know we want to rename it — which is basically migrating pages back and forth between an accelerator device and the host, you could do that with a libc code page.

But the page of libc is mapped so many times that it looks like it's DMA-pinned.

Oh, okay — you're saying it overflows into looking pinned. Yeah, exactly. Okay, that's a different issue. That's good; that's very helpful. Let's see if we can get a little bit further.

Okay, so this is just status; we'll go through this kind of quickly. One of the things we learned while converting various callers — a call site being something that calls get_user_pages() — is that basically, if you're touching the page contents, as opposed to something in struct page, you usually want to use pin_user_pages(); otherwise you just get your pages.
So most things are converted, except for the filesystems, which appear to need an all-at-once conversion; I'll show you a slide in a minute. Is there anything else to mention on this? Oh yes — the low-level note at the bottom I thought was interesting, because while I was converting a lot of these callers, I noticed they were all doing the same pattern: pin the pages, do something, set the pages dirty, unpin. And so I cleverly factored it into unpin_user_pages_dirty_lock() and converted a whole bunch of things. And now it looks like the pattern itself was wrong, because that's basically how you get into this bug: you're calling set_page_dirty() outside of filesystem knowledge, and so you've destroyed the world. I can't see how that can ever possibly be right, and yet it was all over the place. So now it's cleverly converted into a central helper function, so that all the evil is concentrated in one spot.

The second page out of two on status: ext4 has a workaround now that avoids the original crash — Ted, so, the commit right there — and that's pretty convenient. It still leaves things corrupted, but at least it doesn't crash the kernel; it just says, okay, the page's buffers are missing, don't try to use them. Meanwhile, this copy-on-write thing, which there have been articles about and a lot of discussion about — I'm sure you saw that go by — is actually using page_maybe_dma_pinned() to help decide what to do when you're forking and doing copy-on-write. Which is why I say the side structure may not work on its own: it may have to be in addition to what we already have, if we want this to continue to work. This shows a few selected call paths.
I very carefully picked the things that I wanted to see, so I could tell the difference between what's doing iomap, what's doing get_page() versus get_user_pages(), what's funneling through the iov_iter_get_pages() paths and what's not, and what I have to convert versus what I get for free. So I have a small patch set right now that converts some of the stuff near the bottom — instead of iov_iter_get_pages(), it makes the conversion — and that automatically takes care of a lot of things. I think it takes care of all the things that are using iomap, but it leaves a lot of things unfixed, like a bunch of the network filesystems, and a few stray things that kind of blow up in your face when you look in there.

And the reason — let me go to the next slide and show you — is that at the end, when you're all done doing I/O, you call bio_release_pages() for the direct I/O case. I show this because everything funnels into it, but it's kind of like you can't get there from here: you toss your bio submission in at the top, it goes through a bunch of machinery that, believe me, you can't do much about, and at the end you've lost all track of everything except: okay, your pages come back, and now you can call bio_release_pages(). And bio_release_pages() at the moment calls put_page(). That has to be converted to unpin_user_page(). And once it's converted to unpin_user_page(), everything that put pages in at the top, on the submission path, had better have called pin_user_pages() instead of get_user_pages().
So after weaseling around for months, poking at ways to do an easy conversion, I find that some of these things are going to have to be converted by hand: go into that filesystem, sort out which things are doing a get_page() and which are doing get_user_pages(), and keep track of those. The call site needs to know the difference between pages acquired from get_user_pages() versus get_page(). Since all these filesystem people are here, I just want to point out that if anyone were so inclined to go factor things appropriately for some of these filesystems, it wouldn't hurt my feelings at all, because that's tricky for someone who doesn't know the code.

Okay, so here we have just a couple of slides, one to help you visualize where we're going. This is the original problem with a little file-lease box on top. Keep that in mind, and we'll go to a list here. We have, I guess, 12 minutes, and in those 12 minutes I want to walk out of here with a completely finished design for what to do after all the conversions are done. So imagine we've converted everything that's supposed to call pin_user_pages(), and that feels very nice — you now know what's pinned — but so what? Everything's still broken. So the proposal here, which is not something I invented, is right in the middle: require a file lease. For this page range — the address range, I guess you could call it — you have to take a file lease before you're allowed to call pin_user_pages(). Or, with the opt-in idea, perhaps only before you're allowed to call pin_user_pages() when you pass in FOLL_LEASE, which I just made up. The advantage of this is that it's a correct solution.
It connects the filesystem and MM: if you're going to go work on pages that are associated with a filesystem, you have to connect to the filesystem somehow, and this is really the only proposal I remember hearing that clearly solves it. So I'll throw it open to comments and discussion. Do you love it? Hate it? Is there something better? And how many people will it take to get Ira off of CXL and onto working on file leases?

Wow. Yes, we are there. I was just going to say that leases were hard; there were a lot of roadblocks that I don't even remember anymore. But I agree that the filesystems need to be made more aware of this, and there needs to be communication — but the filesystems don't always like to let go of their pages.

Yeah, what do you say? You've got a filesystem guy behind you.

So I don't mind that solution myself, but I can imagine some people being potentially worried about it, because what's going to happen is we're basically telling the filesystem that all of the pages in this one-gigabyte range may be marked dirty at some point in the future. What does that mean? Well, it means that if those pages don't have blocks allocated to them yet, we're going to have to allocate blocks to all of those pages. If it's a reflinked file, we're going to be potentially doing copy-on-write when the pages aren't dirty; the filesystem is going to have to do a copy-on-write operation on that entire one-gigabyte range, even if none of the pages are ever actually dirtied — but they will be.

Believe me, yeah: if you did this, you did it because you're going to scribble on the pages.

Yeah. If the use case is one where eventually they all will be dirty anyway, then you actually want that behavior at the beginning — great, right? I have actually already implemented what you want, for network filesystems.
And then I abandoned it, because of truncate, and also direct I/O, and a couple of other things — it's actually really hard to get right. So I've implemented it; I kind of have it working. If you want to look at the code, I can point you at a branch with it.

Please do.

But it's really hard to get right. And I don't care about truncate — we shouldn't care about truncate — but unfortunately, some people do. And it gets even more fun if someone's doing a write and then someone does an fallocate to punch a hole out of that range, so you might end up squidging your stuff down. Plus, when it comes to network filesystems, you have to deal with bits of your range that have different authentication on them, which you end up having to merge, and stuff like that. So yeah, I can give you code that does this. I thought it was just too complicated, but you're all welcome to have a look.

It would be good to see; it does not sound promising.

So I might not fully understand the leases that you have in mind, but at least for Btrfs, set_page_dirty() is a big problem, but it's not the only problem. The reason we care about set_page_dirty() is also that before we do I/O on a page, we basically lock it down so the page can't change anymore, because we take a checksum of the page. And it's kind of important that, once we checksum it, it doesn't change after that, because we want the checksum to match, right?

Well, one idea — I got involved in this because I do RDMA, or did at one point, and then I had problems with PMEM. One of the use cases we see is RDMA, and if RDMA is DMA'ing to the page, you probably shouldn't be writing it back anyway.
And so one flow could be: user space has to lease the page, they pin the page with their memory regions, they do their I/O and whatever they're doing, and then if they want to actually fsync that page, they have to release the lease; that releases it from the MM and turns it back over to the filesystem, and the filesystem does its thing. I don't know if that would be performant for all RDMA apps — that's a lot of overhead for RDMA apps — but maybe, I don't know.

This sucks, right? This is the biggest barrier we have to page cache sharing, because the fact that MM can just kind of go behind our backs and mark things dirty is a huge problem for us. So I don't particularly care if the solution is onerous or whatever; it's got to be fixed. And that's to say nothing of the O_DIRECT case, where user space can just change things — I straight up have to tell users: don't use O_DIRECT with Windows VMs, because your filesystem gets corrupted, the checksums won't match; you have to turn off data checksumming for these virtual machine images. These are terrible user experiences, is what I'm trying to say, and they exist purely out of our own doing. We control all of this; we're the ones that set up how these things work. I'm not super thrilled with the idea of saying, okay, this range may become dirty at some point in the future, because Btrfs has a lot of scaffolding that has to be set up for a dirty range. We have to reserve space. And not only that: we have to be able to write that space back in order to reclaim space and make forward progress under, you know, low-space conditions. So being told that this area may become dirty at some point, but there's nothing we can do about it until the lease is up or whatever — it puts us in a bind, right?
That's just something user space has always been able to do, and I think it's something that maybe not everyone thinks about, because with DIO, you think of direct I/O as being to and from the disk, bypassing the page cache. But the corner case is that the buffer you're doing direct I/O to or from can be an mmapped file — from the same filesystem, or from another filesystem — and user space can actually cause potentially interesting deadlocks if your filesystem doesn't handle this, because the buffer can be mmapped from the same file you're doing DIO to or from. So I'm also wondering: how is this going to affect the DIO fast path, if we then have to get a lease in order to pin a page first?

I think the important thing is that that's a really damn rare thing to do, right? Most users don't do that. Most of the time you see it being done, it's a test suite.

Is it a test suite, really? It's definitely not something user space normally does, but think adversarially.

Yeah, and there's a very special path — we have a special path specifically for this problem in Btrfs — and it was very poorly tested until we started running Btrfs more widely in production, and suddenly we were tripping over this all of the time. The only way I know how to trigger it consistently is a DIO write into an mmapped region, or a DIO read into an mmapped region: I can trigger this code path every single time. I don't think we're doing that in production, but I can trip this code path consistently, all of the time, thousands of times a second. So somebody is doing something in this area; I have yet to figure out what it is.

That's what the original bug report was: a DIO read. That's how I reproduce problems — I know how to trigger the path, and it's simply a DIO read into an mmapped region.

Yeah, the original slide at the beginning was exactly that.
Yeah, so that's my test case, and it's in xfstests, and XFS has a test for the deadlock, because we have to do all of this: oh, somebody dirtied the page and didn't tell us; now we have to have an async worker thread that goes and does all of the things that were supposed to be done at page_mkwrite time for us, and then we can submit the page for writing.

I think fixing this may cause performance regressions in some cases, but on the other hand, if it wasn't correct to begin with, your performance was an illusion.

Right, yeah, and this is where I go back and forth, because ext4 doesn't have this problem, XFS doesn't have this problem — they don't do data checksumming, so nobody ever notices. But Btrfs notices, because your Windows VMs start throwing EIOs because checksums don't match, and Btrfs is paying the price for this all over the place. My approach has been, you know, play stupid games, win stupid prizes: turn off data checksumming for this, or whatever. But that's clearly a not-great solution. In the end, we have DIO specifically to be the fast path, so we are kind of letting our users run with scissors a little bit. Should we do all this work to make it better? I don't know. I certainly would like it for things like page cache sharing and other stuff, but...

Well, that's interesting. So you say, should we do this? I guess you're thinking that leaving it as-is is going to work out.

I'm not necessarily saying that. I'm just saying that if the solution means all DIO for applications that are behaving properly suddenly tanks, that's not a valid solution, right?

Right.

So, Josef, I just wanted to say it's not just Btrfs that suffers from this problem. It's any filesystem on top of RAID 5, because you've done your parity calculation and now it's wrong. Oops.

Without RAID 5 doing double buffering, right? That would be...
I just do a double buffer, but whatever it is. But one of the things that we could do is maintain the fast path for the places where — because of checksumming or for whatever other reason — the pages are not allowed to change while in flight, filesystems can intentionally turn the lease requirement on for those specific cases, and then move over to leases for our common cases.

Yeah, because Btrfs is clearly not winning the speed award for this path anyway, so I'm willing to eat it, just for correctness for us. And we already have the stable-pages helpers, so you can just say: okay, this filesystem requires stable pages; do the lease thing, and otherwise leave it alone.

So that FOLL_LEASE in the flags probably would help you, then: if the call site passes in FOLL_LEASE, it says "I'm ready for the new behavior," and if it doesn't, then it's not.

Yep, that would be reasonable for me.

Okay, I'm out of time, so I will let Ira come up next.