All right everybody, we're gonna go ahead and get started. I'll let y'all sit down. So just a few notes. First of all, I'm Josef Bacik. Welcome everybody. It's been a few years. We finally made it happen, so hooray for that. I wanna go over a few things. Clearly this is a bit weird for us. We're on a Zoom call right now, so we have remote attendees for the very first time ever, I think, for this conference. For anybody that's presenting in this room, stay right here. This is where the camera's pointed, this is where the microphones are. So if you wander down the stage, the people on the Zoom call are just not gonna hear or see anything. The microphones are on the table. Please use them. We say this every year, but this year it's particularly important because, again, people on the call will not be able to hear you if you're just yelling at each other. The other fun thing is that this is being recorded and will be put up on the Linux Foundation's YouTube. So people like me need to watch our mouths. So this will be a very fun three days of seeing how long it takes me to start swearing into this thing. Other than that, I think everybody has everything. The schedule, I've gotten a lot of questions about. It was in the email. It's on the website. If you go to the website and click on Schedule, you can see the schedule. It should take you to a Google Sheet, which will be changing constantly as we move things around. It has all of the Zoom links, so if you have to bug out and watch from your room or whatever, you can click on that. The Zoom password was sent to your email, for both in-person and virtual attendees. If you can't find it and don't know it, find somebody on the program committee. That is me; Omar right here. I don't have my glasses, I can't see anybody else. Why don't the program committee stand up, so you know who you are. So you've got McCall, got Dan Williams back there, Alexi's around here somewhere. There's me here, Alexi, Daniel. So these are your program committee members. Thank them profusely, because they've been doing this for three years. We've been trying to do this for three years, and we've not rotated, so we've been doing it for this long. We are done. Maybe they're not, but I'm for fuck's sure done. So. I thought you left me the five minutes. Yes. I want the full... God damn it. Anyway, so yeah, this has been a long time coming. I am super excited to see everybody here. I am super happy that everybody has managed to stay safe and relatively healthy. That is awesome to hear. Also, Martin unfortunately had an emergency. He is hopefully on the Zoom call; he's okay, but he's obviously not here this week. He'll be on the Zoom call. Other than that, yeah, let's get started. I think BPF is in their own room, so the BPF guys can leave if they want. And Willy, you wanna come talk about folios?

Thanks, Josef. All right, I wasn't expecting this to end up on YouTube, but oh well. So, I was originally planning on skipping straight to the future work slide. These are actually the slides I'm going to be using in Austin in three months' time, but I'm not gonna do the talk I'm doing there. I'm just gonna skip around and use them to illustrate things, primarily for the file system people, who I think are my audience for this, because file system people don't know about memory management. I asked a file system person earlier today about compound pages, and they said, is that like a compound fracture?
Yes, multiple things are going wrong all at the same time. They are exactly like that. But seriously, this is what happens when you say to the MM, hey, give me a compound page of order three: it allocates eight consecutive pages, and each of the last seven pages points back to the first one and says, hey, that's where all the real information is that you need to know. And so when you call something like lock_page() on a tail page, it actually locks the head page. And then you start asking questions like, well, okay, is this page uptodate? And it doesn't actually know the uptodate status of each individual page; it knows the status of the entire order-three blob, the compound page. And you can already hear I'm having trouble talking, because I'm using "page" to mean two very different things. There's PAGE_SIZE bytes, which is 4K or 8K or whatever, and then there's the size of this, whatever this page happens to be. And because I'm having trouble talking about it, that means we need a new term. So this is why I came up with the folio, right?

Anyway, I kind of skipped over the "why are we doing this?" We're doing this because we really want to be able to manage memory in larger chunks than 4K. On your laptop right now, you've got what, 16 gig, 32 gig, right? That is millions of pages, millions of four-kilobyte pages. It is a pain. See, Josef? That's how you don't swear on YouTube. It's a real pain to deal with that many pages. We really need to manage memory in larger chunks. We are wasting so much time, we're wasting so much energy, just trying to understand, oh, I'm moving all these pages around onto this list and that list, and it's stupid. We just need to manage memory in larger chunks. But in order to do that, we need better infrastructure. We need the folio.

So here I'm just talking about some of our bad interfaces. Fundamentally, if you see something that takes a struct page, you don't know what happens if you pass a tail page of a compound page to it, or if you pass the head page of a compound page. We have all kinds of functions, all kinds of crazy things, and you start asking around, well, what is this supposed to do? And chances are nobody actually thought, well, what if this is a compound page? Because for so long, file systems have been dealing with order-zero pages. Other than for tmpfs, it has just not been the case that file systems see compound pages. File systems have always seen order-zero pages.

So, we have a new type, struct folio. A folio refers to the head page; it is essentially an alias for the head page. So if you see a struct folio, you know you are talking about the head page. You don't need to say, oh, is this a tail page? If it's a tail page, go and get the head page. You just forget all about that. You just use the folio interfaces that exist. And they do now mostly exist. I've been kind of dribbling them into the tree over the last few kernel releases. I won't say we have complete coverage yet, but certainly as I go and convert more and more code to use folios instead of pages, there are fewer and fewer places that I have to convert first. I can generally go to any given file system and convert it from using struct page to using struct folio, and all the interfaces that I need already exist.
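As a concrete sketch of the ambiguity being described here, this is roughly what the two worlds look like. The helpers are real 5.18-era kernel interfaces; the function wrapping them is invented for illustration:

    static void compound_vs_folio(struct page *page)
    {
            /* Page world: state lives in the head page, and callers have
             * to know that. lock_page() on a tail page silently redirects
             * and locks the head instead. */
            struct page *head = compound_head(page);

            lock_page(page);
            pr_info("whole compound page uptodate: %d\n", PageUptodate(head));
            unlock_page(page);

            /* Folio world: page_folio() always returns the head, and a
             * struct folio can never refer to a tail page, so the question
             * "is this a tail?" cannot even be asked. */
            struct folio *folio = page_folio(page);

            folio_lock(folio);
            pr_info("folio uptodate: %d\n", folio_test_uptodate(folio));
            folio_unlock(folio);
    }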
This is kind of a timeline of when stuff went upstream. I should probably add a 5.19 column with all the stuff that's currently sitting in -next. I'm not gonna talk about when... oh, maybe I should.

So, right now, large folios are only allocated during readahead. We don't yet do it in the write path. So if you are writing to a part of a file that you have previously read, readahead may already have created large folios, and the write path will see the opportunity to use the folios which are already in the page cache. But if you're appending to a file, or if you're writing to a part of the file which happens to not be cached right now, then it will create order-zero pages. And we need to go through and figure out, okay, what criteria should we use to start creating large folios in the write path?

So, this is really what I wanted to talk to this audience about: where do we go from here? Some things are obvious. There are large chunks of the MM which don't use folios today and should. For file systems, what I really want is that they take a look around and see what code they don't have to write themselves. That is, Dave Howells has done a huge amount of really important work creating common code for network file systems. It was done in aid of making fscache work better, but it's really created a common layer that any network-based file system can use. And even some file systems which aren't technically network file systems; I think FUSE can use it, and maybe a couple of other file systems can also use it. It's really pretty neat and it's worth looking at. It's not a perfect replication, but a lot of the inspiration that we took in order to create the netfs layer actually came from iomap.

And what I'd really like to see is more file systems moving away from using buffer heads and towards iomap. That's for block-based file systems, of course. It just makes everything better. It really does. It would technically be possible to convert all of the buffer head code from using pages to using folios, and we're probably going to do that anyway. I've made a start on that; some of the patches for doing it are in 5.19, or will be in 5.19. But there are some code paths, like marking things uptodate and marking things dirty, that are actually O(N²) in the number of buffer heads on a page. That's manageable when you're talking about 512-byte blocks on a 4K page, because you've got eight of them, so that's on the order of 64 operations. But when you're talking about, oh, I've got a two-megabyte page and one-kilobyte blocks, all of a sudden you're talking about millions of operations, and it's a linked-list walk as well, so it's about as pessimal as you can get for a modern CPU. People need to move away from buffer heads. Those same kinds of operations, by the way, are a bitmap scan in iomap. You've got one bit per block, and that's clearly cheaper for a CPU than walking a linked list.
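To make that comparison concrete: iomap keeps per-block state as a bitmap (struct iomap_page in 5.18-era kernels), so the common questions are bitmap scans. The sketch below is modeled on that, with invented names:

    /* One bit per block in the folio, modeled loosely on iomap's
     * per-page state; names are invented for illustration. */
    struct block_state {
            spinlock_t      lock;
            unsigned long   uptodate[];     /* 1 bit per block */
    };

    static bool blocks_all_uptodate(struct block_state *bs,
                                    unsigned int blocks_per_folio)
    {
            /* A linear scan of a few words, cache-friendly. */
            return bitmap_full(bs->uptodate, blocks_per_folio);
    }

    /* The buffer head equivalent walks a circular linked list of
     * struct buffer_head, and paths that do one walk per buffer are
     * O(N²): 8 x 8 = 64 steps for 512-byte blocks in a 4K page, but
     * 2048 x 2048, around four million, for 1K blocks in a 2MB folio,
     * every step a pointer dereference. */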
Do you want to take comments now, or... Sure, yeah.

Sure. So, you've already talked to the ext4 community a bit about this subject, and I was wondering whether it would be useful to go into some of the discussions we had over how to actually convert to iomap, as in read path only, or the happy easy cases first. Because switching to iomap, as a file system developer, can seem rather daunting, and going to iomap can be an incremental process. Maybe that's worth a bit of discussion now, perhaps.

I think a bit of discussion now is a fine idea. Do you want me to kick that off, or do you want to kick it off? Sure, why don't you kick it off? I could do it, but I would be repeating what you told me. Well, that's fair.

All right, so, as Ted said, he and I have had discussions, and not just he and I, of course; Jan and others as well. So iomap is currently missing some features that other file systems like ext4 have. There's DM verity... not DM, fs-verity, and a few other things. So the easy start to this is to say, okay, we'll define one set of address space operations for the easy cases, where those optional features aren't enabled, and then we'll keep our current buffer head ones around for the rest. And it's not too dissimilar to how ext2 had the nobh mount option, except it wouldn't be an explicit mount option. It would be, okay, we're going to choose the right set of operations for you. So if you didn't enable things like fs-verity, or... what are the other features iomap doesn't have yet? fscrypt, fsverity; I think compression tends to be handled by individual file systems.

I think the other thing that's probably worth mentioning here is that, at least at the moment, a lot of the value-add for folios is on the read path, because readahead is able to create large folios. And if a file system is doing something special on the write path, in terms of how it handles reflinks or delayed allocation, that doesn't necessarily fit yet into how iomap does things, there is an alternative to trying to get it into iomap first: you could simply do the read path and keep the write path in the file-system-specific buffer head approach, even if what that means is that on the write path we have to break up the folio. At least on the read path, we'd be able to get the benefits of folios.

Yep, absolutely. Everything Ted said, I agree with. And to elaborate on that, obviously one of the things that does need to happen is that iomap needs to grow support for the things that it doesn't currently have. And that could come from any of the file systems which happen to implement a feature that iomap doesn't.
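The "one set of operations for the easy cases" idea could look something like the sketch below. "myfs" and both tables are invented, but IS_ENCRYPTED() and IS_VERITY() are real, and ext4 already picks between aops tables in a similar way in ext4_set_aops():

    static const struct address_space_operations myfs_bh_aops;    /* buffer heads */
    static const struct address_space_operations myfs_iomap_aops; /* iomap based */

    static void myfs_set_aops(struct inode *inode)
    {
            /* Features iomap doesn't support yet keep the legacy path;
             * plain files get the iomap path, no mount option needed. */
            if (IS_ENCRYPTED(inode) || IS_VERITY(inode))
                    inode->i_mapping->a_ops = &myfs_bh_aops;
            else
                    inode->i_mapping->a_ops = &myfs_iomap_aops;
    }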
So if btrfs wants to do the...

So I'm working on... I suppose you can hear me. So I'm working on converting btrfs, or at least making btrfs use iomap, and some of the problems which I faced, of course, were the way btrfs does its writes, in the sense of the way data is written to disk. So there were hooks which were added, at least in my temporary code right now, which do the IO submission, primarily for the bios. It basically pulls the bios from iomap and does the writes accordingly. We are doing something similar for direct writes already, and we'll have something similar for buffered reads and buffered writes as well, which would basically do all the compression or csum checks and all that stuff internally. So yeah, wait for the code to come in.

That's fantastic. Thank you. I really appreciate that you're doing that work. That's marvelous. The more of us who are doing that, the better. Good job.

Is this one working? There we go. So at least for btrfs, we have this kind of really annoying thing where we have to order the page lock before our internal file system locks, because of page reclaim. I don't know if this is on your roadmap, but I find it particularly fucking annoying and I would love to have that addressed. And I know it's not super easy, right? Because you have to lock the page to know that it's on a mapping, and then call into the file system to do whatever. So I know it's not just, hey, can we just not have it locked? But having a way to do this, so maybe we can order our file system locks above the page lock, especially because our range locking is kind of problematic. It makes it annoying because we have to do allocations to do the range locking.

Are your file system locks sleeping locks? Yeah, they're sleeping locks. This was totally not on my radar, because I don't use btrfs myself, so I wasn't aware that this was a specific problem you guys had. So yeah, let's talk about that afterwards, and I'll put it into my pot of things to think about. Yeah, perfect, thanks.

So Josef, actually, are you talking about the extent locks? Yeah, yeah, yeah. So technically, if you take the extent locks in iomap_begin, or, in the case of writeback, during map_blocks; and if you have the submit_io hook, you can do it when the end_io is called. Right, but what I'm more concerned about is releasepage and invalidatepage, where you get called with the page locked, and then we have to lock the range to make sure there's no IO outstanding. Yeah, so if you have something like releasepage you can always... So, I tracked down releasepage and invalidatepage, and it was primarily for truncate operations, and I've put it around that, but if there is anything else, I should probably know about it. Yeah, well, I think the MM will call us, like, for direct reclaim, right? The MM people will want to chime in here. For direct reclaim, they'll call and say, hey, chuck this page right now, and then we have to tell it, yeah we can, or no we can't, and we have to take the lock for that, and that's outside of the truncate path.

This is one of my kind of pet peeves. Not a pet peeve; it's not anybody's fault, right? It's just the way it's designed: the MM wants to do something, and it has to tell the file system to do it, and we don't have a real great way to handle that, and you have to understand the intricacies of how the whole system fits together. I've found this particularly problematic as I'm training new people to work on btrfs. They're like, why does it work like this? Well, sit down, do you have an hour? You explain it, and it's like a huge gotcha, and I would love to make that interface a lot simpler.

Yeah, I don't think these problems are specific to btrfs. I think ext4 and ext3 had a lot of problems with ordered IO versus using the page cache versus using buffer heads, and a lot of that work was fixing this locking cycle. So it's all of us, in different ways. I mean, you could always take the same approach that XFS did and delete your writepage operation. Yeah, that's becoming increasingly attractive.
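A hedged sketch of the ordering problem Josef described a moment ago: ->releasepage is entered with the page already locked, so a file system whose own locks normally nest outside the page lock can only trylock and refuse. Everything named "myfs" here is invented:

    struct myfs_inode_info {
            struct mutex    io_lock;        /* normally taken before page locks */
            struct inode    vfs_inode;
    };
    #define MYFS_I(inode) container_of(inode, struct myfs_inode_info, vfs_inode)

    static int myfs_releasepage(struct page *page, gfp_t gfp)
    {
            struct myfs_inode_info *mi = MYFS_I(page->mapping->host);

            /* The caller (reclaim, truncate) holds the page lock; taking
             * io_lock unconditionally would invert our usual lock order,
             * so we may only trylock, and must be able to say no. */
            if (!mutex_trylock(&mi->io_lock))
                    return 0;               /* not released, MM must move on */

            /* ... check that no IO is outstanding against this range ... */

            mutex_unlock(&mi->io_lock);
            return 1;                       /* safe for the MM to free the page */
    }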
So, like, the other... we're gonna get off on a tangent here, so stop me when you want. Hey, I'm not supposed to be starting till 9:30, so we've still got seven minutes of bonus time. Perfect.

So, the way page reclaim happens is really kind of problematic, especially for systems under low memory pressure. We see this all the time with XFS: XFS has the shrinker thing, and that's where they start writing pages, right? So you have some low-level memory pressure, and then it just stalls the entire system, because it decides it's gonna write out some pages. Because we don't have a great interface to say, hey, file system, maybe write some stuff back, other than balance_dirty_pages, right? There's not a really great interface for page reclaim, like, hey, give me what you can right now, and also, hey, write back so you can reclaim. Because it's all based around this idea that the MM tells us, we want this page specifically, right now. And maybe we could rethink that for file systems. Yeah.

I mean, honestly, this part of the MM is something that I'm very much still learning, because it wasn't important to me until just recently, when I started trying to convert it to use folios, and it was kind of like, hang on a minute, why are we doing this? Like, how much of this even still makes sense anymore? I think back when we had less capable file systems, sure, it was fine, right? Because we're in reclaim, everything is going to suck anyway. And then I started talking with Dave Chinner about what XFS was doing and why XFS got rid of its writepage operation, and it's like, well, competent file systems don't do that anymore. Competent file systems don't let themselves get into a situation where the MM is saying, I need this page written back right now. A competent file system will have all the spindles available to it running at maximum speed, doing giant writebacks of the stuff that the file system knows makes sense to write back. And if a page happens to get to the end of the LRU and it's still dirty, the MM needs to move on, put that page back on the LRU somewhere, and try to reclaim the next page off the LRU. Because there is nothing the file system can do: the spindles are already running at full speed, creating new clean pages.

Yeah, looking at it from the file system point of view, it's not just one page; it's almost always a range of pages. Whenever you write in btrfs, you kind of write the entire extent. So if you need to send information saying, okay, this page needs to be written, it should probably be a range of pages rather than just one single page. Well, from the MM's point of view, it doesn't know that. All it's got is its LRU, and it's pulled the last page off the list and said, oh, this one's still dirty; go write it back for me. And I think it just needs to stop doing that. The problem is that we still have less competent file systems in the tree, right? But I suspect that for a competent file system, and the three file systems who've been speaking are competent ones, just delete your writepage operation and see what happens, right? I mean, if it blows up, I think that's also very important information. But yeah, delete your writepage operation and see what happens.

Yeah, I removed it from AFS, and it still seems to work. That hasn't merged upstream yet, but I've taken it out of my own tree and it still seems to work fine. 9p is a bit harder, because some of them only have a writepage operation, so those need converting to writepages first. But with the network file systems, I'm gonna try and get rid of all of that from the file system anyway. I'm gonna talk about that later. That sounds fantastic, Dave. Good job.
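The "delete your writepage" experiment is literally just leaving the hook out, as in this invented example. With no ->writepage, reclaim has to leave dirty pages for the flusher threads, and all writeback flows through ->writepages, where the file system can batch sensibly; XFS already ships without a ->writepage:

    static void myfs_readahead(struct readahead_control *rac);
    static int myfs_writepages(struct address_space *mapping,
                               struct writeback_control *wbc);

    static const struct address_space_operations myfs_aops = {
            .readahead      = myfs_readahead,
            .writepages     = myfs_writepages,
            .dirty_folio    = filemap_dirty_folio,
            /* no .writepage: vmscan cannot demand single-page writeback */
    };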
It seems to me that there are multiple problems that the MM is trying to solve when we talk about writepage, right? There's the global memory pressure problem, where I just need free memory, tell me where I can get it, and I'm not particularly picky. There's the cgroup memory pressure problem, where we're trying to relieve memory pressure on a particular container. And then we've got the compaction problem, where we're trying to free a specific page, because we're trying to reassemble a two-meg huge page. And maybe we need to think about different mechanisms for those different scenarios, as opposed to assuming that one solution is gonna fix them all. And I suspect if we actually tried to explore all of those, we're gonna rathole, so maybe now is not the time to do that. But I just wanna point out that removing writepage might solve the global dirty memory problem; there are other reasons why I suspect the MM might want to say, I'd like you to clean that page.

But you could call writepages with a range containing that page. Well, the difference between writepage and writepages is that with writepage, the MM has already locked the page. And that's been proving a problem for me to try and get around, so what I'd love is to just get rid of it. I mean, one of the things a lot of the more advanced file systems did... I think we were taking writepage and just implementing it in terms of writepages, and writepages would be free to write more pages than were actually requested by the MM, which sort of addresses that problem. And then we basically dealt with the locking inversion problems to handle that particular situation. Yeah. So, one of the scenarios I've seen is, I need to write the page before the one that's locked by the VM. And that has all sorts of nasty problems, where another bit of the VM is saying, oh, can you write that page? And I've got a deadlock situation that I can't easily get around. But just getting rid of writepage works for that.

I do think Ted highlighted a lot of the right ideas, in terms of peeling out exactly why the memory management subsystem wants a particular page written. Because the huge page case and the container case, those are gonna be a lot harder to solve. Though for the container one, I would hope the containers are shuffling memory around, and all the same optimizations we do globally tend to work, as long as your containers are fairly large. But the huge page case will be different. So we should probably break it out. Yeah, yeah, I think that's good.
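Ted's "call writepages with a range containing that page" could be shaped like the sketch below (invented function; the writeback_control fields and the ->writepages hook are real). The catch raised above is the locking: unlike ->writepage, this cannot be entered with the target page already locked:

    static int write_range_around(struct address_space *mapping, pgoff_t index)
    {
            struct writeback_control wbc = {
                    .sync_mode      = WB_SYNC_NONE,
                    .nr_to_write    = LONG_MAX,
                    /* widen the one-page request to a 256-page window so
                     * the file system can write out a whole extent */
                    .range_start    = (loff_t)(index & ~255UL) << PAGE_SHIFT,
                    .range_end      = (((loff_t)(index | 255UL) + 1)
                                       << PAGE_SHIFT) - 1,
            };

            return mapping->a_ops->writepages(mapping, &wbc);
    }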
Yeah, I did want to touch on the fragmentation and compaction angle to all this, which is that if we are using larger pages, then there are actually going to be a lot fewer of the problems that we currently have, because it's much easier to find 16 contiguous order-five pages than it is to find 512 contiguous order-zero pages. So some things are gonna get a lot easier with this. Some people have been sending me some interesting benchmarks that they've been doing, and clearly they're not particularly representative workloads, but I do see reductions, like by a factor of 1,000, in the number of pages on the LRU list, which is just insane. I mean, it's great to see, but that's at least twice as good as I thought the maximum was going to be. How on earth did that happen? I don't know, maybe I just screwed up the statistics. That's been known to happen.

Now, I do know about some of the problems that we are going to see. One is that we now only track dirtiness on a folio basis. We don't track it on an individual page. So if you have a workload that reads in 128 kilobytes, we're going to create an order-five folio, and then if we write to one byte in it, that's going to mark the entire 128 kilobytes as dirty. And so that means we're going to see an increase in write bandwidth. But on the other hand, particularly for a copy-on-write file system, that will reduce fragmentation, because all of a sudden you're writing out a 128-kilobyte blob and updating the pointers to that, rather than writing out a four-kilobyte blob. I don't think it's going to be a serious concern. I haven't seen any workloads that significantly suffer as a result.
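The granularity issue in one snippet: dirtiness lives in the folio, so dirtying one byte dirties all of it. folio_mark_dirty() and kmap_local_folio() are real 5.18-era interfaces; the wrapper is invented:

    static void write_one_byte(struct folio *folio, size_t offset, char c)
    {
            char *p = kmap_local_folio(folio, offset);

            *p = c;
            kunmap_local(p);

            /* No per-page or per-block dirty state: the whole folio,
             * which may be 128K, is now dirty and will be written back
             * in its entirety. */
            folio_mark_dirty(folio);
    }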
But, Chris? I know Jens has put a lot of time into making it possible to do much smaller granularity of writes through io_uring, specifically to resolve the memory bus utilization concerns of even writing 4K or 8K or 16K. Yeah, that's mostly been on the read side, and Keith actually posted patches for this very recently, so I'll pass it to Keith. Delegation, isn't it grand? Yeah, I did post the patches, but I never saw them hit the list, so I have a configuration problem; nobody saw them unless you were CC'd directly. So, on the read side it's a bit protocol-specific whether you can do sub-sector reads. I'm focusing, of course, on NVMe, and that's using the bit buckets in order to do partial-sector reads, or reads straddling two sectors. So this is going in the other direction: where folios are getting bigger, this is going smaller. So I'm not sure if this is a good spot to talk about it, but yeah. I think we want the big pages and the little IOs.

Well, do we? I mean, are we giving up a huge amount here? By saying, well, we've got this one 128-kilobyte page, potentially it might absorb more writes into it before it gets written back. So you've got fewer opportunities to create a checkerboard pattern in your file system layout. It's certainly not definitely a bad thing. I don't disagree with you; I just know we have workloads that are massively memory-bus constrained. And so writing 128K when we need something very small will be significant. Do we have any stats, Chris? I wonder, even for 4K pages, how much of the 4K we're writing out do we actually dirty? I could easily see that even for medium-sized pages, you're going to be writing a substantial amount of data over the bus that you don't really need to. So in btrfs, I would have been taking Matthew's position of, just write it, who cares. So I don't know; it's going to be very workload-specific, I think.

I guess the big question is, what happens if you do a kernel compile? Because I know someone who cares about that a lot. On my laptop, the kernel time goes down by like six seconds for a kernel build, but the overall time isn't particularly affected. No, but it's probably a trade-off, right? You're saving some CPU, but you're using a lot more bus bandwidth. And as Chris said, some applications are bound by that currently. So if you add 25% extra bus bandwidth usage, you're going to be running 25% slower. I don't think I am, because think about what a kernel compile does. It does a lot of reads, and it always reads the entire file, because you've got to read the entire header file. And then it does writes, but the writes don't create large folios, because it's always writing to new files. And it will read back what it already wrote, but that's probably still in cache, and then it's creating a vmlinux at the end of it. I don't know; it all sounds to me very much like it shouldn't really matter. And for a kernel compile, most of the files are being created, written front to back, and closed, so it really doesn't make any difference in that situation. Yeah.

In any case, I think it'd be nice to have some statistics or a metric for this, so we know, because for some people it will definitely make a big difference, if we're overwriting. And maybe this is another place that iomap needs to be enhanced. Maybe iomap needs to keep track, on a sub-page, sub-folio level, of exactly which blocks are dirty. Isn't it doing that already? Because I saw it writing partial bios, or at least offsets and lengths. Yeah, okay. So iomap does already have an optimization, which is that if you are doing a write which is aligned to, and a multiple of, the block size, into a page which is not uptodate, then it will not read the parts of the page which are not being written to. It won't do any reads. It will just say, okay, we've allocated a page, it's not uptodate, but these parts of the page are dirty. And so then when that page is synced, it will do a partial page write. So that's what you've probably seen. I don't know that it's a particularly common thing for users to actually do, but it is a particularly common thing for benchmarks to do, so I think that's why the optimization is there. But I mean, it's software; we can change it, right? If it makes a lot of sense to actually track the dirtiness on a per-block basis, even if the page is uptodate, then let's do that. We should just do that.

There's probably some code to be written around what happens if the file is mmapped and we take a page fault. If somebody does a store, then we need to make a decision about whether we're going to make all the PTEs that cover that folio writable and dirty the entire thing, or whether we're going to just flip that one PTE to be writable and make that one page dirty. And that's kind of its own set of interesting trade-offs, because some CPUs... actually, no, AMD have published this; it's in the AMD public documentation. It's something like 16K or 32K: if the range is aligned and the PTEs are all compatible, including the writable bit, it will use a single large TLB entry instead of using a 4K TLB entry. I don't know how much that optimization is really worth. It's not something Linux actually supports today, and I was talking to them about this, because it's something that folios now enable. Pages actually end up naturally, coincidentally in order more often than you might think, but folios make it much more likely, and they certainly make it easier to detect that we're in a situation where that could be of use. On the other hand, perhaps if somebody's doing an mmapped store, we would like to keep the dirtiness to just that one page. That sounds like something that we would need to benchmark across a wide variety of workloads and a wide variety of CPUs, to figure out which optimization is worth more. But now it's a discussion we can have. We're in a situation where we can choose which of those optimizations we want, which is fantastic.
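The alignment test at the heart of that iomap optimization reduces to something like this sketch (invented helper; the real logic lives in iomap's buffered write path):

    static bool write_needs_read_first(loff_t pos, size_t len,
                                       unsigned int block_size)
    {
            /* A write that starts and ends on block boundaries fully
             * overwrites every block it touches, so nothing needs to be
             * read in first; any sub-block head or tail does. */
            return (pos | len) & (block_size - 1);
    }

So an 8K write at offset 4096 with 4K blocks skips the read entirely, while a 100-byte write at offset 10 forces one.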
I think getting some metrics on this is a good idea, by the way, because you have wildly different file systems, some with long latencies, some with short latencies, right? If you're going over a network, or you have a long-latency file system, you want larger IOs in parallel, because the round-trip cost is too expensive. So how do you measure that? Well, file systems have their own stats: NFS, SMB, all of this. But looking at how many IOs are in parallel, and their average size, is objective, and you can start with the boring workloads. Your kernel compile is a good boring workload; copying a file is a good boring workload. But, stepping back a lot, how would we tweak folios to try different strategies? I don't know; it's not that easy. There are ways: we could have two different patches that we could apply, or a /proc setting, to experiment a little bit, because I don't think we're gonna get it right for a while. I mean, we'll get it better, but getting it right... the trade-off of sending too much data, versus sending data, waiting 10 milliseconds for a network round trip, sending data, waiting 10... you know, that's bad, right? We have to figure out what's the best way to do this.

And I guess the last question I would have on this is, some of this is confusing because we haven't all looked at the MM code. Are there some sample change sets, some sample patches, that are examples of the boring, easy stuff that we as file system maintainers can do? Sort of like the cleanups when people converted to the XArray. Give me an example, and then it'll be like, oh, I can do that. Because I think some of these are just cookie-cutter, repetitive kinds of changes, and if we could just give some examples, that would be a fantastic help.

Yeah, so to that last point, see the patches that I've been posting for the last week or two. What I've been doing for file systems is converting all of the address space operations. There are two left that I need to do, which are migratepage and writepage. But other than that, all the address space operations now take a folio instead of a struct page as their first argument. And for most file systems, what I've done as the very first line is: struct page *page = &folio->page. And anywhere that you see &folio->page, that is a bad code smell, right? That indicates somewhere that needs to get cleaned up. It's not that the code is bad; it's just that nobody's got around to cleaning it up yet. But to be absolutely clear, my intention is that every file system will be converted to using folios. It doesn't necessarily have to support large folios; it may not make sense for some file systems to support large folios, but they should all be using the folio interfaces. And the reason for that is that I am looking to get rid of that first chunk of the union in struct page. So the LRU, the mapping, the index, the private: that is all scheduled to go away from struct page.
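The transitional pattern looks like this for a made-up file system; .dirty_folio is the real 5.18 operation, and the internal helper stands in for the still-unconverted code:

    static bool myfs_set_page_dirty_internal(struct address_space *mapping,
                                             struct page *page);

    static bool myfs_dirty_folio(struct address_space *mapping,
                                 struct folio *folio)
    {
            /* Bad code smell, on purpose: flags the internals that
             * still need converting from pages to folios. */
            struct page *page = &folio->page;

            return myfs_set_page_dirty_internal(mapping, page);
    }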
And this kind of gets into something that I had very long discussions with Kent and Johannes about: what is the future of struct page? I know some of the file system people aren't going to care about all this, but you're sitting here, so you're gonna hear about it. From a memory management point of view, there's a whole bunch of stuff that we want to put into struct page. But keeping it to 64 bytes is incredibly important for performance right now; when it starts to exceed a cache line, it gets expensive, really expensive. So what we would like to do eventually is shrink struct page. Kent says we can get down to two words; I say we can get down to one word. You know, the details aren't important, but if we can shrink struct page to just be a pointer to the struct folio, or a pointer to a bunch of other things which are currently all these various different unions... really, they're their own types, and they should be treated as their own types. And the reason they're not is, of course, that it's easier to just dump another thing onto the top of this than it is to refactor the entire world. So here I am, refactoring the entire world.

So yeah, this is kind of what you need to understand; this is the high-level overview of struct page. There's a per-purpose union, well, two per-purpose unions, there's a refcount, there's the flags, and there's some other gunk. And I think we can get rid of basically all of it, and just have, per four kilobytes, a struct page which is just a pointer to the real data structure. And then, when you go and allocate an order-three page, instead of having eight times 64 bytes, you have eight times eight bytes. And everything becomes less special, and you actually end up using a lot less memory. And that was Johannes's big point. I hadn't really been thinking in this direction at all, because I didn't think it was relevant, but he convinced me: no, this is really important. There are essentially giant pallets of money being set alight in parking lots every month, because people have to buy extra RAM in order to provision various services. And if we can get back the 1.6% of memory that the memmap occupies, the entire world will be a better place. Or at least our employers will make more money, which, well, perhaps is the same thing. They'll be able to afford to send us to more conferences. Yes, that's how it benefits us.
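None of this exists yet, but the end state being argued over looks roughly like the hypothetical layout below: one word of memmap per 4K chunk, a tagged pointer to the real descriptor, so an order-3 allocation costs eight 8-byte words instead of eight 64-byte struct pages:

    /* Hypothetical future layout, not current kernel code. */
    struct page {
            unsigned long memdesc;  /* points to a folio, slab, or other
                                     * descriptor; low bits say which */
    };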
One thing I'd like to get rid of is write_begin and write_end; can we do writes properly without them? Yeah, write_begin and write_end. Okay, so those are two address space operations, for the memory management people; I think the file system people all know about write_begin and write_end. My impression was that write_begin and write_end were designed for ext3's model, and then everyone else kind of squeezed themselves into ext3's model. Perhaps it's time to overhaul how we do write_begin and write_end. Yeah, if you're using iomap, it doesn't use write_begin and write_end; it's basically a different function altogether, map_blocks. Yeah, iomap doesn't use write_begin and write_end, and neither does netfs. Well, when I finish writing it, netfs won't use them either; so all the network file systems that use netfs won't be using those anymore. So now we just need all the block file systems to stop using write_begin and write_end, and then we can delete them and shrink the address space operations.

So, write_begin and write_end are very much the old pass-a-bunch-of-callbacks-back-and-forth model. And this is actually my big complaint with iomap: it's a much cleaner version of that old model, but we're still passing callbacks back and forth. And I'd like to nudge people to think more about how we pass data structures back and forth instead. My big thing with the bio iterator rework, and bio_iov_iter_get_pages as it's now called, is that we can now, from, say, the top of the direct IO path, start by allocating a bio, pinning the user pages directly to it, and then passing that down, instead of calling back and forth through indirect function calls, which we want to get away from anyway because of Spectre, both up and down the stack.

Yeah, I'm also not a fan of exactly how iomap does it, but at least iomap is only doing it to itself, and it hasn't infected the entire VFS. So yeah, I think there are opportunities for improvement. And actually, when I get round to converting that path to doing folios properly, rather than just coping with folios, I think you'll like where I'm going with it, but I'm not quite ready to show the code. I've written this and abandoned it three times already, so the next time it will work, I'm sure.
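The pattern Kent is describing, sketched with real 5.18-era block layer calls (the surrounding function is invented, error paths trimmed): build the bio at the top of the direct IO path, pin the user pages straight into it, and hand the structure down, no callbacks:

    static struct bio *dio_bio_from_iter(struct block_device *bdev, loff_t pos,
                                         struct iov_iter *iter)
    {
            struct bio *bio;
            int ret;

            bio = bio_alloc(bdev, bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS),
                            REQ_OP_READ, GFP_KERNEL);
            bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;

            ret = bio_iov_iter_get_pages(bio, iter);    /* pins user pages */
            if (ret) {
                    bio_put(bio);
                    return ERR_PTR(ret);
            }
            return bio;             /* pass the data structure down */
    }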
And you know, I think it's important to note here that, as a file system person, I could not give less of a shit how the interface works, right? I just want to focus on btrfs, and I really, really don't want to have to wander out of it and understand how anything else works. And so, I know we like to get into these kinds of bikeshedding conversations, but take it from the guy who's got to do a lot of this work: I really don't care, guys. Just pick a thing and do it, tell me how to use it, and I'll do it. Like, I'm with you, Kent; it offends my taste a little bit to pass callbacks through. But in the end, I don't care enough to argue with anybody, and I don't care enough to argue for one interface. You tell me the interface that you want me to use, and I'm just gonna make it work. That's all that's gonna happen. So I think these kinds of conversations need to happen with people that actually care about that stuff. Like, I know Darrick should be here. I think it's valuable to have those kinds of conversations, but I just want to make it clear, from the file system, at least my file system, perspective: God, I don't care. I just want it to work, man. You tell me what to do. If it looks ugly, I don't give a shit. I didn't have to write it, so it's fine with me.

You know, I love file systems. I've written three. Fortunately, none of them have escaped into the wild, so I don't have to support any of them. But everyone should write a file system, or multiple file systems. It's the most fun you can have in programming, which is why I want to make file systems easier to write, and everything I do that touches the VFS is in terms of: how can we make file systems easier to write? How can we shrink the interface so that file system authors don't need to learn about all of this complicated stuff? And I hope I'm succeeding with folios. I really do. I hope I'm giving you an interface where, unless you're talking about the page fault path, there should be nothing in the file system that cares about pages. I'm just saying: here is a folio, which is a pointer to a bunch of memory. I would like you to fill it with the current data. Or: it's dirty, I would like you to get it to disk, please. I'm really trying to make it as easy as possible for file system authors, because I think file systems are awesome. Yeah, I can't agree enough with that. What you just said, that was it.

Absolutely. As file system guys, our life is miserable anyway, with thousands of different weird problems that have nothing to do with memory management. So from my perspective, I would just say: I would love an async write interface, and a synchronous one if you prefer; you can have both. And I'll tell you my preferred IO sizes, to give you a clue. And we have to be able to throttle, right? The file system has to know when it needs to throttle back on reads and writes, readahead, write-behind, that kind of thing. So, some measure of: we can have more IO, we can have less IO, we need to throttle, we don't need to throttle. Beyond that, I don't care; this is all memory management. So, while you were talking, I was grepping for migratepage, trying to remind myself what my migratepage did.

Maybe you guys should care a little bit more, though. Because we're in a state where a lot of these core kernel interfaces have kind of rotted and not changed much over the past ten years, and become really painful for everyone to work on. And they haven't had any clear owner, until Matthew came along and started redoing a lot of this stuff. And maybe if we talked a little bit more about this stuff, about what our pain points are, and took our experiences to figure out, okay, what is the clear model for it, what do we want, then maybe we could share a little bit more code and have it all be a little bit less painful.

Well, I'll toss out my example. I don't know if iomap still does it page-at-a-time internally, copying the data out of the user's pages into the kernel. It does, yeah. It does still. So, btrfs not using write_begin and write_end: that was partially me going, well, it's stupid to do it a page at a time; they're giving us a bunch of pages, I want to do it on a bunch of pages. And the number of bugs I introduced in that code was unbelievable, right? It was exceptionally hard to get those specifics right. And so if we're talking about making interfaces that work well with folios, one thing you can do to make it much, much better is make some framework for safely doing the copy from userspace. I stole that. Yeah. It's so bad. It took me so many tries. Yeah, and thank you, because having the btrfs code to work off of made my life so much easier; my buffered write path looks basically like yours. And maybe we should collaborate and promote some of this stuff, so that other people following down the same path can benefit from our experiences. Yeah, well, I would love to just pile it into iomap. I don't want to be special. I just want it not a page at a time.
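For reference, the batched copy both of them ended up writing sits on helpers like copy_page_from_iter(), which in recent kernels can cross page boundaries inside a compound page. A sketch of the shape only; the hard parts btrfs solved, the fault-in and short-copy retry without deadlocking on the page lock, are elided:

    static size_t copy_into_folio(struct folio *folio, size_t offset,
                                  size_t bytes, struct iov_iter *iter)
    {
            /* One call can fill any amount of a large folio, instead of
             * looping over 4K pages with kmap/copy/kunmap each time. */
            return copy_page_from_iter(&folio->page, offset, bytes, iter);
    }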
Yeah, and to your point, Willy, this is a really good point: there's a lot of this stuff that just isn't owned. I think it's really great that we have Darrick as the iomap maintainer, right? We have to have people that care about this stuff, because I am constantly at the edge of burnout from the massive number of things that I've got to keep track of. Adding more and more and more things that I've got to keep track of is not a recipe for success. And, you know, I'm so fucking happy that you are doing this, right? Because there's an owner. And we can argue and we can complain and we can bikeshed and stuff, but at the end of the day, it's important to have somebody that actually cares about this code, cares about these interfaces, to make sure that they are well maintained, that they're keeping up and keeping track. And that is hard work. It's hard work, and it's oftentimes thankless work. So, I mean, thank you, Matthew, because it's not going unnoticed, right? We understand that it's fucking hard and somebody's gotta do it, so we're thankful it's you and not us. And this, like Kent's point, goes for everything. iomap is great; Darrick's in charge of that. And this kind of goes for all of these things that we just sit around and kind of do ourselves. We really need to put an effort into the community: not only do we need somebody in charge of this, but we need to make sure that they don't hate their lives trying to make this stuff work.

Thank you, Josef. I really appreciate that, because one of the things I'm conscious of with folios is that I am imposing a cost on all of you. I'm asking you all to retrain your fingers to no longer type PageDirty, but folio_test_dirty, right? That's a cost, and I'm imposing it on everybody. And I feel the weight of that, and some people have made that point to me, some in more polite ways than others. So it's really important for me to hear that you feel the cost, but you also see the benefits, that you're willing to pay that cost. So thank you, Josef. Okay, that's my time. Thank you all so much. As you know, I'll be around for the rest of the conference, and I'm sure some of you are going to grab me in the breaks. Thank you.
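For reference, the finger retraining in question; all of these names are real interfaces:

    /*
     * Old page world                      New folio world
     * PageDirty(page)                 ->  folio_test_dirty(folio)
     * set_page_dirty(page)            ->  folio_mark_dirty(folio)
     * SetPageUptodate(page)           ->  folio_mark_uptodate(folio)
     * lock_page(page)                 ->  folio_lock(folio)
     * get_page(page)/put_page(page)   ->  folio_get(folio)/folio_put(folio)
     */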