Right, so: sunsetting buffer heads. Everyone seems agreed that buffer heads were a good idea once, but their usefulness has outlived itself. So we want to get rid of them; the question is how. The problem is that buffer heads allow for nice sector-size I/O, that is, I/O at the sector size of the underlying block device, and this is being used by quite a lot, one could say nearly all, of the file systems which have not been converted yet. So what can we do there? Essentially there are two roads open, which I'm seeing currently: one is to replace them entirely with folios, and the other is to convert buffer heads themselves to folios. I'm working on a conversion of the FAT file system to switch over to iomap and use folios directly internally, and it seems to be going reasonably well. So anyway, what would happen if we replaced them with folios? Initially, folios can only do page-size I/O as it stands now, meaning yes, we will get the same data we used to get, but we will even get more than that, namely not only the sector we looked for but also the sectors adjacent to it. The question is: does it matter? Because it's a file system, and file systems are tightly packed, chances are that we will be reading the very next block very soon afterwards. So does it really matter? Plus, I/O performance really isn't an issue there, and I would wager it is virtually impossible to measure whether we read 512 bytes or 4k. So personally I would think, freaking hell, it doesn't matter.
Just do it. And moving to folios would have the nice advantage that we'd be using the page cache directly, so all this weird buffering the buffer heads currently do internally, essentially duplicating things the page cache already does, can just plainly be removed and be done with. That would be good.

So the read side, so far, is easy. Not so easy is the write side. Because we're doing sector-size I/O, we sort of assume we will just write this one sector. When using folios, that's actually not quite true, because unless we do something special we will be writing the entire folio, which is more than a sector. Thankfully, there's a patch set from Ritesh allowing for sub-page dirty tracking of folio writes, and I have a patch that extends it to essentially just mark the blocks which we actually wrote within that folio as dirty, and then it'll work. Mind you, this is my first venture into file system territory, so I may just be assuming things will work. So that is the approach I'm taking currently. Is that feasible, or does anyone see any issues with it?

Grab your own mic. Yeah, unfortunately, one of the things that will be a sticking point is the block device cache, the bdev cache, the bdev file system.
So we have a simple file system that basically is used for partition scanning, and that also uses buffer heads for metadata. So file systems that use buffer heads for metadata will have to keep using that unless they port over to something else, and it doesn't seem like it would be worth the risk for those file systems to convert to something different for metadata. I doubt that would happen.

Well, the thing is that it does not need to be a full conversion at this time, because Christoph did a patch set to compile out buffer heads, so you can run your entire system without buffer heads.

Yes, it's certainly possible. It's more a question of file systems that are using certain APIs the buffer head layer provides for metadata I/O rather than the data path. Essentially, unless they're willing to move that metadata API over to something else that doesn't rely on it...

That was the very next paragraph I was coming to, because who would actually do that work, and would the file system developers be willing to move over? Right. Okay, so we are trying to move to a new API, namely replacing buffer heads with folios, and I guess we sort of all agree that this is something we want to do.

Yeah, so I actually think we want to replace buffer heads with something else, which is going to be attached to a folio instead of a page, but I believe there has to be some intermediate layer. You are right that buffer heads are ancient by this time, and probably what we need today is very much different from what they provide, but Luis is also right that there is some service that file systems still need.

Yes, I fully agree with that. We do need essentially a drop-in replacement for at least sb_bread.
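To make that service concrete: sb_bread() takes a block number and hands back a cached, already-read buffer. A minimal userspace sketch of that contract, under the assumption that caching plus one-time read is the essential behavior a replacement has to keep (the names and the flat-array cache below are invented for illustration; only sb_bread() itself is the real kernel API being discussed):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 512u
#define NR_BLOCKS   16u

/* Stand-in for the device: NR_BLOCKS blocks of BLOCK_SIZE bytes each. */
static unsigned char disk[NR_BLOCKS][BLOCK_SIZE];

struct cached_block {
	int valid;
	unsigned char data[BLOCK_SIZE];
};

static struct cached_block cache[NR_BLOCKS];
static unsigned int reads_issued;   /* counts real "I/O" operations */

/* sb_bread()-like: read the block if not cached, return the cached copy. */
static unsigned char *bread_model(unsigned long blocknr)
{
	struct cached_block *cb;

	if (blocknr >= NR_BLOCKS)
		return NULL;
	cb = &cache[blocknr];
	if (!cb->valid) {
		memcpy(cb->data, disk[blocknr], BLOCK_SIZE);  /* the "I/O" */
		reads_issued++;
		cb->valid = 1;
	}
	return cb->data;
}
```

Repeated lookups of the same block return the same cached copy without further I/O, which is the behavior metadata users rely on.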
Yes. So it's not only about reading blocks without having to care about the size of the folio. There is also stuff like tracking the association of metadata blocks to inodes; that's also something the buffer head layer does, and file systems use it. Simple file systems like FAT, but also ext2 and UDF and others, use the possibility to associate a buffer head with an inode; that's one of the darker corners of buffer heads. And then on fsync they use this list of metadata blocks to do the writeback.

Thank you, that was my question: do we need to care? Because essentially what we do is just dirty individual folios, and when we do the conversion we have to convert over to writepages, so we can't write individual pages, at least to my understanding. I might be completely wrong again; it's my first venture into file system territory, so everything I say here might be completely off.

So can I go back to your write amplification comments? Yes, please. Your assumption is that we're basically on a file system that has a large block size, and obviously we have an underlying smaller write size; that's what causes the amplification. But if the file system already knows that it's got a large block size, it has to be efficient about filling that block for a write. Do we need to care about write amplification, as in: is sub-page dirty sector tracking really worth it?
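For concreteness, the sub-page dirty tracking being asked about boils down to keeping one dirty bit per block inside each folio, so writeback can skip the blocks that were never touched. A minimal userspace sketch, assuming a 4k folio over 512-byte blocks (all names are invented; this is the shape of the idea, not Ritesh's actual patch set):

```c
#include <assert.h>
#include <stdbool.h>

#define FOLIO_SIZE       4096u
#define BLOCK_SIZE        512u
#define BLOCKS_PER_FOLIO (FOLIO_SIZE / BLOCK_SIZE)   /* 8 */

struct folio_state {
	unsigned long dirty_bitmap;   /* bit n set == block n is dirty */
};

/* Mark only the blocks covered by [off, off+len) dirty; len > 0. */
static void mark_range_dirty(struct folio_state *fs,
			     unsigned int off, unsigned int len)
{
	unsigned int first = off / BLOCK_SIZE;
	unsigned int last = (off + len - 1) / BLOCK_SIZE;

	for (unsigned int b = first; b <= last; b++)
		fs->dirty_bitmap |= 1ul << b;
}

static bool block_dirty(const struct folio_state *fs, unsigned int b)
{
	return fs->dirty_bitmap & (1ul << b);
}

/* Bytes writeback must issue: only the dirty blocks, not the folio. */
static unsigned int writeback_bytes(const struct folio_state *fs)
{
	unsigned int bytes = 0;

	for (unsigned int b = 0; b < BLOCKS_PER_FOLIO; b++)
		if (block_dirty(fs, b))
			bytes += BLOCK_SIZE;
	return bytes;
}
```

Without the bitmap, the only safe answer is to write all FOLIO_SIZE bytes; with it, a single dirtied sector costs a single sector of writeback.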
Well, sorry, that alone is not the whole problem of write amplification. Write amplification can also happen when you have a larger page size and a smaller block size, because the folio tracks dirtiness at the page size. If you try to write a 64k folio where you intend to write only one block of a file...

Yeah, but the point is, if the file system already knows what it's writing into, presumably it's trying as hard as it can to actually fill the whole block. All we have to do is wait for it to finish filling its whole large block and then write the whole lot back, unless the file system layout is geared towards 512-byte blocks. Let's call them FAT.

Yeah, so I think if you modify one byte, a file system does block I/O: it's going to rewrite the single block containing the modified byte. The issue here is the relation between the block size and the page size: if you can pack multiple blocks into a page, you have write amplification.

Yes, but it cannot always avoid that, because some file systems like FAT work with 512-byte blocks. Exactly, that's the point: in that case your block size is 512, the page is 4k, you have eight blocks in the page, and you have to track that. Exactly, you have to track that. You should actually look at Ritesh's patch set, because he lays it all out pretty well in the cover letter. He has, not a benchmark, but an actual workload which suffers greatly from having a 64-kilobyte page size on PowerPC.
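The amplification factor under discussion is just the ratio of the dirtiness-tracking granularity to the block size; a tiny sketch of the arithmetic behind the examples above:

```c
#include <assert.h>

/* Worst-case write amplification when dirtiness is tracked per folio
 * rather than per block: dirtying one block forces the whole folio
 * out, so the factor is folio_size / block_size. */
static unsigned int write_amplification(unsigned int folio_size,
					unsigned int block_size)
{
	return folio_size / block_size;
}
```

So FAT's 512-byte blocks under 4k pages give 8x, and the same blocks under 64k PowerPC pages give 128x, which is why the workload in Ritesh's cover letter suffers so badly there.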
Yeah, so I think the reality is there is a lot of functionality that is carried by the buffer cache code, and different file systems use different subsets of it. Sub-page dirty tracking is just one aspect. And even if you want to say that all file systems using a block size of less than 4k are not worth thinking about: number one, ext4 and ext2 still support 1k block sizes, and we support 2k block sizes for IBM mainframes. But we also need to support architectures with 64k pages where you want to interoperate with a 4k block size, because the file system was originally formatted on an x86 machine. So it's complicated. The other piece of functionality, which Jan has just mentioned, is associating dirty buffers with inodes; that is something some file systems use and others don't. And if you can convert the entire world to iomap, maybe some of that goes away, but that's a pretty big lift. The other one is file systems that use the jbd2 layer, which include ext4 and OCFS2. There's a separable question of how well supported OCFS2 is; apparently there is a maintainer for it. But one of the things that we've certainly looked at is: does it make sense to grab the huge chunk of the buffer cache code that is only needed by jbd2 and essentially move that into the jbd2 layer?
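The dirty-buffer-to-inode association mentioned above (mark_buffer_dirty_inode() and sync_mapping_buffers() in today's kernel) is essentially a per-inode list of dirty metadata buffers that fsync walks. A toy userspace model of that mechanism, with simplified stand-in structures rather than kernel code:

```c
#include <assert.h>
#include <stddef.h>

struct buffer {
	unsigned long blocknr;
	int dirty;
	int on_list;                 /* already on an inode's list? */
	struct buffer *next_assoc;   /* link on the inode's list */
};

struct inode_model {
	struct buffer *assoc_buffers;   /* dirty metadata for this inode */
};

/* Like mark_buffer_dirty_inode(): dirty the buffer and remember it. */
static void mark_dirty_inode_model(struct inode_model *inode,
				   struct buffer *bh)
{
	bh->dirty = 1;
	if (!bh->on_list) {
		bh->on_list = 1;
		bh->next_assoc = inode->assoc_buffers;
		inode->assoc_buffers = bh;
	}
}

/* Like the fsync path: write every associated dirty buffer.
 * Returns the number of blocks "written". */
static int fsync_metadata_model(struct inode_model *inode)
{
	int written = 0;

	for (struct buffer *bh = inode->assoc_buffers; bh;
	     bh = bh->next_assoc) {
		if (bh->dirty) {
			bh->dirty = 0;   /* issue I/O in the real thing */
			written++;
		}
	}
	return written;
}
```

This is the piece a folio-only conversion has to replace somehow: without the list, fsync has no cheap way to find which metadata blocks belong to the inode.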
So instead of buffer heads we'll have... well, we already have a journal head structure that we layer on top of the buffer head, so we could do something like that. One of the additional complexities, which is probably an ext4-specific thing, is that we still have to support user space utilities opening the block device while the file system is mounted, so we have to keep the cache coherent between writes to the block device and writes to, for example, the file system superblock; that is another set of functionality that the buffer cache gives us for free. I think the reality is there is value in replacing the buffer head code simply because it's really, really ancient code, even if there are big chunks of that functionality that will still be needed by some file systems. And the challenge is how we get there. It's going to be a very, very incremental replacement. But I do believe that you can't just replace buffer heads with folios for metadata blocks; at the very minimum there will need to be something, maybe in libfs.c, that is a common layer needed by all the file systems that support fsync. And I don't even know if FAT supports fsync.

Yeah, and I think this is getting at the main problem here: we all talk about hating buffer heads, but in reality every file system manages its metadata in its own special way. ext4 and jbd2 and FAT use buffer heads as this common thing, but btrfs has extent buffers that we layer on top of struct page currently (it will be folios, I promise, Willy), and then XFS has xfs_buf, which is again just this extra bit on top for managing the metadata versus the pages.
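The layering pattern that keeps coming up here, jbd2's journal head hung off the buffer head's b_private, btrfs extent buffers on top of pages, xfs_buf on top of its own pages, is a private per-file-system annotation attached to the lower-level cache object. A simplified illustration of the jbd2 case, with invented struct names rather than the real kernel definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Lower layer: the cache object (buffer head today, folio tomorrow). */
struct buffer_head_model {
	unsigned long blocknr;
	void *private;   /* per-subsystem annotation, e.g. journal head */
};

/* Upper layer: what jbd2 needs to track per journaled buffer. */
struct journal_head_model {
	struct buffer_head_model *bh;   /* back-pointer to lower layer */
	int transaction_id;             /* which transaction owns it */
};

/* Attach the journal head, mirroring how jbd2 uses b_private. */
static void attach_journal_head(struct buffer_head_model *bh,
				struct journal_head_model *jh, int tid)
{
	jh->bh = bh;
	jh->transaction_id = tid;
	bh->private = jh;
}
```

The point of moving this "huge chunk" into jbd2 itself would be that the lower layer no longer needs to know the annotation exists; only the pointer slot survives.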
So I don't think that getting rid of buffer heads is the big huge win that we think it is, and I also don't think that creating a new common thing for all the file systems to use is a good answer either, because, again, unless you put a gun to my head, I'm not going to go through and redo all of my extent buffer stuff just to use something common.

So this is not about forcing anyone. At this time, this is about outlining a way file systems can be converted. That really is what this is about: to give users, maintainers, newbies, whoever, a way. This is how things could look, how we envision they should look; go figure out whether it applies to the file systems you care about.

Yeah, I mean, for those file systems that have something that is not buffer heads, so btrfs, XFS, this won't matter to them. Of course it wouldn't matter, but there are over two dozen simple file systems that are still using buffer heads for metadata, and there needs to be a common support layer for all of those.

And so what I'm saying is, I think the answer is: do all the work inside buffer heads to use folios, and then for everybody that's using buffer heads, nothing happens. They can just pretend the entire world stayed the same while it changes underneath them, and Willy's happy, and we're happy, and everything is golden.

Okay, so I'm going to ask a question about that: do we want to support large folios with buffer heads?

That depends on how we go about it. If we go the way suggested, as in converting buffer heads themselves to use folios, then the answer is obviously yes, of course, because then you would get it automatically.

No, right, so we have a choice: we can choose whether we support single-page folios or multi-page folios with buffer heads.

Why would we see a difference with buffer heads?
Because we have things like MAX_BUF_PER_PAGE-sized arrays on the stack.

Yeah, so my suggestion for that is we keep buffer heads simple, so they only support single-page folios, and if you want the win of multi-page folios, switch to iomap, period. And for the file systems we are talking about, the two dozen simple file systems like VFAT and Amiga FFS and whatever, we don't care about performance.

Well, we don't care about large folio support there. Well, that is not quite true, and that's the reason why I chose FAT, or VFAT, as a conversion: not because I like FAT, in fact quite the contrary. The reason is bloody UEFI, because that actually requires you to have it.

Yeah, but UEFI is barely fucking used; nobody gives a shit, because none of these file systems have gigs of data. At least some distributions are able to boot without VFAT at all, just using systemd and UEFI boot variables. I know Amazon Linux 2023 does that, so it might be an option for distributions to look into.

Okay, I just want to go back to something and put an asterisk by it. I saw a lot of nodding at the suggestion that these old file systems still using buffer heads will only get single-page folio support. I just want to put a little asterisk by that, which is that we do still want to support large LBAs, LBAs larger than the page size. So I want to qualify "single-page folios" to mean the larger of the LBA size or the page size, because for those devices the buffer head will actually represent the entirety of the folio. So they still get to use the same buffer head API, but it happens to represent the entire folio, which is your minimum write size.
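The asterisk amounts to a sizing rule: the folio backing a metadata block must be at least as large as both the page size and the device LBA size, so with LBAs bigger than a page a single buffer head spans the whole folio. A small sketch of that rule (illustrative arithmetic only, not kernel code):

```c
#include <assert.h>

/* Minimum folio size for a block device: the larger of the page size
 * and the device's logical block (LBA) size. */
static unsigned int min_folio_size(unsigned int page_size,
				   unsigned int lba_size)
{
	return lba_size > page_size ? lba_size : page_size;
}

/* How many buffer heads such a folio carries: one per LBA. */
static unsigned int buffers_per_folio(unsigned int page_size,
				      unsigned int lba_size)
{
	return min_folio_size(page_size, lba_size) / lba_size;
}
```

When the LBA size exceeds the page size, buffers_per_folio() collapses to one, which is exactly the "buffer head represents the entire folio" case described above.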
Yeah, block devices are basically already doing conversions from logical sectors to physical sectors. They're never going to drop small sectors from a block device; they tried with 4k logical block devices, ran screaming into the woods, and went back to 512-byte sectors. So we're never really going to find a block device that cannot handle a small logical sector size. They'd all like us to do the writes at the physical sector size, which they'll tell us, and if we can shovel down data that way they'll be happy; but just for backwards compatibility with ancient Windows devices, they're always going to be capable of handling small LBAs, I think.

So we're bleeding into Luis's talk, and I think somebody else had a question.

Yeah, so I think the answer here is: convert buffer heads to folios, single-page for now. If we feel like we need to do something later, we can address it then, and this kind of makes everything better, as well as less painful, for everybody.

Okay, this is good. I've learned I don't need to get rid of the MAX_BUF_PER_PAGE arrays on the stack. Fantastic. Thanks. We will have another session on this, right? So I don't need to go into any more detail. Right, time's up anyway. Thank you.