large block sizes. I think that this picture represents well what folios allow us to get done. I don't know if you guys saw the movie Tetris recently? Anyone? Well, yeah, I think that 3D Tetris is one way to look at what folios enable us to do at the block layer. If you think of it in small little pieces, and you see really small little blocks, then folios allow us to address and work with large blocks, obviously in a very atomic way. That's the goal, at least.

So the large block size effort really is an example of standing on the shoulders of giants. This comes up every single LSFMM for, what, the last 16 years, I think. It has been tackled before; the first patch set was by Christoph Lameter, and Dave Chinner worked on this too. So these are certainly things that folks have been thinking about for years. There is a wiki, and there's documentation there. I don't need to give you the link right now, you can just Google it; I think it's the first entry in Google right now, let me know if that's not right. An example of a technology shift here is the shift from 512-byte to 4k sectors. Well, you know, times change, right? We have different technology reasons for why we're considering embracing large blocks now, but I'm not going to get into that because we have a lot to talk about, so I'm sorry for that.

XFS has supported large block sizes for many years, but that's not the context under which Christoph Lameter first posted his patches. If you look at the mailing list, the title of his patches clearly says large block sizes, and then he goes on to explain how the goal is to support large block sizes while you're on a lower page size. So yes, this was supported for years, and in fact my understanding is that there were products sold on POWER for years using XFS this way, but you needed a 64k page size system. The goal, of course, is to support this on 4k.

I use kdevops for my development, and it provides four NVMe drives by default, so we skip that. If you enable an experimental parameter in kdevops you will get additional block devices, each one a power of two larger. Each represents the LBA format it supports up to, and the default it's formatted with. You can query the LBA formats that are supported this way, and you can format an NVMe drive this way as well (there's a sketch of the query at the end of this section). Yes, you can format the drive, and yes, you can boot, but they won't work. One of the things that was mentioned on the mailing list is that if you actually enable large block devices for NVMe, your system crashes. So we started looking into that, and that's how we delved into this world; this is a status update on that effort and a lot of the dialogue that's happened around it. There's a git tree that I'm going to be moving forward, trying to cherry-pick all the patches that people post that might be related to or help with large blocks. There's also support in kdevops to experiment with this, including a pure-iomap configuration, which allows you to essentially boot into a world where you're just using iomap, with Christoph's patches for instance, and you will essentially not use buffer heads at all. You can test large block sizes today using NVMe with large LBA formats, for instance; you can also use BRD and tmpfs. So to answer your question, that support exists, at least from our R&D perspective. Of course further R&D needs to go forward, further RFCs and so forth, but there's a lot to talk about, so we need to move fast.
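To make the LBA format query concrete, here is a minimal sketch using the raw NVMe admin passthrough ioctl; the nvme-cli equivalents are `nvme id-ns -H` to list the formats and `nvme format --lbaf=<n>` to reformat. The byte offsets come from the NVMe Identify Namespace layout; the namespace ID and error handling are kept deliberately simple:

```c
/* A minimal sketch, not nvme-cli: query a namespace's supported LBA
 * formats via the NVMe Identify Namespace admin command. Run against
 * the controller character device, e.g. ./lbaf /dev/nvme0 */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(int argc, char **argv)
{
	uint8_t id[4096] = { 0 };	/* Identify Namespace data */
	struct nvme_admin_cmd cmd = {
		.opcode = 0x06,			/* Identify */
		.nsid = 1,			/* namespace 1 */
		.addr = (uint64_t)(uintptr_t)id,
		.data_len = sizeof(id),
		.cdw10 = 0,			/* CNS 0: Identify Namespace */
	};
	int fd;

	if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0 ||
	    ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
		perror("identify");
		return 1;
	}
	/* Byte 25 (NLBAF) is the number of LBA formats, zero-based; each
	 * 4-byte format descriptor starts at byte 128, and its third
	 * byte (LBADS) is the LBA data size as a power of two. */
	for (int i = 0; i <= id[25]; i++)
		printf("lbaf %d: %u byte blocks\n", i,
		       1u << id[128 + 4 * i + 2]);
	return 0;
}
```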
Now let me show you guys where we're at. I like to try to keep track of stuff using OKRs; some people are fans, some people hate them. I was actually going to address the folio conversion today, not tomorrow. Tomorrow is actually the iomap stuff, so this is a small list, but it's actually fairly straightforward. I don't think this is really that complex, so let's get into the meat of it. On the iomap conversion, for those that are still struggling with it, there's a session tomorrow to try to help document what the heck iomap is and how to convert file systems over. So if you go to Google and search, just replace "large block size" with "iomap" and you'll see now, hopefully, some sensible documentation, or something coming close to sensible.

I'll mention a few things here. Hannes's presentation did address sunsetting buffer heads, but it didn't address the block device page cache, and that's important because, as people pointed out, it's also used for metadata I/O. It's not clear if we'll ever get to a point where we remove buffer heads. And that's fine, because if you want to support large block devices there's a way out, and that's basically just to use iomap. That means that if you're a file system and you do want to support large block devices, you likely do want to consider a solution for this. Otherwise we will have file systems that do support large block devices and will use a pure-iomap path. It's going to take a while to get there, but I think there's a path. I'll describe a bit of the efforts here; if folks have interest in porting stuff over, let me know, I just want to keep track of how things are going.

On the block layer: Matthew's famous last words were that we don't need anything on the block layer anymore. There's quite a bit of work there, as we ended up discovering. I think we have agreed that we're only going to support a single folio order; I mean, we're at zero-order folios on buffer heads, so that's not going to require any effort, I guess we can cross that out. iomap large block support is a community effort, so essentially, moving forward, you would want to test a pure-iomap path, meaning you want to build a kernel without buffer heads. Yeah, you can shoehorn iomap into a block device as well, but it's a bit hacky; you would have to do that yourself. Let's see, what else: making buffer heads optional, that's Christoph's patches. Ritesh posted patches to improve performance as well, apparently 16x performance, which is pretty impressive, so there are considerations there to use iomap with 4k block sizes on 64k page sizes, for instance. And then there are all these other things; for instance, one of the things I talked about with Christoph in the hallway track was essentially the complexities in the block device page cache and all these mixes and matches. It seems we should probably move the block size out of the super block: we already have the block size on the inode, so we should just use that.
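A minimal sketch of that last direction, assuming the block size is taken from the inode rather than the shared block device super block; the helper name below is illustrative, but `i_blkbits` and `i_blocksize()` are real and live in include/linux/fs.h:

```c
/* Sketch: derive the block size from the block device inode instead of
 * the shared super_block, which is ambiguous when one super_block backs
 * several block device inodes with different block sizes. */
#include <linux/fs.h>

static unsigned int bd_inode_blocksize(struct inode *bd_inode)
{
	return i_blocksize(bd_inode);	/* 1 << bd_inode->i_blkbits */
}
```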
The other odd thing, though, is that you may end up in a situation where the block device cache has a super block with one block size, but different inodes have different block sizes. That's kind of odd, so to fix that we should probably just have a super block per block device instead of a shared one; right now we have one shared super block. And eventually, the other thing to try to help with readahead and the page cache, and this was Willy's idea actually, is to add page order requirements on the address_space (there's a sketch of this below). One of the things that we should review is the implications of going down some of these paths, which is that if we want to support large block devices, we will essentially have an iomap path. All file systems that do want to support large block devices will have to consider either a solution for buffer heads or, I don't know, come up with a library option; it's not clear to me, you may have a better idea on that.

Yeah, so I think one of the things that wasn't clear to me, and I actually jumped back to the original topic proposal to try to get a sense for this: just to take a step back, the topic proposal talked about how all the pressure is coming from the storage vendors, and so I'm just trying to understand what's the business case rationale. Right, you're asking for volunteers to do an awful lot of OKRs, and normally, at least in corporate speak, when we talk about OKRs we always have to talk about the business justification, and I'm a little unclear on what the use cases are. Where would we want a 32k or 64k device with that kind of physical or logical block size? We kind of skipped over that part.

So I don't use OKRs for business reasons; I use them for my own personal development. I use them for everything I'm maintaining, kernel modules and stuff, and they allow me to backtrack and not go crazy about things that I think I'd forget about. So I don't do it for business reasons; this is more for community help, to try to track things for the community. I'm not asking for volunteers to do things; I think people are already doing some of these things, and I'm trying to track what people are doing in consideration for large block support.

Right, but many of us work for companies, and if I'm going to go to my company and ask that someone who is working on company time should do some of these individual projects... So think of it this way: the justification is that we need it for something else. What is the justification for supporting large block devices? Why are we doing this? This is a dialogue: if you do have a reason to support large block devices, this is an outline of work that could be done.

Could I just point out that somebody representing a cloud vendor yesterday said that a 16k block size would really help some of their workloads?

Yeah, so what we said is that we need to support 16k database page sizes, and we have outlined something that we can do in reasonable time that does not require us to use large block devices. Because there's a cost-benefit justification for that, and I can't think of a reason why I should spend corporate time working on this.

There's a simple answer to it: then don't. Yes, this is experimental. And this is something that we think might be a good idea, which we think might be faster or even better, for whatever reasoning of better. But this is something we simply wouldn't know until we tried.
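Going back to Willy's per-mapping order idea above, here is a purely illustrative sketch; `mapping_set_large_folios()` exists upstream, but the minimum-order plumbing shown is hypothetical, not an upstream API at the time of this talk:

```c
/* Hypothetical sketch of a per-address_space minimum folio order, so
 * readahead and fault paths never allocate below the file system block
 * size. mapping_set_large_folios() is real; the min-order part is not. */
#include <linux/pagemap.h>

static void mapping_require_min_order(struct address_space *mapping,
				      unsigned int blkbits)
{
	unsigned int min_order = blkbits > PAGE_SHIFT ?
				 blkbits - PAGE_SHIFT : 0;

	mapping_set_large_folios(mapping);	/* opt in to large folios */
	/* A real version would stash min_order in mapping->flags and have
	 * the filemap/readahead paths round every folio allocation up to
	 * it, e.g. order 2 for a 16k block file system on 4k pages. */
	(void)min_order;
}
```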
So, all right, let me make it a bit clearer. If you only want 16k high-order, aligned block device support, even though your device is 4k, guess what: you're probably still going to end up with a large block device, it's just painted a different color. It's the same thing.

We don't need to do all of the infrastructure. No, no, no, these are just outlines. Yes. Yes. Yes. Right. It's just that I see huge headcount requirements for this and I'm trying to figure out how we justify the headcount.

I think we're getting off the rails here, so let's talk about what we're doing. We're talking about business stuff; we could talk about that on the side. This is not about business. If you do want large block support, this is likely what you want to look into.

Ancient file systems: this is not just a large block device support type of thing. There's already dialogue about removing ReiserFS, and I'm not sure if that's ever going to happen, but it seems like it's, what, four years or something like that before we'd actually remove the file system. There's dialogue also about possibly moving old file systems to FUSE. There are complexities in conversions of older file systems as well. For instance, I was told that there are some file systems that we don't have mkfs utilities for, so testing those seems really complex, because you can't really recreate those file systems unless you get an image dump somehow. It's really limited. I'm not even sure why we ended up trying to support those file systems, but anyway, if you have ideas on old file systems and progress on that, go ahead and share. Next, the page cache: this was the stuff I mentioned earlier regarding the inode address_space using the block bits, and there's a link to Willy's comment on where that came from. Then we have higher-order folio support; the rationale, again, is on the wiki, just look at that.

Memory compaction came up, but in trying to talk to Vlastimil it's not clear to me if there's really anything that needs to be done there. Willy?

As I remember, the memory compaction code has not yet been converted from pages to folios. I think it's more the allocation of fresh pages: right now I don't think it does very well at migrating non-zero-order pages from one zone to another. I think that's just something that needs to get fixed. I haven't looked into it in any detail, but I do think it is a problem area, because it's still working in terms of struct page. Once it's working in terms of struct folio, I will delve deep into it, or somebody else will, because I don't have to be the one who does that work.

Okay. One of the things that became evident to me, at least when doing experimentation with shmem, or tmpfs, with higher-order folio support, was eventually also addressing swap with higher-order folios: that's swap cluster readahead, for instance, and friends. There was also work to evaluate Chinner's effort to support block sizes greater than the page size. I essentially just tried to rebase all that work, Chinner ended up providing good feedback there, and it seems there were only two patches really needed. There's testing ongoing; first I want to test that to ensure nothing breaks.
Make sure that there's a baseline there, at least for XFS on a 4Kn drive with larger block sizes, and if there are no issues there, then hopefully try to see if we can promote getting that upstream. Once that's upstream, maybe we can evaluate testing XFS with large block sizes on a real large block device. For BRD, Hannes has patches posted, and it seems like he'll be following up on that. And are other file systems interested in supporting larger block sizes?

So, weirdly enough, we already do this by default, but for metadata only, because data is such a problem. For metadata we default to a 16k block size, basically, and then you can go to 32k, 64k or whatever. Because COW is unfortunate, sometimes it's better for us to do this in big 16k chunks, so your default Fedora install has 16k metadata block sizes on a 4k page size. We'd love to do this for data as well, it's just that data is trickier right now.

Right, right. Okay, so it's not on the roadmap right now, but metadata certainly?

Yeah, metadata has been there for years, it's been working well. For data we're doing iomap first, and then hopefully iomap gives us everything we need.

I had not considered metadata requirements for file systems. That would be interesting, because right now we control high-order folios by the block size, and the question would be how to do that for file system metadata.

Yeah, this is where that metadata abstraction thing that we have comes in. We just have, like, an array of page pointers that we use, so at the point that we can start allocating full 16k folios, we can just drop that in there instead of having, like, 4k things. That would be cool.

Is the address space somehow used for metadata, or not at all?

It is horrible, what we do. We have a lot of metadata; we're not like ext4 or XFS, we have gigs of metadata, like hundreds of megs of metadata that we write out at any given time. So we have a fake inode with an address space that we hang everything off of, and then we just set the aops to, like, nothing, and we manage it all internally. If you remember, probably five or six years ago, I wrote a bunch of code to do byte-size throttling for balance_dirty_pages. This is the big problem, and this is why we do it this way: with as much dirty metadata as we generate, we can overwhelm the system and use too much memory, so we need to rely on balance_dirty_pages. We didn't want to redo all that work, so we use the inode to take advantage of balance_dirty_pages (there's a sketch of this pattern below). I tried to do this in a generic way, and I just got distracted by higher-priority things, but that's why we do it.

Got it, thanks. Yes?
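A rough sketch of the pattern just described, not btrfs's actual code (the real btree_inode setup lives in fs/btrfs/); the point is that hanging metadata off a private inode's address_space lets the generic dirty throttling see metadata pressure. `balance_dirty_pages_ratelimited()` and `filemap_dirty_folio()` are real kernel APIs; the names below are illustrative:

```c
/* Sketch: a file-system-private "fake" inode whose address_space hosts
 * metadata pages, mainly so that balance_dirty_pages_ratelimited() can
 * throttle metadata writers the same way it throttles data writers. */
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

static const struct address_space_operations meta_aops = {
	/* intentionally near-empty: writeback is managed internally */
	.dirty_folio = filemap_dirty_folio,
};

static void meta_mark_dirty(struct inode *meta_inode, struct folio *folio)
{
	filemap_dirty_folio(meta_inode->i_mapping, folio);
	/* throttle heavy metadata dirtiers like any other dirtier */
	balance_dirty_pages_ratelimited(meta_inode->i_mapping);
}
```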
So, to come back to Ted's business case thing: this looks like a lot of work. Your Chinner thing, you said there were only two patches; I remember that patch set was pretty huge. Is what you're saying that once we've done the folio conversion, what looks like 15 million man-years of work will actually fall out?

That's correct, essentially. Your file system should be a lot easier to convert and add large block size support to, if you want that. One of the issues, of course, is the metadata stuff, which is currently using buffer heads. I think if Darrick is on, he can talk about what work might be needed for XFS, because I know you talked last week about supporting XFS block sizes larger than the page size, and maybe you could comment on the XFS metadata buffers, the xfs_bufs.

So, as was previously touched on, XFS has its own buffer cache hidden inside of its code base, so we don't use address spaces or any of the strange things that btrfs does with address spaces. As far as metadata goes, we already support having a file system block size greater than the page size; it works, at least until memory fragmentation kills you. The only part that actually doesn't work right now is the iomap part, because we don't have a good way to tell the memory manager: hey, we want multi-page folios and they have to be at least this size. Right now all the iomap code kind of assumes that the block size is always less than or equal to the page size, which we could keep true if we could require multi-page folios of, say, 8k for an 8k file system block size. So I think what Luis is talking about, when he says that this huge grody patch set collapses to two, is that the only piece that's left is just making the memory manager give us, say, 8k folios, and then seeing what falls out of the system once memory fragmentation comes up and tries to eat us alive. Now, Matthew has a theory that if everybody uses 8k in the system, then actually it will be fine and we'll just reclaim things as we do now. I suspect there are probably going to be other weird issues involving making it clear to the memory manager that if you want to reclaim part of this 8k folio, you have to reclaim all of it, not just one of the two pages. Basically, you can't split large folios down to anything smaller than whatever granularity we established in the first place. The last time anyone actually tried running XFS with 8k blocks on x86, obviously with the file data paths disabled, it worked fine, other than XFS became the toy file system where you can create directories and extended attributes all day but you can't actually read or write anything to a file. So I think, as far as XFS goes, we're nearly there; we just need some things that have traditionally raised eyebrows amongst the MM folks, along the lines of: hey, what kind of crack are you guys smoking?

Hey, I just wanted to touch on something you said there. When the MM is reclaiming memory, it's always going to take an entire folio, so it's always going to reclaim an entire folio. The tricky part is keeping the MM from fragmenting larger chunks of free space; that's where we need to do work, and Johannes and Mel are currently arguing about a patch series on linux-mm that will try to do somewhat better in terms of keeping large chunks of memory available.
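A minimal sketch of the invariant being described, reusing the hypothetical per-mapping minimum order from the earlier sketch (`folio_order()` is real; the rest is illustrative, not upstream code):

```c
/* Illustrative only: the floor below which a folio must not be split.
 * For an 8k block file system on 4k pages, min_order is 1: reclaim may
 * take the whole folio, but splitting below the floor would leave the
 * mapping with memory the file system cannot do I/O on. */
#include <linux/mm.h>

static bool folio_split_allowed(struct folio *folio, unsigned int new_order,
				unsigned int min_order)
{
	return folio_order(folio) > new_order && new_order >= min_order;
}
```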
Mm-hmm. Yeah, I mean, I think the answer is we really just have to stand up a bunch of test code to actually see what happens: make the theoretically small changes to XFS, have somebody stand up MySQL or something on a system with a ridiculously small amount of memory, and see if the whole thing falls over any sooner than it does with 4k pages and 4k file system blocks.

Right, it gets to the question of how to determine that easily. What sort of testing do we want to do? How do we detect that, how do we measure fragmentation and all that stuff? How do we replicate these sorts of dire situations that everyone is afraid of, how do we exacerbate them? What do we do, what's our test plan? We've got to really sit down and carve that out. And that's part of the next thing here: testing. Basically, we want to get to the point of testing first: XFS on a baseline of 4Kn drives with larger block sizes. But then, what does it mean when you start enabling it on a real large block device? How do we stress that, how do we measure fragmentation, how do we cover these things, what metrics do we have available? This is part of a community dialogue, and this is why we're here, so maybe we can talk about that on the side, for those who are interested, and so forth. But that's what I have. So, questions?

Hey Luis, this is Hitesh, am I audible?

Yeah.

Yes, so I do have a point to discuss. Is this a similar problem, I mean, should we have something like a multi-order block size? For example, similar to the problem that memory management is facing with multi-order folios. My point is, for data blocks, for example in ext4 we have bigalloc, where we can actually do, say, 16k blocks to be written, but for metadata, can we still have 4k? I haven't looked into the 16k stuff from ext4; I was referred to a YouTube video, so I will have to go do that, but Ted may be able to speak about it. Basically my point was: should we consider having different block sizes for different data types? For example, if you have data you can actually go and write 16k, whereas if you have metadata, it is expected to basically track that at 4k.

Yeah, so the reason why ext4 did bigalloc, and we did this many, many years ago, was because the metadata problem was deemed Too Hard™ and the MM fragmentation problem was considered Too Hard™. So bigalloc simply allocated on-disk chunks, you know, 16k chunks, but the assumption was that the page cache could still be 4k and it would be fine. In a folio world it'd be really easy to have a 16k or 32k bigalloc ext4 file system, if we had a way to tell the folio system: by the way, this is a bigalloc 16k file system, all the folios should be 16k or larger. It'd be really easy to do that, and we wouldn't change anything else, because the fundamental metadata block size is still 4k, and that would stay the same. So in a folio world this is actually pretty easy. The way ext4 did bigalloc was simply because we decided to solve the easiest parts of the problem first; we were not trying to solve the general case. So, yeah.
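As a rough illustration of the split Ted describes, allocation granularity versus page cache granularity (the helper below is illustrative, not ext4's code, though mke2fs really does expose this via `-O bigalloc -C 16384`):

```c
/* Illustrative only: bigalloc allocates in clusters of
 * 2^(cluster_bits - block_bits) blocks, while the block size seen by
 * the page cache stays 4k. A 16k cluster over 4k blocks is a ratio
 * of 4; a folio-world bigalloc would simply also ask the page cache
 * for folios of at least cluster size. */
static unsigned int blocks_per_cluster(unsigned int cluster_bits,
				       unsigned int block_bits)
{
	return 1U << (cluster_bits - block_bits);	/* 14 - 12 -> 4 */
}
```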
If my understanding is correct, then perhaps the same solution that would be used by tmpfs might be leveraged by ext4, which is to eventually use the inode address_space for higher-order folios. It might be a bit hacky, but, you know, it's a different way to look at it. No? Then I'm not sure. Any other questions? Thanks. All right, thanks.