Hello, my name is Hannes Reinecke. Some of you might know me, some don't. I've been working for SUSE, as you can see there, for, well, an eternity — nearly 20 years now — and I've been involved with Linux even longer than that. The first Linux version I started off with was, I guess, 1.0.5 or something, back in the day; really that old. Since then I've been active in various things regarding Linux, and recently I've been involved with storage, and NVMe in particular. And this is one of my pet projects, really, which finally came to life: the quest for large blocks.

So what is it? Why do we do it? When you do I/O, it has invariably been done in larger chunks called blocks. These blocks are currently limited by the page size of the Linux kernel, which is typically 4K on an x86 box. This page size limitation is also implicitly assumed by various drivers and subsystems. However, that's not the end of the world. Some systems and applications would actually benefit from larger pages. There are certain databases which really would like to talk in 16K increments, because that's how the database is organized internally. Also, some hardware would really benefit if we could move to larger block sizes, because then the internal overhead in the drives wouldn't be as heavy, and the entire drive would become more efficient — and cheaper.

So why do we even have blocks? Couldn't we just say "write this data, go"? Well, yeah, sort of. The problem is that you can't do an atomic I/O. There is no single assembler instruction "do I/O on these bytes". You always need several instructions: setting up the I/O, transferring the data, getting back the results, and so on and so forth. This obviously increases the latency of each and every I/O you do. So what you try to do is minimize the amount of I/O — meaning the actual number of I/Os, not so much their size. And then the question is: what is the best ratio? How much data should I transfer per I/O? There was quite a bit of experimentation back in the early days. If anyone here is familiar with mainframes: mainframes still have the concept of variable block sizes for their drives, which is very interesting — you have to decide, for each and every I/O you do, what the block size should be. And there was a lot of experimentation going on, and eventually researchers at Berkeley — of course, as usual — figured that 512 bytes gives you a good ratio between the amount of data you store, which typically tends to be very small, and the overhead you incur by moving to larger blocks. But then again, that was decades ago. We stuck with it, and we still keep it that way for now.

Okay, that's the I/O side. Now to the page size. Why do I talk about page size at all? The thing is that the CPU has a memory management unit, and that gives you some hardware-assisted bookkeeping, allowing you to tell whether a given memory area is dirty or not — i.e. whether it is out of sync with the disk and needs to be synchronized. That is actually hardware-assisted, and as it's hardware-assisted, it can only operate on certain sizes. You can't just pick an arbitrary size; the sizes you can choose from are given by the hardware.
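To make "hardware assisted" concrete: the MMU sets an accessed/dirty bit directly in the page table entry on the first write to a page, and the kernel merely reads it back. A minimal sketch — pte_dirty() is the real helper; the wrapper function around it is made up for illustration:

```c
#include <linux/pgtable.h>

/*
 * Hedged sketch: the MMU sets the dirty bit in the page table entry
 * when a page is first written to; software only has to read it back.
 * pte_dirty() is real; this wrapper is purely illustrative.
 */
static bool page_was_written_to(pte_t pte)
{
	return pte_dirty(pte);	/* hardware-maintained dirty bit */
}
```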
Say, for x86, you have a choice of 4K, and I think 2M is the next increment you can do. And that's it, nothing in between. Other architectures like PowerPC or ARM give you a bit more flexibility; there are still some POWER or even ARM systems out there which use a 16K page size. But that's only because the hardware supports it — you couldn't do that on x86. And so, for Linux, we have this compile-time page size setting, which tells the compiler and the entire source code: all right, the page size is now that value. As we want to be compatible across architectures, the common setting really is 4K, basically across the board. It certainly is for SUSE, and I guess it's the same for RHEL; so that's the common setting we have. Page size, then, means the unit the memory management works with.

So now we have the memory management talking in page-size increments, but, as I already mentioned, we might need to move data from disk into memory and from memory back onto disk. The thing which does that is called the page cache — with greetings to Willy. That's buffered I/O, because we are not writing directly to disk, but rather buffering the I/O in the page cache, which also works on memory pages. So guess what: indeed, it also works in 4K page increments. When you work with memory pages, there's a hardware bit telling you: all right, this page is dirty and needs to be synchronized. That then triggers the I/O to bring the page contents and the disk back in sync, and this will typically be done in page-size increments. You could transfer several pages at once, but then clearly all of these pages would have to be marked dirty together, such that the hardware logic behind it can still work. So if we had units larger than a page, we would need to treat them as a single unit. That is called a folio in Linux — and that is the... oh, I don't want to be there yet. Coming back to folios later.

And then we have to tell the block layer: okay, I want to do I/O. As already mentioned, there are two types: buffered I/O and direct I/O. For direct I/O it's trivial — that's basically user space saying "transfer this data now". All right, you do that; nothing much you can do, because user space already tells you how the I/O should be laid out. Buffered I/O is different, because that comes from the file system, and the file system typically just cares about the amount of data; it doesn't really care how you need to organize it internally. To do that, there are actually several interfaces for doing I/O. The primary one — the original one — is called buffer heads. Then there's a successor, or rather an underlying structure, called struct bio. And then there's a thing called iomap. Again, coming back to those.

But in order to transfer larger blocks, we need to convert the page cache to folios. Now, there it is. I've been waiting for this slide ever since I first heard the word "folio" — I said, I need a slide with the First Folio on it. Somebody got it? Oh, brilliant. At least one. Very good. Okay. So, folios. Folios are basically a common structure for a set of pages.
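In code terms, this is roughly what the common handle buys its users — a hedged sketch using real folio helpers from the kernel headers; the function itself is made up for illustration:

```c
#include <linux/mm.h>

/*
 * Hedged sketch: one folio handle covers one or more contiguous pages.
 * folio_size(), folio_nr_pages() and folio_test_dirty() are real
 * helpers; this function exists only for illustration.
 */
static void describe_folio(struct folio *folio)
{
	size_t bytes = folio_size(folio);	/* may exceed PAGE_SIZE */
	long pages = folio_nr_pages(folio);	/* 1 for a plain page */

	if (folio_test_dirty(folio))		/* one flag for the whole unit */
		pr_info("dirty folio: %zu bytes, %ld pages\n", bytes, pages);
}
```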
Why a common structure? As it so happens, we don't only have plain pages; the memory management also knows of other types. There are things called compound pages, which are essentially an array of pages, and there are huge pages, an improvement from, well, several years ago. Initially that was a separate file system, and then it turned into something more flexible called transparent huge pages, THP — some of you might have heard of that one, or seen it written up on LWN. And each of these has its own peculiar behaviour, which makes things quite awkward, because all of them can be addressed via a struct page. So if you have a struct page, you really have to know: is it a plain page, or maybe something else? We ran into some really funny issues there when trying to transfer pages down to the drives — the sendpage_ok() one, say, because that has to tell us: is this really a page, or do I need to do something else?

And so Matthew Wilcox invented a structure called folio, which is basically just an overarching type for all of these distinct things. All of the various page types can be addressed via a struct folio. The important bit for our case is that a folio can be larger than a page, which is nice, because that is precisely what we need if we want to transfer larger blocks: we can then identify one of these large blocks with one folio. That would work. And with that we can, in theory, do large-block I/O. However, that requires us to convert at least the page cache — and more likely the whole memory management — over to folios. This was first proposed by Matthew Wilcox in 2020, and has been a prominent topic at the Linux Storage and Filesystem summit ever since. And, as you can imagine, there were lots and lots of controversial discussions — at one point someone even refused to merge it at all, because why would you need that? Well, yeah, we do. And it's ongoing work. This here is just a simple count of the number of uses of struct page versus the number of uses of struct folio, and as you can see, we have a long way to go. We will get there eventually, but we are certainly not there yet.

So, buffered I/O — how do we do buffered I/O? This, as some of you might know, is the diagram of the Linux storage stack. And the really depressing thing is that we are just dealing with this little grey rectangle in the upper right corner. That's the area we are looking at; all the remaining things are none of our concern.

So: buffer heads. Buffer heads are the original structure, present since 0.01, meaning the very first release of Linux. A buffer head is essentially the representation of a disk sector — 512 bytes, or at least the assumption is 512 bytes. It's linked to a page, and there's internal caching — the famous buffer cache — which saves on doing I/O for each and every access to the buffer heads themselves. They are still in use by most file systems, actually, and the pseudo file system for block devices also uses buffer heads. The page cache itself was only implemented later, because buffer heads did their own buffering. And just for the fun of it, this is the actual buffer head structure. As you can see, it is quite something. And the question really is: do we need all of that, or couldn't we have something simpler? That is what ended up becoming the bio, the basic I/O structure. It was introduced by Jens Axboe back in 2.5, and it's the basic I/O structure for the device drivers themselves.
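For a feel of the interface, here is a hedged sketch of building and submitting a bio on a recent kernel. bio_alloc(), bio_add_page() and submit_bio() are the real block-layer calls; the surrounding function, its arguments, and the omitted error handling are assumptions for illustration:

```c
#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Hedged sketch: build a bio for a single page and hand it to the
 * block layer. Completion is asynchronous; a real caller would set
 * bi_end_io and handle errors.
 */
static void read_one_page(struct block_device *bdev, struct page *page,
			  sector_t sector)
{
	struct bio *bio = bio_alloc(bdev, 1, REQ_OP_READ, GFP_KERNEL);

	bio->bi_iter.bi_sector = sector;	/* position, in 512-byte units */
	bio_add_page(bio, page, PAGE_SIZE, 0);	/* attach the data page */
	submit_bio(bio);
}
```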
So, this allows you to vectorize I/O, meaning you can have an array of pages attached to a single bio. And you can route and re-route bios to devices as you see fit. Device mapper is the prime example, because it does precisely that — re-routing and re-formatting bios to do whatever it wants: RAID, or whatever else, you name it. The bio is actually the primary structure for the block layer, and buffer heads are nowadays implemented on top of struct bio: the buffer heads are converted into a bio, and that is what sends out the actual I/O. The bios themselves are also used directly by quite a few file systems — XFS, for example, among others — so those file systems don't use buffer heads, but bios directly.

And then there's iomap. As I said: Christoph Hellwig gone crazy, as he is wont to do. He's not here — can we not record this? Sorry. He got fed up with all these various structures for I/O and invented his own thing called iomap, which is, well, the modern interface — and which thankfully already operates on folios. It basically does away with all the intermediate structures: it just provides a way for a file system to tell the block layer how the I/O should be mapped, and then it's up to the block layer to lay it out correctly. It obviously hooks directly into the page cache to make that work. Some file systems have already been converted, most notably XFS, btrfs, and zonefs — someone here knows about that one. For those, clearly, nothing needs to be done. But documentation? There must be some; only it's hard to come by and not very accurate, because the interface keeps changing. It is under active development, so with every new release you will find new features which are not really all that well documented.

So what do we need to do to actually get to large blocks? This is just for the fun of it: I googled "large block", and it said, yeah, it's an area of more than 500 square metres. Not quite what we want. Anyway. There is this long-standing trend in the storage community that buffer heads must die, the reasoning being that it's an ancient structure and really a legacy interface, and everything should be converted over to struct bio or to iomap. And then there is this quote from Monday, I guess — or Friday. You can read it for yourself; I'm not sure whether I can read it aloud without violating any code of conduct. So maybe that direction is not the one we should pursue — not if we want Linus to merge that code.

So what can we do? Conversion to folios is nice, but it really only affects the page cache and the memory management subsystems, so we need to do more. And the problem is that buffer heads actually assume that I/O will be done in increments smaller than the page size — while we now have the opposite: increments larger than the page size. There are several routes you could take: convert everything to iomap; or just switch buffer heads off, don't use them, and compile out everything that does use them; or try to update buffer heads. Convert to iomap: in an ideal world, that's what we would want to do, because iomap is the modern interface, and it's actually quite a nice interface. I mean, say what you will about Christoph Hellwig, he does have nice, workable ideas — it's just that he also has complex ideas. Some file systems have already been converted.
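What such a conversion amounts to, roughly: the file system supplies an iomap_ops whose iomap_begin callback merely describes where a byte range lives, and the generic iomap code performs the actual I/O. A hedged sketch — struct iomap and struct iomap_ops are real; everything prefixed myfs_ is hypothetical:

```c
#include <linux/fs.h>
#include <linux/iomap.h>

/* Hypothetical block lookup, standing in for the file system's own
 * extent mapping; here we pretend the file is laid out linearly. */
static u64 myfs_disk_offset(struct inode *inode, loff_t pos)
{
	return pos;
}

/* Hedged sketch: map [pos, pos + length) for the generic iomap code. */
static int myfs_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
			    unsigned int flags, struct iomap *iomap,
			    struct iomap *srcmap)
{
	iomap->bdev = inode->i_sb->s_bdev;
	iomap->type = IOMAP_MAPPED;		/* extent exists on disk */
	iomap->offset = pos;			/* file range this mapping covers */
	iomap->length = length;
	iomap->addr = myfs_disk_offset(inode, pos);	/* disk position, bytes */
	return 0;
}

static const struct iomap_ops myfs_iomap_ops = {
	.iomap_begin = myfs_iomap_begin,
};
```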
So in an ideal world, we would convert all file systems to it. Now, if anyone has followed the discussions on the kernel summit mailing list for the next maintainer summit: there is a very long and lively discussion about what we should be doing with legacy file systems. The problem with those is that there isn't really an active maintainer anymore, because, well, he is long gone — if he ever existed. Sorry about using the pronoun "he"; I will be inserting the properly gendered forms in the recording. And of course the documentation of these file systems is hard to come by, if it ever existed. Most of the really old legacy file systems were actually reverse-engineered from what people saw on the disk, without any idea why it was there. So how would you go about converting those? That really is a hard thing to do. And you would need proper documentation for iomap itself, to enable developers who are not that familiar with it to actually do the conversion. So, hmm — possibly not the way we should be going.

Of course, we could also just say: wherever there's a buffer head, compile it out. There's a patch set — from Christoph Hellwig, again — which does exactly that. Then, trivially, buffer heads won't be used anymore, and there are no discussions to be had. That patch actually went in with 6.5, so with 6.5 you can actually switch buffer heads off — or rather: if you compile out all file systems which use buffer heads, buffer heads are switched off implicitly. Hmm. That is a bit of an odd interface, but yes, it is backwards compatible. It's just really hard to get there. And the interesting thing is that some file systems which are actually in common use, like FAT or ext3, will then no longer work, because they haven't been converted and hence won't be compiled in.

And then there's always the possibility of updating buffer heads. This was actually the direction suggested by Josef Bacik at the last LSF/MM in Vancouver: you could just convert the whole thing to folios, and then see that you get rid of the assumption that I/O always has to be smaller than the page size — it could also be larger. Now, this could be quite trivial — if, in an ideal world, everything is already coded that way and just works — or it could be a complete nightmare, if the assumption that I/O is always smaller than a page is implicitly coded all over the place and you have to do a full audit of everything. Originally, when I heard about it, I said: that is not the direction I would want to go. And of course there's the whole "buffer heads must die" sentiment and so on. So later that day I was sitting at the bar, after having looked at the code, saying: oh God, that's a complete nightmare — how on earth is anyone supposed to convert this, and what are buffer heads doing there at all? Why do we even have them? — complaining bitterly to my neighbour. Only to figure out later that this was actually Andrew Morton, who said: well, back in the day when I wrote it, it was quite good — and it still works, doesn't it? Thank God, nothing that a good drink wouldn't fix afterwards. But yes, it really made me reconsider: maybe Josef was right after all.
But if you're updating buffer heads, you get into all these grubby details which you really wouldn't have wanted to think about earlier on. Like: there's a void pointer attached to the struct page — the private field, and well, it's a void pointer — which points to the buffer heads if you're using buffer heads, and to the iomap structure if you're using iomap. So, hmm. And then factor in that you're actually running in the page cache, meaning this page is shared by everyone accessing this very same file — and it really makes a difference whether you're talking buffer heads or iomap. Basically, it means that the page — or the folio — can work either with buffer heads or with iomap. And that is a problem for the block device, because, as it so happens, surprisingly enough, every file system runs on top of a block device. And if that block device is using buffer heads, well, the file system had better use buffer heads too — otherwise you might get interesting results, like very nice kernel crashes. So the mix-and-match approach is something which needs to be considered carefully. The other thing is that, well, UEFI requires a FAT file system for booting. So if you switch off buffer heads, you won't have that file system, and booting UEFI machines will be very tricky. And of course review will be hard, because you need to spot all the dependencies on the page size — including the implicit ones, like an index being incremented by one when iterating over pages.

So, then again: why do we do all that? Is it really worth it? Well, I think it is — but that's just me. The one thing I know for a fact is that databases really want to do larger I/Os, so they would definitely benefit. And the hope is that we get more efficient I/O, because in most cases file systems will already be submitting larger I/Os — btrfs goes through a lot of pain to ensure that large I/Os are always being sent, and XFS similarly. And of course we would make the drive vendors happy, because the drives they make could be more efficient, or cheaper.

So, is there anything I actually did, or is it all just talk? Well, I had been happily coding away and had basically finished my patch set last week, when suddenly Luis Chamberlain popped up and sent a patch set: oh, here's a patch set converting everything to large blocks. Thank you very much. You could have talked to me. So, just to be clear: I do not work for Samsung, I have nothing to do with their work, and what I'm presenting here is entirely my own work and no one else's. Of course I will be talking to Luis and his colleagues to join our approaches and come up with a combined patch set. But yeah — isn't open source great? Just when you think you're done, someone else has done the very same thing, and faster than you. All right, that's the way it is.

Anyway, what did I do? I figured that if we moved buffer heads from assuming they're attached to a page to assuming they're attached to a folio, the underlying concept would still work: the I/O done via buffer heads would still be smaller than the attached unit, namely the folio. And we could keep all the buffer head accounting in place, because the overall rules don't change. Also, the page — in this case, the folio — would still have a single pointer to the buffer heads. That keeps the number of changes to the buffer head code and the page cache at a minimum.
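A sketch of that invariant: one (possibly large) folio, one private pointer, and a ring of buffer heads tiling it. folio_buffers() and the b_this_page ring are real kernel constructs; the walk itself is illustrative, not the actual patch:

```c
#include <linux/buffer_head.h>

/*
 * Hedged sketch: the folio's single private pointer leads to a circular
 * list of buffer heads, each covering one logical block — blocks now
 * tile a folio rather than a single page.
 */
static void visit_blocks(struct folio *folio)
{
	struct buffer_head *head = folio_buffers(folio);
	struct buffer_head *bh = head;

	if (!head)
		return;		/* no buffer heads attached yet */
	do {
		/* each buffer head stays smaller than the attached folio */
		bh = bh->b_this_page;
	} while (bh != head);
}
```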
But then you have to adhere to certain guidelines when converting, because suddenly you have different units which you need to look at carefully. Namely: everything the memory management works on still goes in 4K — page-size — increments. The buffer cache works on folios, at whatever size the folio was allocated with. And the buffer heads themselves now also work on folios. So far, so fine. But buffer heads don't actually do the I/O: as I just said, buffer heads sit on top of the block layer, on top of struct bio. So what about that? Well, that's where it gets ever so iffy, because the block layer works in 512-byte increments, full stop. That's the logical base unit for the block layer, and there is literally no way of changing it — it is baked in everywhere, and no, you do not want to change it. Thank God, though, the block layer is not what actually does the I/O: the I/O is done by the lower-level drivers, and those already merge adjacent pages — adjacent bytes — into larger units. So if you feed them a larger unit to start with, it will be enumerated in 512-byte units, but the drivers themselves will reassemble it into the original folio — or rather, the data pointed to by the folio. Which means there's actually nothing to be done; it should just work. It's not really obvious, but yeah, it seems to be good enough for now.

And that's essentially what the final patch set does. Of course, one needs to tell the page cache to allocate large folios, and ensure that everything really works on folios and really increments by the size of the folio, not by the page size. And then we also need to transfer the block limits, because it's really the driver that tells us which size we should be using in the page cache — so we need an interface for that. And that worked quite well; actually, too well, because the first patch set I did also covered NFS, and it turns out that NFS, to be more efficient, tries to transfer really large chunks, like 128 megs. Which, surprisingly, worked for a while: a copy ran for quite a long time until it finally went out of memory — which neatly proved that large blocks lead to higher memory fragmentation. If anyone wanted proof, there it is.

So, was that it? Well, sort of. It's nice that we enabled the Linux kernel to talk to drives with large block sizes, but really there are no drives with large block sizes yet, because no one can talk to them. So what I did was update BRD, the block RAM disk driver, to actually announce and support large block sizes, so that you have a test bed to try things out with. That proved to be quite easy, and it worked quite well. And you can even — surprise, surprise — use it as a backing device for the NVMe target, and voilà, even NVMe can speak large block sizes. That was quite cool. But of course, some testing is still needed — especially the splitting and merging in the block layer needs to be exercised. It seems to work for my use case, but, well, what am I to say? And what else? QEMU would need to be updated; for QEMU it should be quite trivial to support large block sizes — in theory you would just need to modify the emulated devices to announce large block sizes.
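For comparison, announcing a block size on the kernel side is a one-liner per driver. A hedged sketch using the classic queue helpers — blk_queue_logical_block_size() and blk_queue_physical_block_size() are real; the 16K value and the wrapper function are just an example:

```c
#include <linux/blkdev.h>
#include <linux/sizes.h>

/*
 * Hedged sketch: a block driver announcing a 16K block size, which is
 * what the page cache would then pick up through the block limits.
 */
static void announce_16k_blocks(struct request_queue *q)
{
	blk_queue_logical_block_size(q, SZ_16K);
	blk_queue_physical_block_size(q, SZ_16K);
}
```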
I haven't looked at the QEMU code, though. As usual, in theory everything's easy, but knowing QEMU, it's probably spread all over the code — so I'm not sure. And, yeah, you would need to exercise the drivers with it, and you could test with other subsystems. The other thing is that I need to unify my patch set with the one from Samsung; chances are I'll be speaking to Luis next week, and the hope is that we'll be able to merge the two. And of course there are the usual reviews and follow-ups and everything.

And then there's the issue of memory fragmentation, which will be a real issue once we move to larger block sizes. I guess it should be okay if we're just talking about 16K, but the problem remains that the memory management continues to run on pages. So any internal allocation will probably still be done in pages, which means we will get increased memory fragmentation there. We could maybe move away from that by switching the entire system to allocate in block-size units — but that assumes you have one block size. If you have several drives, each with a different block size, that doesn't work anymore. One possible thing, which might be worth doing anyway, is to update slab — the memory allocator — to run on higher-order folios rather than just the page size, and then obviously convert everyone who's using alloc_page() to use slab instead. That would get rid of the internal page allocations, and with that, I hope, it might be possible to reduce the memory fragmentation to the point where you can do away with it altogether, because you would always allocate in folio-size increments. But then you still haven't solved the problem: what do you do with several drives, each with a different block size? For that I really don't have a good idea — and I'm open to suggestions if someone has one. And of course I need more testers, because this really needs testing.

And, well, in case you're really bored: there's still the block layer, which, as I said, operates in 512-byte units, and there really is no good way to change that. Thank God the data itself is not stored in the bio, but rather in an attached vectorized structure called the bio_vec — which is basically this structure here. For that, it should be possible to move from struct page to struct folio and just use that — or even have a union which would allow access either via the folio or via the page, because the first bits are the same, so you can basically cast from a folio to a page and vice versa. So that would be possible. I haven't looked at it, I haven't attempted it; it should be possible, but this is for someone really determined, because it is no fun whatsoever — it's basically all over the place. I just checked: struct folio is mentioned 10 times in the block layer, and the bio_vec about 4,000 times. So what could possibly go wrong?

And with that, I'm actually done. Hey, it looks far better on the large screen than on my laptop. Okay, fine — yeah, I did something. All right, good. Anyway, thank you very much — and I'm open to questions. I'm still open to questions. You can always find me later on; I'll be sticking around a bit longer, here or in the lobby and the hallway. Thank you very much for your patience.