Is Darrick on the line, by any chance?

Yeah, I'm still around.

Awesome, cool. So I'd like to thank Darrick and also Christoph for helping to review this documentation. One of the things I've heard at LSFMM for a while now is: oh, you know, iomap is complex, it's difficult. Well, hopefully this might help. So we have a series of guidance, some sort of concocted documentation and brainfarts, because this is still evolving, right? We only have so many filesystems using it. So take a look: you can just search for KernelNewbies and then KernelProjects/iomap, and there are some slides there as well. Let me try to pull up the slides. In the meantime, do folks have pressing things regarding iomap that they would like to discuss? Otherwise we can review some of this documentation and take questions at the end.

Yeah, while you pull up your documentation. The one thing which I found most confusing about iomap is the units of the arguments: are these bytes, sectors, pages? It would really help if the documentation, or the header files, said: this iomap argument is in sectors, and this one is in bytes. That really tends to confuse things, because iomap as of now only works on pages, so pages are the only granularity it can actually operate on; but at the same time you can also give it a sector argument, which then indicates: I'm only interested in the sector which is part of the page you are about to read. So it's a bit confusing, because you don't really know what units the input should be in. That's one thing. And what I have been desperately missing is the relationship between the various ops structures you have, because I counted at least three different types of ops.
So you have the read ops, you have the write ops, you have the writeback ops, and so on. Why are they different?

Yeah, I actually had the same exact reaction when I started reviewing this, because I came from the outside trying to understand it first, and I was like, what the fuck, you're right. I looked at it and I didn't know. Darrick made it clear on the mailing list, and it makes sense in retrospect. Think of it this way: if you had one routine handling all these different types of operations and all these different possible flags, you'd be branching the hell out of that routine; high cyclomatic complexity, I think it's called. So instead, you basically have dedicated, very small, specific operations tables for the filesystem to fill in.

That is fine, if they're documented, but they weren't.

Well, that's the point here. So the way to think about this is to target your filesystem conversion by the different main operations: buffered IO, direct IO, reads. There's one for extent mapping too, so let's review that, because I think there are some important warnings about bmap here that Darrick made, and Christoph too. Who here is a fan of bmap? Great, it's a good thing that people laughed. So there are warnings here about bmap and why it's not ideal. Pretty much, you split it by these operations, and you have very specific, targeted iomap calls for them. So really, iomap is an iterator over byte ranges, and it also tries to help replace the old block-based operations that we have. One of the things that is missing here is that there are no helpers for metadata, right? So you either have to implement it yourself like XFS does, which we were talking about. And this is why it's an issue: you have to consider that for a new filesystem. I am curious.
So that is essentially part of the buffered IO discussion we had yesterday.

Yes. For me, a metadata iterator will have to help us there, because you can't just reuse what you're about to set up as-is. The question, I think, is which filesystem would be willing to port over to something new, right? It doesn't seem like there's enough interest or understanding.

Sorry, but you can't force anyone to port over, right? So in the end, I guess iomap is the way to go; basically it's how filesystems should be designed nowadays, because that's the new API.

Yeah, absolutely.

And also, from my perspective, iomap is basically a prerequisite if we want to go to larger block sizes.

Well, yeah, absolutely.

Because I had to look at buffer heads, and, sorry Josef, but converting buffer heads over to folios: no, just say no. So really the only way we can make this work...

Wait, say it again, converting what to what?

Buffer heads. Converting buffer heads over to folios is either a complete no-brainer, because they're effectively already using folios so there's nothing to be done, or it's a hell of a lot of work, because you'd have to redesign the entire buffer head thing.

Right. So essentially, it seems from the other meetings that we decided we're going to keep the buffer head path with order-zero folios. That means you don't get large blocks there either.

Yes, iomap is the only way forward for large block sizes. But at the same time, if you're running with large block sizes, you have to make a distinction somewhere. The current patchset which Christoph did makes buffer heads a config option.

Yes, correct. And it really depends how that patchset moves forward. I mean, I've only found a couple of bugs there so far. But think of it this way, right?
If you don't depend on a filesystem that requires buffer heads, why would you need it?

But I do.

Yes, okay, well, a distribution does.

No, not the distribution. You might want to boot your system, and this system might have UEFI firmware, which means a VFAT filesystem, which you actually need for booting.

Again, I'm not sure how Amazon Linux did it, but I can tell you they're not using VFAT at all.

And that's brilliant for Amazon. Well done. It doesn't help me.

I'm saying it seems to be possible somehow, with EFI vars and systemd. I don't know how they did it.

Exactly, so there's no way I could be doing that, because the systems I am running on have UEFI firmware, and there's no way in hell I can change it. And this UEFI firmware has to read a VFAT filesystem in order to boot. So there will be VFAT, inevitably, and on most USB sticks too. So yes, I might be able to boot a system which doesn't have support for VFAT, but that means I can't access the VFAT partition from the system, meaning I can't update it. Anyway, it doesn't really matter, because I have converted it, so that is not the point.

So you have patches for that?

Yeah, I'm working on it. I can read stuff; I might be able to write eventually, haven't tried that yet.

Awesome. So for those who haven't considered seeing if it's possible to convert your filesystem over, here are some generic guidelines. I'm sure folks who have already started working on that may be able to provide some input here. The general guideline is to try to look at direct IO first.

Would you mind changing the order there?

Well, you know, it's up to the community, right? So what do you guys think?

Yeah, so one op at a time is correct, but direct IO should be listed last, because that's the one you really need to trust.

No, please: do direct IO first.
So btrfs has a direct IO iomap implementation, because the buffered one has been a huge pain: it involves completely changing how all of the readpage, writepage, and invalidatepage things work, and the writeback callbacks, and everything. Buffered for sure deserves to be last. Direct IO is a lot more straightforward, because the old code and the new code are very similar, and it just takes away a lot of the extra scaffolding I mentioned before.

Yeah, I think so too. The other thing that's probably worth inserting as a cautionary note here is that I am not sure trying to convert the simple filesystems first is a good idea, because some of the necessary infrastructure to make life a little bit less painful isn't quite all there yet. One of the really big examples is metadata reads and writes, right? We don't have a solution for that. First of all, many simple filesystems don't do direct IO at all, period. The AmigaDOS filesystem doesn't support direct IO; the HFS filesystem doesn't support direct IO, last I checked. So saying "do direct IO first" isn't really going to help them. What they do need is the ability to read and write metadata blocks, and until we actually give them an easy solution there, don't ask them to convert, or they'll run away screaming.

So I think there are two different things. One is converting the data path to iomap, and that can very much be done now, I believe. The other thing is getting rid of buffer heads, which is a separate discussion, and there I completely agree with you: we have to have some sane story for them.

Yes, but also I would like to point out that we shouldn't force anyone. That's the main idea behind it: we can't afford to force anyone. "Go convert this filesystem now"? Yeah, sure.
That's not going to work. But if you find someone who has enough interest to do the work, by all means let them do it and give them some hand-holding: here, have a look at this, it roughly tells you what you need to do. That is good enough. And then, if people care, they can see: all right, where are we, is my use case something I can do now, or is something missing? And if something is missing, say for the ISO filesystem, maybe somebody cares enough to convert it. If nobody does, then nobody gets a converted ISO filesystem. Tough shit.

Yeah. With iomap, just for your information, I'm going to merge for the next merge window, I have the patches already, the conversion of the ext2 direct IO path to iomap. So that's going to be queued, and it also changes some of the VFS stuff to make it easier for simple filesystems to convert to iomap. Because there are some problems with sync direct IO currently, where it's basically impossible to reuse the current VFS helpers with iomap. Complex filesystems like ext4, btrfs, or XFS don't care, because they do their own thing anyway, but the filesystems that mostly rely on the VFS helpers need changes. So that's being done and queued in my tree. We are also working with Ritesh on converting the ext2 buffered data path to iomap. I'm speaking only about the data path, because, as I said, I believe the generic story is there and it really only needs the conversion work. Then the second part is handling the metadata, and for that I don't have a good answer.
There's something I've been talking about with people here. With this FAT work I need something like this, and I have the read side working well enough. What we figured is that, in the end, for reading it doesn't really matter whether we read 512 bytes or an entire page, as long as we get a pointer to the data we've been looking for. So that's one thing we hope to provide: you can read a folio; you pass in the offset, saying this is the index I want to read; you get the folio back and a pointer to where the data starts, and then you have whatever you're looking for. I guess that should get us quite a long way, because that's roughly what we need for reads. Then we just need to figure out what we do for writing. Because really, writing should be done via writepages with modern filesystems; everything should go through writepages. But this is not quite how the original metadata writes worked: everything was synchronous, you would just write individual blocks and that was it. So we really have to look closely at whether that's feasible, because it means everything will be batched up and we write out whatever is there; we don't really have control over what exactly gets written. So that needs to be figured out, but let's see, we're getting there. And yes, you are completely right: once we have it, we have to put it up there in the documentation, to say this is the way you could do the metadata. That's the direction.

I'd maybe add a couple of things that I noticed while we are working on converting the ext2 buffered IO path as well. Apart from the ext2 directory handling code, and although I don't have concrete code to share at this point in time, I don't see big problems there.
Even moving ext2 over to buffered iomap doesn't look like a lot of trouble. Having said that, there are a couple of open problems in the iomap path which need to be addressed. One: there is already a patch series out there, and thanks to all the reviewers who are helping with it, which adds per-block dirty tracking within a folio, because otherwise you end up writing back blocks that were never dirtied.

Thank you, thank you, we need this.

Yeah. And the second one is something Matthew has pointed out. There is an existing problem for legacy filesystems: there is a flag on buffer heads called BH_Boundary. That was mainly used to make sure you don't get IO patterns that cause performance problems. What I'm trying to say is: for example, in filesystems like ext2, which use indirect blocks, file blocks 0 to 11 are direct, and then you have an indirect block which can come from a discontiguous block area. So when you submit a bio, if the requests get rearranged, you can end up with a disk access pattern which is not really good. That was the reason BH_Boundary got added as a flag on buffer heads. At this point in time, iomap doesn't have that, and I don't think there is a way to handle it. Maybe that would be the next problem to tackle after we do the dirty range tracking fix. I think this is one of the open problems where iomap might need something.
I think the good news is that that particular issue only affects filesystems that are actually using a V7-Unix-style indirect block layout. VFAT doesn't use that, and more modern filesystems that use extent mapping don't need it. So that's an example of something where iomap may eventually decide to support it, so that we can get higher performance for simple filesystems like ext2, Minix, or UFS; or there may be a decision that we don't care about performance for those older legacy filesystems, and we just live with a performance hit there. Again, I think this is an example of where I would certainly recommend, going back to this documentation, that there needs to be a big warning: here be dragons. There's an ongoing conversation, and this documentation may very well go out of date, because we're trying to make life easier for people. If things look really scary right now, hang on. Really, I think the target audience for this should be the early adopters who are willing to live with the fact that things are still sort of under construction, and we shouldn't promise that it's going to be easy, because it's not easy yet.

That was also my consideration, but at the end of the day, most of the old filesystems have outlived their usefulness; they're really just there for, let's say, legacy access. Like, say, VFAT: yes, you're not getting rid of it, because it's effectively baked into the firmware of systems like x86, but really you only care that you can access it. You don't really care whether it's fast, so I'd be perfectly happy with a conversion that makes it ever so slightly slower, because chances are no one will notice. Same goes for things like isofs. I mean, yeah, it might be slow, but it was slow to start with, so we're not losing anything by making it a tad slower if that's what's needed.
So I wouldn't let us be held up by performance considerations, because at the end of the day, if you care, write a decent filesystem.

Another thing is testing, right? Jan Kara had mentioned one of the LTP tests he ran to test the direct IO conversion, for instance. Are there any other tests folks have been focusing on?

Obviously fstests. There are some LTP direct IO tests, but fstests exercises that as well.

There was one other lingering curiosity when reviewing the state of the art of iomap and its adoption: why does btrfs only have the one conversion? And the reason, basically, is that it focused on direct IO, and there's a lot of work Goldwyn has been doing; if you look at his patches, you'll see a lot of work there, fixing a whole bunch of locking stuff. That work needs to be done first in order to start working on the different ops. Is Goldwyn on the line, by any chance?

Yes, I am online. But sorry, I came in late.

I'm just wondering if you want to share anything, curse iomap, or say anything about iomap at this point in time.

Well, just to tell you about the state: I'm working on it. We had the problem of taking extent locks within page locks; that ordering had to be reversed, which was quite a task, and once that was complete, we could start with the iomap work on btrfs. It is using a lot of hacks right now, at least in its current state, with page private being set by btrfs and then kind of overloaded by iomap with its own sub-page data structures. So there are a couple of hackish patches, but once it gets into review, hopefully I'll have better ideas once people start commenting. I hope to do a code drop by the end of this month, do the first part of it, and get some review comments on it. I'm pretty sure there'll be a lot of discussion with respect to how we go forward with that, but that's what my target deadline is right now.
I was just checking regarding the boundary handling, and I actually don't think we need to do anything special there. Basically, the boundary handling is a workaround for the fact that when you are mapping block by block, you don't know about the contiguous extent, and you don't know whether this is the last block in the contiguous extent or not. The boundary flag is there so that the filesystem can tell you: this is the last block in the contiguous extent, so just submit whatever you have, because I will need more time to get you anything else. But with iomap, we have this implicitly encoded in the fact that we simply return the whole contiguous extent. After each extent, we know that we are in the situation where now is likely the right time to submit. So, long story short, I believe that if we just ignore the boundary thing, we will be mostly fine.

Right, okay. Because of the way we map it, you're saying that the filesystem will always return the contiguous range, and so we might not even require BH_Boundary to be handled.

Yeah, exactly. I was checking how the boundary flag is actually used in the mpage and direct IO code, and it's used exactly in that sense: the next block is likely going to be discontiguous, so let's submit whatever I have, because I'll be seeking soon. With iomap, this is implicitly done by the fact that iomap_begin returns the whole extent already.

Yeah, thank you. Well, that's pretty much all I had. I guess I'd also like to ask folks to take a look at the documentation. Get an account on KernelNewbies.org, just request one; you won't be able to do any edits until you're in the editors group. It's a simple edit on a page once you're an editor, and we just have to add you to the list there, and then basically just go to town. So please review, especially those who are actively working on converting filesystems at this point in time.
Let us know what you think, or actually, don't email anyone yet: just request an account, email us once you have it, and ask to be added to the editors group; I think there's a tab on the left or something like that. Then you can just go to town and edit, and maybe in one or two kernel releases we can strive to get this upstream into the kernel as proper documentation.

That's unreasonable; sooner is probably better.

But isn't it easier to keep this in a wiki before it gets into git form? All right, so fuck it. I'll just try to do a patch that adds this right now, CC people, and that's it, we go to town there. All right, we'll do that. Thank you.