Hello. So we have a problem with copy-on-write filesystems with respect to memory usage, and it is primarily about shared extents. I'll give a quick spiel on what shared extents are. You can have two files sharing an extent; this can happen through clone file range, snapshots, or reflinks (a normal copy operation done with reflinks). With shared extents, the file inodes point to the same blocks on the device. In this example, file one has two extents, E1 and E2, with E2 being three blocks, and file two has E1 and the same three-block E2, which is shared. But when you read them into memory, each file reads the data individually, which means you get two copies of the second extent, one for file one and one for file two, with pages two, three, and four being read twice. So this is not just about memory; it's also about the cost of reading from the disk. With Btrfs, there is the additional cost of checksum verification, decompression, and things like that.

My idea is to start by having a device-level cache. I hate to call it the buffer cache, because that will give you nightmares, so let's call it the read device cache; it sits between the device and the page cache. I'll go through the different operations: read, write, direct I/O, and mmap.

Starting with the simplest: a buffered read would first check the inode's page cache. This is primarily so that, if you've read something and then written to it, you see your own write. If the data is not found in the regular page cache, then, probably in the read-page callback itself, you convert the file offset to a device offset and check the read cache before going to disk.

For buffered writes, the situation is a little different. A write is always performed in the inode's i_mapping, so in a situation where you have to read, say a partial write to a page where the rest must come from disk, you first check the read cache before going to disk to read the whole page.

The harder case is direct I/O, particularly reads. It rather defeats the purpose of direct I/O, because direct I/O means you don't keep the data in the page cache, yet here we would have a device-level cache for the inode's pages anyway. For a write, it would first need to check whether there's something in the read cache and use those pages; when I say it pulls the pages, it has to delete them from the read cache and then pull them out. Anything better? These ideas may be pretty outrageous.

I keep making the same mistake that you're making here, and Dave Chinner keeps setting me straight. When you're doing a direct I/O read, you actually have to go to the disk and do the read. There are a couple of reasons. One is that there is still such a thing as shared storage, so you actually need to read what the other machine wrote rather than what you have in your cache. The other is that some workloads use direct I/O to save CPU: having the CPU do a memcpy is exactly what they don't want. They want to go to the drive and have the DMA happen, because to that application, CPU is more important than bus bandwidth.

Yeah. To answer your question on shared storage, I believe that will partly be handled by cluster locking or something of that sort. So yes, in that case the cache would probably have to be invalidated across the cluster; beyond that, I can't say much.
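To make the lookup order concrete, here is a minimal sketch of how such a read-page callback might consult the shared read cache. filemap_get_folio() and the folio helpers are real kernel API; map_file_to_device(), shared_read_cache(), copy_cached_folio(), and read_from_disk_and_populate() are hypothetical placeholders, and locking and uptodate checks on the cached folio are elided.

```c
/*
 * Sketch of a ->read_folio-style path that consults a shared,
 * device-offset-indexed read cache before touching the disk.
 */
static int readcache_read_folio(struct file *file, struct folio *folio)
{
	struct inode *inode = file_inode(file);
	loff_t file_off = folio_pos(folio);
	struct address_space *rcache;
	struct folio *cached;
	loff_t dev_off;

	/*
	 * The inode's own page cache was already consulted by the
	 * generic read path before ->read_folio was called, so a
	 * page we wrote ourselves never gets here.
	 */

	/* Translate the file offset to a device offset (hypothetical). */
	dev_off = map_file_to_device(inode, file_off);

	/* Look for the block in the shared read cache. */
	rcache = shared_read_cache(inode->i_sb);
	cached = filemap_get_folio(rcache, dev_off >> PAGE_SHIFT);
	if (!IS_ERR(cached)) {	/* recent kernels return ERR_PTR on miss */
		/*
		 * Hit: copy the data instead of re-reading,
		 * re-checksumming, and re-decompressing it.
		 */
		copy_cached_folio(folio, cached);
		folio_put(cached);
		folio_mark_uptodate(folio);
		folio_unlock(folio);
		return 0;
	}

	/* Miss: real device read, then optionally populate the cache. */
	return read_from_disk_and_populate(inode, folio, dev_off);
}
```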
mmap is sort of a gray area for me, so I'm really not sure this would be the right way to go. The idea is that it starts as a read-only shared mapping, like a regular VM_SHARED mapping, and if there is a write, it goes through copy-on-write of the page, and we use that copy for writing to the mmap'ed page.

When I started off with this, I said we'd put the cache in the bdi, the backing device of the superblock. But eventually that may not be such a good idea, primarily because Btrfs has multiple devices. Not only that, we also have things like compression, which would skew the mapping between cached pages and the device. So it was suggested to add a special inode altogether, a shared inode, or we could stick it in the superblock, however it works out.

I have a couple of questions we could discuss. First of all, I know it's pretty outrageous to have all reads go into the read cache, so reads would probably have to be differentiated by whether they touch a shared extent or not. Should we put all reads into the read cache, or only shared extents?

So the way I envisioned doing this is that we have yet another special inode that does this. For Btrfs specifically, it's just mapped to our logical byte number addressing, so it's not tied to the block device itself; it's tied to our internal logical address space. And from there, it's just an inode. OOM and all that stuff doesn't become a problem, because the MM will come along and tell us to evict pages, and we'll evict pages, so that part is taken care of. And that's death to all mount options: this just happens.

Yeah, so that answers another of my questions, whether it should be a mount option or not. We should put it directly in the filesystem; it should always work.

And I think, at least for Btrfs, we want to always cache reads, because you never know when you're going to snapshot. If something is in the read cache already, clone file range can be called on it later, and then it becomes a shared extent automatically.

Right. And for the O_DIRECT case, you bring up a good point, but in general, there be dragons here, and I'd rather just invalidate the page cache and everything else for the byte range we're going to mess with. If you do O_DIRECT, you don't get the fancy sharing stuff, and I think that's okay, because O_DIRECT applications are going to be specialized anyway.

I don't see a problem with going the extra mile to evict pages from the read cache for O_DIRECT reads or writes. But that would also mean we'd have to translate offsets first. Which is fine, I believe, because direct I/O translates from file offsets to disk offsets anyway. Yes, I think it makes sense.

I think it's actually a little easier than that for direct I/O. You just ignore the cache mapping completely and read directly into whatever pages the application gave you. The only special case is writes: at the point we know what byte number we're writing to, we can just invalidate that range for the O_DIRECT write. That's reasonable. Right.
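As a rough illustration of that shape, here is a sketch of a per-filesystem shared read-cache inode indexed by the logical byte address, with the O_DIRECT write rule applied as a range shoot-down. new_inode(), mapping_set_gfp_mask(), and invalidate_inode_pages2_range() are real kernel helpers; the struct and function names are assumptions for illustration.

```c
/*
 * One internal inode per filesystem whose address_space is indexed
 * by the fs-logical byte address rather than by file offset.
 */
struct shared_read_cache {
	struct inode *inode;	/* owns the logical address_space */
};

static int shared_cache_init(struct super_block *sb,
			     struct shared_read_cache *cache)
{
	/*
	 * Never visible in the namespace, much like btrfs's btree
	 * inode for metadata.  No mount option: the cache is simply
	 * always there, and the MM reclaims its clean pages under
	 * memory pressure like any other page cache.
	 */
	cache->inode = new_inode(sb);
	if (!cache->inode)
		return -ENOMEM;
	mapping_set_gfp_mask(cache->inode->i_mapping, GFP_NOFS);
	return 0;
}

/*
 * O_DIRECT policy from the discussion: reads bypass the cache
 * entirely; writes invalidate the target logical range so the
 * cache can never serve stale data afterwards.
 */
static int shared_cache_shoot_down(struct shared_read_cache *cache,
				   u64 logical, u64 len)
{
	return invalidate_inode_pages2_range(cache->inode->i_mapping,
					logical >> PAGE_SHIFT,
					(logical + len - 1) >> PAGE_SHIFT);
}
```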
Actually, you don't even have to invalidate for ordinary writes, because for writes you're using the inode's own mapping, not the shared inode; you're using the write inode, whatever you want to call it. The read cache won't be stale, because the write is going to a different block completely.

In my head, we're just always keeping the read cache, but I guess if we evict, or unlink and reallocate, we'll evict the read cache at that point.

Yeah, I think the part I'm not explaining well is that when you're writing, you're not overwriting the shared block. You're writing to something new, because you've done the copy-on-write. So you don't have to worry about invalidating the shared block, because that's not where your data is going.

I know that's the simple case, but when it comes to nocow, you'll eventually have to evict something.

Don't do this with nocow. Make your life easy.

Okay. But in Btrfs you can have something like file-specific nocow; how do we handle that?

File-specific nocow breaks the sharing when you write to it. If you snapshot a nocow file, it's shared, yes, but when you write to it, it cows at that point, because we can't write into a shared nocow region. So a write to the shared nocow region will trigger a cow anyway, and then the second write will be nocow.

Okay. How about cleanup? When you're breaking this mapping, you can end up with a cached block pointing to nothing. Is that an issue?

So yes, that is another question I've raised: when do we flush the read cache pages? When the file is closed, does the data stay in memory? Of course, if an extent is shared, it is always possible that some other file is accessing it, or may access it in the future. So do we flush it when the inode is evicted, or keep it for later?

I think, for the beginning, we just don't manually flush anything; we let the MM evict pages for us. This is the same thing we have with the btree inode, right? We just have pages sitting there. We do have the benefit of being able to force the MM to invalidate pages when we free blocks that we know we're not using anymore, and we could definitely implement a scheme like that for this. But at the beginning, we're not going to OOM, because the MM is going to come in and tell us to invalidate and evict pages. And when OOM pressure comes, the read cache is the first to go.

What I would say is that we need to be sure that once we actually write in the private inode's address space, we invalidate those pages in the shared address space, because now we've made a cache alias. Once we've written to the private inode's address space, we need to kill the cache alias. Am I making sense?

But what about actual shared extents, where there is a reference from another file altogether?

In that case, what's happened is we had a shared extent and we read it into the cache. That shared extent was then deleted, because all the snapshots went away. But we still have the cache there, and we need to delete that cache alias before we write to it.

Right. So, Chris, what I was thinking is that you would actually have one of these special inodes per extent. And if the extent has gone away, then you delete that inode and the normal stuff happens.
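For the single-shared-mapping variant, the alias-killing step described above could be a simple range shoot-down; a minimal sketch, assuming the shared_read_cache structure from earlier. truncate_inode_pages_range() is the real kernel helper (byte offsets, inclusive end); the surrounding names are hypothetical.

```c
/*
 * Kill a cache alias by punching the logical range out of the shared
 * mapping.  The same shoot-down covers both cases discussed: a shared
 * extent being freed because all snapshots went away, and a CoW write
 * landing in the private inode's mapping while stale shared pages
 * still cover the old logical range.
 */
static void shared_cache_punch_hole(struct shared_read_cache *cache,
				    u64 logical, u64 len)
{
	truncate_inode_pages_range(cache->inode->i_mapping,
				   logical, logical + len - 1);
}
```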
So we could definitely do this: we know when we free the extents, right? At that point, we can do the same magic we do with our metadata. Okay, we freed this extent, now we punch a hole in our mapping for this range, and now we don't have the aliasing problem.

So do you mean we use a shared inode for each shared extent?

How expensive are your inodes these days? That'd be pretty... Isn't an inode basically free?

The extents can be as small as 4K, and I don't think on a multi-terabyte filesystem we want one inode per 4K.

Well, you don't want extents that small either. Yeah, that's always the challenge.

So I like the idea of the one address space per filesystem that uses the logical address space. That was what I had in my head, too.

Spoken like a man who doesn't understand how bad the radix tree is.

Well, that's also why the initial question was whether we should use this for every single read, regardless of whether or not it's shared. I would tend to say probably not. I would tend to use the private inode's address space as much as we possibly can and fall back to this for the shared case.

Right. And these are implementation details that are mostly specific to Btrfs.

You'd be surprised. I've been looking at this for XFS as well, and almost everything you're saying has some parallel in XFS. There are details here, but in the broad strokes, you could be talking about XFS.

Right. So I guess what I should say is that these are filesystem-specific implementation details that are going to look broadly the same across filesystems. And these are the easy parts, because the hard part is: how do we make the MM break the sharing? One of the things Roman has said is that you essentially have stacked mappings: you go to read, you check the read cache or whatever, and now you link the page in. And this is the problem: now you have a page linked into two different mappings.

Well, Roman's idea is that we just return the read page and don't actually link it into the mapping. But then mmap gets kind of screwed up, because you can't map the read cache page into the process, right?

Yeah. And then your filesystem gets called on a write fault, so at that point you know you need to allocate a page in the private inode.

Right, at page_mkwrite time, yes. But for a read mapping, the problem is the one we talked about earlier: get_user_pages can come in without telling the filesystem and mess with my read cache page that's supposed to be read-only, and we're not told until way later.

But we have FOLL_WRITE, so at the time you see a FOLL_WRITE, cow then.

Okay, I'll have to make sure that works, because, like I said, we have a bunch of machinery in place to catch this, and I know it still happens. So if I need to be doing something in Btrfs to make sure that that particular path can't happen, then hooray, I'll look into that. But there's a real question about what we do with a page that we want to share across multiple mappings, and how we make sure that the cow happens whenever we write to it through a specific write path, which I think is the easier part.
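For the write-path side of that question, here is a sketch of breaking the sharing at fault time, assuming read-cache pages are mapped read-only into the process. The ->page_mkwrite hook in vm_operations_struct and the VM_FAULT_* codes are real kernel API; folio_in_shared_cache() and cow_into_private_mapping() are hypothetical, and this does nothing for the get_user_pages case just raised.

```c
/*
 * Break the sharing on a write fault: if the faulting page belongs to
 * the shared read cache, CoW it into the private inode's mapping and
 * make the task re-fault onto the new, writable page.
 */
static vm_fault_t readcache_page_mkwrite(struct vm_fault *vmf)
{
	struct folio *folio = page_folio(vmf->page);

	if (folio_in_shared_cache(folio)) {	/* hypothetical test */
		/*
		 * Copy the data into a freshly allocated page in the
		 * private inode's i_mapping and unmap the shared page.
		 * VM_FAULT_NOPAGE makes the task re-fault and find the
		 * private copy.
		 */
		if (cow_into_private_mapping(vmf))	/* hypothetical */
			return VM_FAULT_OOM;
		return VM_FAULT_NOPAGE;
	}

	/* Already a private page: a normal mkwrite. */
	folio_lock(folio);
	folio_mark_dirty(folio);
	folio_wait_stable(folio);
	return VM_FAULT_LOCKED;
}
```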
But then how do we map it, and how do we reclaim it properly when we have a bunch of inodes referencing it, given that we only have one i_mapping, one mapping?

Oh, this sounds like a great opportunity for Willy. Please put up that slide you were working on earlier. Yes. See ya.