All right, I'm gonna go ahead and get started. I just added this because we started talking about it during Kent's session, and I figured it deserved a wider audience, because I have to re-explain this problem every couple of years when somebody discovers it.

So what happens with btrfs is that we have this concept called subvolumes. A subvolume has its own unique inode space, and the subvolume is the unit we can take a snapshot of. The common use case is that you have a subvolume for your home directory, so you can snapshot your home directories; you have a subvolume for the root file system, so you can snapshot the root file system. At Facebook we have subvolumes for the container images, so you can snapshot a container image and start a container right away from a clean image.

The problem is that a snapshot is literally just a metadata block that points at the existing blocks and adds reference counts. So you have the same files, the same data, and the same inode numbers. If I have a picture in my home directory and in my snapshot and I stat both of those files, I get the same inode number for both of them. This confuses applications; rsync was the main one we were concerned with at the time.

Chris came up with a relatively simple solution. The way rsync or find or anything else determines whether it has wandered into a different file system is a change in st_dev in the stat results. So we allocate an anonymous block device for every subvolume, and when you stat a file, we give you that anonymous device as the st_dev for that file. This is a really easy way to solve the problem, because now rsync says: okay, these two files, although they have the same inode numbers, are different, because they're on different file systems. When in reality they're on the same file system.

Now, Slava's asking if there are slides. There are no slides, because I wasn't planning on talking about this here, because it makes me very angry. Every time this comes up, people yell and complain about how terribly broken it is. Sure, it's not a great solution, but no other file system does this; there's no other file system that has duplicate inode numbers with a unique inode space per subvolume. In reality, what you have is a subvolume ID plus an inode number, and that is how we export things over NFS. In the NFS file handle, the unique ID is the object ID of the root that the inode belongs to, plus the inode number, and that combination is in fact unique. So when you do the file handle thing, and btrfs has its own file handle format like every file system does, you get the object ID and the inode number. This works really, really well. Ceph does this as well. Same inode number, same FS, cool.
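[To restate the file-handle point in code: within one btrfs file system, an inode is only unique as the pair (root object ID, inode number). A minimal sketch; the names are illustrative, not the actual btrfs or NFS layout.]

#include <stdbool.h>
#include <stdint.h>

/* The pair that actually identifies an object on a multi-subvolume
 * file system; st_ino alone is ambiguous across snapshots. */
struct fs_object_id {
	uint64_t root_objectid;	/* object ID of the subvolume's root */
	uint64_t ino;		/* inode number within that subvolume */
};

static bool same_object(struct fs_object_id a, struct fs_object_id b)
{
	return a.root_objectid == b.root_objectid && a.ino == b.ino;
}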
So anyway, this comes up every few years when somebody discovers it, decides it's super broken, tells me all the ways it would be easy to fix, and then realizes quickly, after a lot of emails, that it's not actually that easy to fix. The most recent one was, I keep blanking on his name... Neil Brown. Neil Brown discovered this. It's particularly problematic for NFS, because if you export the directory that contains the subvolume and its snapshots, you can have the same inode numbers for a bunch of different files, but NFS on the client loses the fact that these are in different subvolumes, so it can't change the st_dev on the client, because it doesn't know about them. By the time it gets over to the client, the client has lost all knowledge that this is a btrfs-backed file system.

There were a lot of things Neil tried in order to fix this, like XORing random bytes in, and XORing the object ID of the root into the inode number, and that ends up really terrible because you can still end up with the same number. What I want to do instead, and like Bruce said, this would probably work for NFSD, is to extend statx to tell you the subvolume, essentially the subvolume ID. Because the other problem you have is that if you stat a file, you get a different st_dev and you don't actually know whether that st_dev is on the same file system as another file in another snapshot. That's problematic in a different way, if you want to know whether two things live on the same file system.

So I want to extend statx to include two pieces of information. One is a UUID for the file system, saying this inode belongs to this FSID. We have the concept of FSIDs everywhere; we have it in blkid, we have it in all of these places. I want statx to encode the FSID of the containing file system. That way we can easily tell that two files in different subvolumes are part of the same file system. That part is for user space.

Then I want to add another field to statx, and this is where it gets a little fuzzy; we can do one of two things. We can add a u64 carrying the object ID of the root. That's unique, so NFS can say: okay, I know these two objects are distinct, and do whatever it wants with that. Alternatively, btrfs has UUIDs for all the subvolumes, so, keeping with the theme of UUIDs, we could give statx two new UUID fields: one the FSID for the file system as a whole, and the other the subvolume UUID, which again is unique within btrfs. Either way, NFS then has all the information it needs to decide whether an object is unique or not. So that's what I want to do.
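[A minimal user-space sketch of what consuming that statx extension could look like. A per-subvolume field along these lines, stx_subvol requested with STATX_SUBVOL, did later land upstream, so that's what's shown; the fs-wide UUID half of the proposal isn't in statx, so it's omitted. Assumes kernel and libc headers new enough to carry the field.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct statx sx;

	if (argc < 2 || statx(AT_FDCWD, argv[1], AT_SYMLINK_NOFOLLOW,
			      STATX_INO | STATX_SUBVOL, &sx))
		return 1;

	/* Compare (stx_subvol, stx_ino) pairs instead of (st_dev, st_ino):
	 * a file and its snapshot copy share stx_ino but not stx_subvol. */
	printf("%s: ino=%llu subvol=%llu\n", argv[1],
	       (unsigned long long)sx.stx_ino,
	       (unsigned long long)sx.stx_subvol);
	return 0;
}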
Question: in what way is that different from snapshotting ext4 on LVM thin? You still get snapshots of the file system on different block devices that are dynamically allocated. In what way is the use case, the user story, different?

So I think you still have the same problems with the ext4 snapshots, right? As far as I understand it, anybody that does snapshotting is going to have this problem. The only people doing snapshots that don't have this problem are LVM, or anybody doing it at the block layer, because then it's a completely separate block device. But with file-system-based snapshotting we're all going to have this problem.

Sorry, Layton is saying something. Layton says this means formalizing the concepts of file system and subvolume, and that may be a good thing, but we need to consider that other file systems might not fit neatly into those concepts. So that's what I'm kind of wondering. All of the local file systems that I pay attention to have FSIDs: ext4 has one, XFS I'm sure has one. I don't know about... who? Oh, does FAT? I don't think FAT does. But who gives a shit about that?

So, who was asking whether FAT does it? That's a good point, because an EFI USB stick you plug in isn't going to have a UUID to spit out. Does that matter for that case, though? You just simply don't have a value for it, and I think that's okay, right? Right. You're asking about UUID, right? Not FSID? Well, I'm using them interchangeably, but FSID is like the UUID for the entire file system, and then... Because there's a different FSID that's something else. Okay, let's let Ted go.

Yeah, I think some of the FAT file systems actually do have a 32-bit and maybe 64-bit FSID, it's just not a UUID-like thing. Right. And that has come up when people have proposed generic file system APIs for setting the UUID on a file system, if you want to change it, because not all file systems have UUIDs that are a standard 128-bit UUID. I think one of the bigger questions, though, is that saying two files are in the same file system but in different subvolumes kind of begs the question of what "in the same file system" means. One definition is that you're allowed to move a file with rename between two directories in the same file system, or that you're allowed to create hard links between two directories in the same file system. Do btrfs subvolumes allow that?

With btrfs subvolumes you can't rename and you can't hard link; you can reflink. Okay. Yeah, I think that's one of the things we'll want to be very clear about: what does it even mean for two directories to be in the same file system? Some people might think it means you can mv between them or hard link between them, and I think you have a different definition. I'm not sure you've clearly articulated why a user space application would care that two files are in the same file system. One reason is they want to do hard links; there may be others.

So for my use case, knowing that two files are in the same file system is mostly a maintenance thing: this file is in this file system and that file is in that file system, so I need to unmount this specific file system, and once I do that I'm good, or whatever. Different maintenance tasks where I can go: okay, these are in the same file system, and this is the file system I'm looking for.

What got me super excited was the reflink example. Whether we allow hard links someday seems less important, but the thing that jumped out at me, and although I'm working on Linux, the kernel, I'm in Azure, right, is the idea that you could have two exports and suddenly, with the change that went in a few weeks ago, we can now do a reflink in theory. That doesn't help everybody, because at the moment we don't allow that in the server, but it jumped out at me that we should ask our server guys to allow it.

You already support copy range between different file systems, and that's enough for btrfs to implement a reflink. You don't need to support the clone operation; copy_file_range will do it.
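[A rough sketch of that last point: copy_file_range(2) is an ordinary syscall from the caller's side, and the file system, or the NFS/SMB server underneath, is free to satisfy it with a reflink or a server-side copy instead of moving bytes.]

#define _GNU_SOURCE
#include <unistd.h>

/* Copy len bytes between two open files, letting the fs or the
 * server accelerate it (reflink, clone, server-side copy). */
ssize_t accel_copy(int in_fd, int out_fd, size_t len)
{
	ssize_t total = 0;

	while ((size_t)total < len) {
		ssize_t n = copy_file_range(in_fd, NULL, out_fd, NULL,
					    len - total, 0);
		if (n < 0)
			return -1;	/* caller can fall back to read/write */
		if (n == 0)
			break;		/* hit EOF on the source */
		total += n;
	}
	return total;
}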
And I think maybe Christoph had an objection, because we had already supported, like you said, the copy chunk, and I don't remember who it was, but a number of years ago we had an objection from somebody, so we chose the strictest form, which was basically the clone. Where cp, instead of using what Windows or a Mac would use to do a copy, which is offloading the copy to the server, would require doing a reflink. Well, that works now. For example, if I mount through Samba with the change that went in a few weeks ago, copies are instant, because cp does a reflink. Right, but Olga implemented copy_file_range across mounts for NFS, and that should work for btrfs and for CephFS as well. Yeah, exactly.

So I mean, I found the whole thing really exciting, but your point is valid about how you find out the FSID. I looked at the exfat code just a second ago, and it looks like they manufacture the FSID from the block device. Right. So it's a nice hack and it's probably good enough, right?

Yeah, this is mostly just to give the user more information, so it doesn't have to be perfect. First and foremost it's to give the user more accurate information, because right now I can stat a file and I can't tell where it's mounted; stat gives me the anonymous bdev, not the bdev of the actual mounted file system. So that's a problem, right? If I have the FSID, I just look at the FSID and I know where I'm mounted. And then the second big thing is exporting things across NFS or CIFS or whatever, where suddenly you get the same inode numbers for different files. Giving the NFS guys more information, this is the FSID, this is a different FSID or a different subvolume or whatever, lets them do what they need to do to make things unique.

Yeah, I remember it came up with something on the ksmbd, the kernel SMB server, thread. The whole FSID issue, maybe an email thread a few months ago about this. I think they advertise it properly now, but at the same time I don't think it's as important as the NFS example, because there you're exporting two shares that happen to be the same thing underneath.

That's the thing, it only confuses find or rsync or whatever. NFS itself handles it fine, because again it's using the file handle, and the file handle resolves to the unique object. It's purely user space that trips over this thing and doesn't know what to do. And NFS has the ability in the protocol to tag this extra information so that it can represent on the client: okay, these are two separate file systems; they could have overlapping inode numbers, don't treat them as the same thing. So I need a standard way to let NFS be able to do what NFS does. I don't care what they do; I just need a way for them to know how to do it, and a way to tell user space: hey look, this file belongs to this file system, so you can do all of the find-where-my-file-system-is or what-disk-is-it-on magic. And also a better way to articulate that a file is on a different subvolume from another file.

So I was wondering why LVM cloning, or whatever you call it, doesn't have this problem. Because doesn't that just copy the entire block device, so you get the FSID copied as well? Yeah, but you get a different block device.
So when you mount the snapshot or whatever, you get a unique... it's the same thing btrfs does now: you get a different st_dev. Also, will that be persistent across reboots? Yeah, it'll be persistent, because it uses the actual st_dev of the actual device, whereas we're using an anonymous one that's just random.

And the other thing I was wondering is: is statx the right place to put this, or do you want an extended statfs, effectively? Then we can add several UUIDs, or FSInfo, or whatever gets chosen for that, because then you have a UUID for your snapshot, a UUID for your base, your complete file system, and a UUID for the devices it's on. Yeah, I think some of this stuff, like the global FSID, could live in a statfs or whatever; it's just that you possibly don't want to cart all of this out on every stat system call. Right, but I do need a way in stat to say: this is the file system that I'm on. Or you need to be able to do that mapping back from a file, and stat is kind of the only interface for that, right? Stat, yeah.

I'm just still a little confused about the FSID. It seems the main argument that will be leveled against this is that it needs to be something that makes sense for all file systems. And even for btrfs, maybe it's just the terminology that's confusing me: if you say FSID, that doesn't really reflect the subvolume concept, right? A subvolume is not a file system, so it's not a file system ID. So I think it should be some generic UUID that file systems can fill in, or not fill in and simply not provide the information.

Well, all file systems have a UUID. So when I say FSID, what I mean is the FS-wide UUID. We have UUIDs, ext4 has UUIDs, and as far as I know XFS has UUIDs. When I say FSID, I mean the FS UUID, and btrfs takes it one step further in that every subvolume has its own UUID as well.

It still has an inherent double meaning, so to speak, because some file systems don't really use it as a file system UUID; they extend it to a different concept, which in this case is a subvolume. So, I mean, I could easily just use a u64. Having the UUID exportable... we can already get it from blkid, so we don't need to actually export that; having the UUID for the file system, we can already get through other means. The subvolume UUID makes it easier for user space to map back to what they're on. But the real issue is having a way to articulate that you're in a separate inode space.

One option you have, either with statx or maybe statfs: statfs already returns a field called f_fsid, right? Sometimes it's a digest of the UUID. Oh yeah. For other file systems, maybe like tmpfs, it's usually just the st_dev, but there's an identifier. Maybe, instead of adding things to statx, you could use the AT flags to say "at the root" or whatever, and then just get the FSID of the root. Okay. That's an option.
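[To make the statfs option concrete, a sketch that prints both identifiers for a path. On btrfs each subvolume gets its own anonymous st_dev from stat(2), and, per the fsnotify point right after this, the f_fsid inside a snapshot currently differs from the root's too; the __val access is the glibc representation of fsid_t.]

#include <stdio.h>
#include <sys/stat.h>
#include <sys/statfs.h>

static void show_ids(const char *path)
{
	struct stat st;
	struct statfs sfs;

	if (stat(path, &st) || statfs(path, &sfs))
		return;

	/* st_dev is the (anonymous, on btrfs) device number; f_fsid is
	 * the statfs-level identifier discussed above. */
	printf("%s: st_dev=%llx f_fsid=%x:%x\n", path,
	       (unsigned long long)st.st_dev,
	       (unsigned)sfs.f_fsid.__val[0],
	       (unsigned)sfs.f_fsid.__val[1]);
}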
One more thing, I don't know if you guys know, but a couple of years back I added the file system watch to fsnotify, the ability to watch events on a whole super block. btrfs is supported, but not a subvolume, because the FSID of an object inside the snapshot is different from the root's. So just, you know, that's one more issue that's out there: people cannot watch a subvolume. Okay, that's good to know, I didn't know that. This gets better and better.

Layton says there's no attribute for a subvolume in NFSv4; you'd need to run it through the IETF anyway. Yeah, Layton, I thought there was some way for you to articulate that this is on a different file system. Like Neil was saying he was going to have to do something awkward for NFSv2, but NFSv3 and v4 were going to be fine, as long as you have this other unique identifier, so it's that identifier plus the inode number. And I don't know how you guys would do it, but I know Neil was trying to mess with the inode numbers; if we could just give you a unique identifier that says, okay, this is the root, or whatever. I didn't understand why he couldn't just use our file handle, because we encode this in the file handle already. I mean, I have the information to give you; just tell me how you want me to give it to you. Yeah, exactly.

So it seems like even if you have one solution for NFS and a different solution for locally mounted files, where you modulate the device ID so that rsync does the right thing on a local file system, btrfs accessed as a local file system, you're still going to have the issue for rsync, or other POSIX-only programs that are only using st_dev and the inode number, on a file system that started out as btrfs with multiple subvolumes, got exported via NFS, and then rsync runs on the NFS client. So it seems like we still need to somehow do a best-effort attempt to unify the inode numbers seen by the NFS client, right? I mean, even if it's just a hash function, where 64 bits will be mostly unique or something. Because I think it's great to talk about things that work for programs that know how to use statx, but there still needs to be some sort of best effort for naive programs, or rsync on RHEL 7 that isn't using the fancy latest interface.

Yeah, so the st_dev thing is never going away, specifically for that case. What I'm talking about is the future: we inside Facebook want to more easily say this file belongs to this disk so we can do X things. The clean way to do that is to export the FSID. I don't know what NFS has as far as options for unifying inode numbers, but I know Neil was doing the XOR and all that stuff and he was just getting crucified for it, and there appeared to exist a way to give the client enough information to do it, as long as we had yet another unique identifier. And I've got the unique identifier; it's just a question of which one you want and how you want it.

Yeah, I think part of it is for those of us who have to deal with enterprise customers, who you're going to have to pry RHEL 7 out of their cold dead hands. Yeah, and the dev_t thing is a very elegant solution for this; it just solves it, at least for the local case. It doesn't solve it for NFS, but RHEL 7 doesn't support btrfs, so I don't have to think about them. But RHEL 7 could access btrfs over NFS, right? Accessed over NFS, yeah, that's true. Now, to be clear, my response to this historically has been: play stupid games, win stupid prizes. Don't export the subvolume and its snapshot over the same export; problem solved. However, that wasn't well received.

Could you maybe, do we have the time? Probably.
Could you maybe summarize the solutions that have been proposed and NAKed so far? So Neil tried a variety of different ways to send the root object ID across with some hashing, like the XOR stuff: in one version he just made up a random number to XOR with; he XORed the dev_t with the inode number; all of this stuff Christoph got really upset about. One thing Christoph always comes back with is that you need to make these subvolumes into VFS mounts, which doesn't make any sense, because you still have the same problem. He always says this and it makes zero sense.

Is he trying to say uniquify by mount ID? Yeah, so I guess that's kind of what he wants: a vfsmount for every subvolume. That is a non-starter, because a vfsmount has to have a unique super block. Now, that could clearly be changed; we are programmers here, we can relax that sort of thing. However, you're talking about an order of magnitude more vfsmounts. We have thousands of subvolumes, and you're talking about mounting all of these things up any time you walk into them and walk out of them. That's a no. Even the threads, right? You need threads per file system, so every mount would generate its own threads, which would be another nightmare to take care of.

Yeah, but there was another solution proposed, I believe around two years back. I believe Mark Fasheh was working on it, views, but I don't know how far it went. Oh, I remember that. It was basically the same concept, of having not super blocks but a lightweight super block called a view, which would basically be attached to each subvolume mount and would supply the dev or something. Yeah, that would be reasonable too. I'm just tired of talking about this. I think my solution is that we extend statfs and statx to give us more information, and then we let everybody else deal with that. I think that's the cleanest solution, because everything else is hacky. The dev_t thing is going to stay forever, because people are still running RHEL 4, but as far as how we move forward, I think exporting the UUID for the file system is really helpful, and exporting the subvolume UUID per file is also helpful.

The information you said you wanted to export through statx is already available to NFSD, to kernel NFSD. So the UUID is available and the root is available? It's not yet. NFSD goes through getattr, the kstat interface, and it's not in there, so it has no way of knowing that. This is why I'm talking about extending it. What do you mean? NFSD currently uses getattr, right, or whatever, to get this? Yeah, but it has the dentry. Oh, that's... yeah. But the dentry trees are all together; you just get the same... I don't know. Yeah, you can't tell. Yep. It calls getattr for the FS, that's what Layton said. So we need to stuff it somewhere so that somebody knows something. If we don't like the UUID, I can give you the u64 object ID of the root; that's unique. Right. And that's the thing: user space has uses for this information too. I mean, if enough people ask for it, once it's in kstat... Yeah, that's true. All right.
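[On the "we already encode this in the file handle" point from earlier: user space can fetch the same opaque handle NFS uses with name_to_handle_at(2). A sketch that just dumps the bytes; for btrfs the payload includes the root object ID and the inode number, but it's treated as opaque here.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

static void dump_handle(const char *path)
{
	struct file_handle *fh;
	int mount_id;
	unsigned int i;

	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh)
		return;
	fh->handle_bytes = MAX_HANDLE_SZ;

	/* Same opaque handle an NFS server would hand out for this path. */
	if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) == 0) {
		printf("%s: type=%d handle=", path, fh->handle_type);
		for (i = 0; i < fh->handle_bytes; i++)
			printf("%02x", fh->f_handle[i]);
		printf("\n");
	}
	free(fh);
}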
I think just the final point: it sounds like the main issue here will be to summarize the solutions that have already been proposed and why they were rejected, to build a solid base for why this solution should be accepted. Right, so the solutions that were suggested: Christoph's vfsmount thing... I mean, just do it when you send a new patch. Oh, when I send a new thing. Yeah. Okay, sounds good. No, Ted's got one last thing.

Yeah, there's an LWN article from, I think, August 2021; the title is something like "the btrfs inode number epic", which I think already summarizes all of these different solutions. So it may be sufficient just to reference the LWN article. Yeah, I forgot that you guys wrote that. So I'll link the article, but we're talking about exposing shit to user space here, and, you know, we'll be here a while.

Cool. So we're running over time and we've got a joint session with the IO track. Oh, sorry, Steve. So, you know, from my perspective, this is useful because it could affect the kernel server, maybe, and it's a useful topic. But what jumps out at me is stat, extended stat, statx: this isn't the only thing that's a 10-out-of-10, obviously-ought-to-be-added field. I think your idea of adding it is good, but there's got to be some synergy with a few other things. There are, I think, six flags that I could add that NTFS also supports that seem to make just as much sense and be just as important; you know, is a file offline, right? I mean, your app needs to... yeah, I don't know, but what I'm getting at is that I think this is a 10-out-of-10 important topic. So maybe we can talk about it tomorrow or the day after, because there are other flags that have a lot of value, like, for btrfs, integrity checking. Okay. And maybe we can talk about it more, but I think it would be a good thing to talk about whether there are at least two file systems that need this, whether there are at least two file systems that need integrity checking or offline, because statx is easy, right? We can just add flags (there's a sketch of the existing attribute flags below).

All right, let's... yeah, I see Omar. We're coming. All right, guys, we're moving over to the IO track. People on the virtual side, on the Zoom call, if you want to watch this next one, go click on the Zoom link for the IO track. That's where we'll be. Thanks, everybody.
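[The attribute-flags sketch referenced above, for Steve's closing point: statx already returns an stx_attributes bitmask plus a mask of which bits the file system can report at all, so per-file states like verity protection or compression are exposed this way today; an "offline" bit would be one more flag in the same style. Which STATX_ATTR_* constants are present depends on your header versions.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

static void show_attrs(const char *path)
{
	struct statx sx;

	if (statx(AT_FDCWD, path, 0, STATX_BASIC_STATS, &sx))
		return;

	/* stx_attributes_mask says which bits this fs can report at all. */
	if (sx.stx_attributes_mask & STATX_ATTR_VERITY)
		printf("verity: %s\n",
		       sx.stx_attributes & STATX_ATTR_VERITY ? "on" : "off");
	if (sx.stx_attributes_mask & STATX_ATTR_COMPRESSED)
		printf("compressed: %s\n",
		       sx.stx_attributes & STATX_ATTR_COMPRESSED ? "yes" : "no");
}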