So, this talk is about improving i_version, the field we have in struct inode that tracks changes. The motivation for this is NFS. NFSv3 relies heavily on watching timestamps, and Linux uses coarse-grained timestamps when it updates the ctime, mtime, or atime; those timestamps have around one jiffy of granularity. Of course, lots of things can happen in one jiffy, and that can lead to the client thinking it has an up-to-date cache when it really doesn't.

For v4, what they came up with was the idea of a change attribute. It's just an unsigned 64-bit value that has to change any time the ctime would be updated. It was originally considered a sort of opaque value, or at least clients should treat it as opaque, but there are some advantages to providing a value that increases monotonically. In particular, if the client sees an older, smaller i_version than the one it holds, it can know to throw its copy out because it's no longer valid. NFSv4.2 also gave us a way for the server to report what type of change attribute it has, so the client can make a decision about how to treat it; the "time metadata" type is one of those. Right now we pretty much always report the type as undefined. What we'd like to do is report it as monotonic, which is value zero.

So, when should it change? It should change any time the metadata change time, the ctime, would change. Some servers can ensure that the update of the change attribute is atomic with the change itself, but we can't do that in Linux. For a buffered write, we can bump i_version either before or after we copy the data into the page cache. Right now we bump it before the change is visible, and that's bad, because someone can race in with a getattr, see the new ctime or i_version value, and then do a read before we've had a chance to copy anything into the page cache. All of a sudden the client has associated the new i_version value with an old state of the cache, and that's bad. If we bump it after the change becomes visible, it's still a little racy, but at least the client should catch up pretty quickly and realize that its cache is out of date. Incrementing it both before and after is also acceptable.

As for the field itself, it's basically just a dedicated u64 in struct inode, and we have two flavors. If you're running something like an NFS or Ceph client, what we really want is to just copy whatever the server sends us; we don't want to try to manage the value ourselves. But for local filesystems we have a kernel-managed value: the kernel is responsible for bumping the i_version field whenever it goes to update the timestamps. We have infrastructure in the generic VFS to handle this for a lot of cases, and a filesystem can opt in just by setting SB_I_VERSION. We have this in btrfs, ext4, XFS, and tmpfs so far. The kernel-managed flavor is the really interesting one.
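As a rough sketch of that opt-in: the "examplefs" fill_super below is hypothetical, while SB_I_VERSION and the behavior of the generic update paths are the real kernel names as I understand them.

#include <linux/fs.h>
#include <linux/iversion.h>

/*
 * Sketch: a local filesystem opting in to the kernel-managed i_version.
 * With SB_I_VERSION set, the generic VFS paths that update the ctime
 * (e.g. file_update_time()) also take care of the i_version bump.
 */
static int examplefs_fill_super(struct super_block *sb, void *data, int silent)
{
        sb->s_flags |= SB_I_VERSION;
        /* ... the rest of the usual superblock setup ... */
        return 0;
}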
It's enabled in all the filesystems I just mentioned. It was also recently requested for GFS2, but they haven't done the work on that yet.

The original implementation of this was just a simple counter, but that turned out to be really costly for ext4 and XFS, because every time you incremented the thing you had to log it to disk. So in 2018 we changed to a new scheme where we use the lowest bit of the counter as a "queried" bit. Any time a getattr against the inode queries the i_version, we set this flag, and then we know that the next time we go to update the thing we have to bump the counter. Otherwise we can leave it alone: if no one has viewed it, there's no observable delta, so it's okay. With this we were able to recoup almost all of the performance we lost when the counter was first implemented (a standalone model of this scheme is sketched below). In fact, ext4 originally had a mount option for this, and we just recently deprecated it, because it can be on all the time now.

But it has problems. I did some work a while back and started noticing that I sometimes got weird cache invalidations, and I found that XFS and ext4 were both bumping it on atime updates, which is bad: you don't want to invalidate your cache just because somebody did a read. We fixed that in ext4 recently, but XFS uses its i_version counter for some other purposes and they don't want to make that change, so we don't have a solution for that just yet.

The counter is also currently bumped before copying to the page cache, so, as I mentioned, a client can race in and end up with an invalid cache. And then we also have the potential for lost updates due to crashes. NFSD doesn't wait until the value has been logged to disk before it starts presenting it to clients, so you can hand out an i_version value, crash, and have it roll backward. Even worse, you can then do another write, and now the client holds an i_version value it thinks is associated with the current state of the file, but it really isn't: it's a duplicate i_version that represents two different states of the file, and that's bad. What NFSD does is try to factor the file's ctime into the value to mitigate this: if the server crashes and comes back up, then presumably you get a different ctime that lets you distinguish the two states. But a clock rollback after the crash could still result in duplicate values; there's only so much we can do.

The other problem we've got with this thing is that it's really difficult to test, which is part of why the atime bug went unnoticed for so long. We don't really have good test cases in xfstests right now because there's no way to query the value from userland; currently you have to go through NFSD to see it.

So, about when we should bump it: queries of i_version have no locking around them, and we usually update it alongside the ctime. For directories, the ctime is usually updated after the operation, but for writes we usually do it before.
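Here's a minimal standalone model of the queried-flag scheme described above, loosely patterned on the kernel's include/linux/iversion.h (the real helpers are inode_query_iversion() and inode_maybe_inc_iversion(); this userspace version just illustrates the logic):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define IV_QUERIED   1ULL  /* low bit: the value has been handed out */
#define IV_INCREMENT 2ULL  /* counter lives in bits 1..63, so step by 2 */

static _Atomic uint64_t i_version;

/* getattr path: remember that the value was seen, return the counter */
static uint64_t query_iversion(void)
{
        uint64_t cur = atomic_load(&i_version), new;

        do {
                if (cur & IV_QUERIED)
                        break;          /* flag already set, nothing to do */
                new = cur | IV_QUERIED;
        } while (!atomic_compare_exchange_weak(&i_version, &cur, new));
        return cur >> 1;
}

/* update path (alongside a ctime update): bump only if someone looked */
static bool maybe_inc_iversion(bool force)
{
        uint64_t cur = atomic_load(&i_version), new;

        do {
                if (!force && !(cur & IV_QUERIED))
                        return false;   /* never queried: skip the bump */
                new = (cur & ~IV_QUERIED) + IV_INCREMENT;
        } while (!atomic_compare_exchange_weak(&i_version, &cur, new));
        return true;
}

int main(void)
{
        maybe_inc_iversion(false);                              /* no-op */
        printf("%llu\n", (unsigned long long)query_iversion()); /* 0 */
        maybe_inc_iversion(false);                              /* bumps */
        printf("%llu\n", (unsigned long long)query_iversion()); /* 1 */
        return 0;
}

The payoff is that an update only pays for a counter bump, and the journaling it implies, when some observer has actually sampled the value since the last bump.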
So, here are some possible solutions. One of the things we can do is separate the i_version update from the ctime update. Do we really want to change when we bump the ctime? I think there's a possibility of introducing new races if we do that, so I'm a little leery of it. But bumping the i_version after the operation would be fine as far as we're concerned.

Another thing we could do is just do an extra i_version bump every time: bump it before we do the operation, when we set the ctime, and do it again afterward. Because of the flag-based scheme that determines when we actually have to bump, that second bump will probably be a no-op in almost all cases; the only case where it isn't is when someone raced in with a getattr. And none of this is actually needed for XFS, since they serialize buffered reads against writes. I thought ext4 did too, but Jan pointed out that I'm wrong. So btrfs, ext4, and tmpfs probably need this extra bump afterward.

Crash resilience, that's a problem. One of the things that was bandied about on the list as we were discussing this a few months ago was that we could factor in a crash counter, which would probably have to be tracked by userland somehow. It's an idea. The problem is that we'd have to estimate how many potential increments we could lose: what's the maximum number of times we could bump the counter beyond the last value that was logged? There's a way we could do this, but it's not trivial. Anyway, these are things I'm looking at; I haven't gotten very far with this one yet.

Here's an idea that Dave Chinner actually came up with as we were discussing a lot of this: just use the ctime. We could use a similar scheme where we flag the ctime when someone queries it, and then we know that on the next ctime update we need to take a fine-grained timestamp. This one actually works pretty well; I have draft patches for it that I've sent to the list a couple of times. I do see some test failures with it currently. A lot of that is because some of our tests do things like fetch a coarse-grained timestamp, do an operation, then fetch another coarse-grained timestamp; with these patches, that last coarse-grained timestamp can look like it's from before your operation finished, and that causes a test failure. Those are really problems in the tests: they assume the filesystem uses coarse-grained timestamps, and with this we're breaking that assumption. But this is where I really want to go.

In the future, I want to make this eventually queryable from userland, because I think having an infinitely granular change attribute is potentially very useful for userland applications of all sorts. And what I'd really like to do is build a gated write: to be able to fetch the change attribute, do a read, modify the data, and then try to write it back, but only if the version hasn't changed, basically what you see in the pseudocode here.
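The slide's pseudocode isn't captured in the transcript; below is a sketch of its shape. Everything here is hypothetical: there is no userland interface for the change cookie today, and get_change_cookie() and write_gated() are imagined stand-ins, not real syscalls.

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

int get_change_cookie(int fd, uint64_t *cookie);           /* hypothetical */
int write_gated(int fd, const void *buf, size_t len,
                off_t off, uint64_t cookie);               /* hypothetical */
void modify(char *buf, size_t len);                        /* app-specific */

int update_file(int fd, char *buf, size_t len)
{
        for (;;) {
                uint64_t cookie;

                if (get_change_cookie(fd, &cookie))        /* 1. fetch version */
                        return -1;
                if (pread(fd, buf, len, 0) < 0)            /* 2. read */
                        return -1;
                modify(buf, len);                          /* 3. modify */

                /* 4. write back only if the file hasn't changed since
                 * step 1; on a cookie mismatch, loop and retry */
                if (write_gated(fd, buf, len, 0, cookie) == 0)
                        return 0;
        }
}

This is the lock-free, retry-on-conflict style of update: no lock is held across the read-modify-write cycle.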
It's kind of like what Jan was saying about the read case, where we would just re-poll the thing every time. The nice thing about this is that I used a very similar scheme in some work I did in Ceph a while back, and I really liked it because I didn't have to use any locking. In a distributed situation, trying to do locking is nasty.

Yes, we could totally implement this with NFS's VERIFY. Though if you're doing this, perhaps you're better off if i_version isn't incremented for things like chmod or utimes?

Yeah. Unfortunately, our semantics are kind of set, at least as far as NFS goes, so I think we'd have a tough time. What you want is an AFS-style change attribute, and we don't have that. So yes, you might get some false positives if that were to happen, but my feeling is that's not the worst problem.

What if we decoupled this change cookie completely from i_version and combined that with the crash counter idea we discussed earlier? In that case the change cookie could simply be an in-memory value. Well, I guess we'd have to worry a little about what happens if the inode gets pushed out due to memory pressure. But in some ways, if we didn't actually have to write it out, or could rely mostly on the ctime, we'd avoid some of the extra overhead of having this extra thing that the filesystem has to track, with its own unique semantics about which metadata changes, like chmod, are included, yada yada.

I think that would be fine if this were all we wanted to do with it. The problem is that NFSD still wants to use that value too, and if you were to crash and come back, then all of a sudden... we can't keep a really ephemeral one, for that reason, I think.

I think I might have already mentioned this on the list: the only thing I find important is that when we expose this cookie, it has clearly defined, and hopefully consistent, semantics. When I see stuff like this, where it's not really clear what it indicates, I always have f_fsid in statfs in mind, where literally nobody, not even the man page, knows what it's supposed to mean.

Yeah, that's fair criticism. I've been trying to define this, and unfortunately POSIX doesn't cover it; this is totally a thing that AFS and NFS cooked up. But I think we can do that. The semantics we've come up with are basically that any time you would change the ctime, you need to increment this thing, and that is pretty well defined by POSIX. So I think we actually know what it represents and when it needs to be incremented. But beyond that, yes, you're right: we do need to document it before we ever expose this thing. Absolutely.
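To make the crash-counter idea concrete, here's one purely illustrative way it could be folded into the 64-bit value. The field split, and the notion of userland bumping the counter after every unclean shutdown, are my assumptions, not something settled in the discussion.

#include <stdint.h>

/*
 * Illustrative only: reserve the top bits of the change attribute for a
 * crash counter bumped after every unclean shutdown. Any value handed
 * out before a crash then differs from every value generated after it,
 * even if the low counter rolled back because it was never logged.
 */
#define CRASH_BITS   16
#define COUNTER_BITS (64 - CRASH_BITS)
#define COUNTER_MASK ((UINT64_C(1) << COUNTER_BITS) - 1)

static inline uint64_t make_change_attr(uint16_t crash_count, uint64_t counter)
{
        return ((uint64_t)crash_count << COUNTER_BITS) | (counter & COUNTER_MASK);
}

This still leaves the sizing question from the talk: you would have to bound how many increments could be lost between the last value handed out and the last one logged.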
One thing that might mitigate against double-incrementing it: with the way AFS does things, you do a write to the server and you get back a new data version. If it's the old version plus one, you know there was no interference with your write. If you double-increment sometimes and not other times, you don't have that guarantee anymore.

Well, we don't really have that guarantee with NFS anyway; it just changes. And I don't think AFS's change attribute semantics and NFS's really mesh. AFS has a change attribute too, but it's only bumped on content changes; it doesn't care about metadata changes. And it's bumped exactly once for every write, which we can't guarantee.

I don't have the exact discussion in mind anymore, so excuse me if I'm asking a stupid question, but when an NFS server crashes, is it possible to detect that? Basically, why isn't it possible to just say: okay, I need to re-query i_version after a crash?

That's potentially something we could do. The danger is this: say you have an i_version of two, then you do a write and get an i_version of three. Now you crash, and that i_version of three never made it to disk. The server comes back up with an i_version of two, while the client still holds three. Now someone else races in and does a completely different write, and we have an i_version of three again, handed out in between. That's the real danger: the rolling back is not so bad; it's that you can roll forward again and get a duplicate value that represents a different state of the file. Mitigating crashes is a really tough part of this, and I don't really have great solutions just yet; I'm open to suggestions.

Really, you need a server-level "last time I booted" or "last time I crashed" thing, so you can just say: wipe the cache.

Even then, that's nasty for performance. The crash mitigation part is actually really tough to deal with, and we only have some hand-wavy ideas so far. Jan had a good one, the crash counter thing, which we could totally do, but we need to sit down and really work it out.

I think AFS has something like that per volume: if the volume has to be recovered, update the timestamp on it and wipe everything. But they don't have to deal with random filesystems. And I think the other reason AFS is a lot simpler is that you don't have to worry about local writes. Exactly: that's why they can guarantee it bumps by one when a user makes a change; there's no local write to interfere.

One thing I was going to note here is that we already have a boot UUID that gets regenerated each time the system reboots. We've actually had that for a long time; it comes out of the random driver. Each time the system boots, we generate a random 16-byte UUID. You can fetch it; it's like a global variable that never changes until you reboot. Now, it's not a counter, it's 16 bytes, but you could hash it somehow.
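For reference, that boot UUID is readable from /proc/sys/kernel/random/boot_id. Here's a small sketch of "hash it somehow"; the FNV-1a fold is an arbitrary choice of mine, and how the result would actually be mixed into a change attribute is left open in the discussion.

#include <stdint.h>
#include <stdio.h>

/* Fold the textual boot UUID down to 64 bits with FNV-1a. */
static int boot_id_hash(uint64_t *out)
{
        char buf[64];
        uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */
        FILE *f = fopen("/proc/sys/kernel/random/boot_id", "r");

        if (!f)
                return -1;
        if (!fgets(buf, sizeof(buf), f)) {
                fclose(f);
                return -1;
        }
        fclose(f);

        for (size_t i = 0; buf[i] && buf[i] != '\n'; i++) {
                h ^= (unsigned char)buf[i];
                h *= 1099511628211ULL;          /* FNV-1a prime */
        }
        *out = h;
        return 0;
}

int main(void)
{
        uint64_t h;

        if (boot_id_hash(&h))
                return 1;
        printf("boot id hash: %016llx\n", (unsigned long long)h);
        return 0;
}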
So I'm not sure how useful it would be in the context of fitting into the existing NFS protocol, but there are easy ways you could have the NFS server tell that we've since rebooted: keep the previous boot UUID somewhere and notice that it has changed.

That's a good idea. Thanks, Ted. I'll take a look at that. Anybody else have questions or comments?

I might be raining on your parade, but one comment I've been hearing multiple times in the room is that we're not sure what the semantics of this are, and NFS's treatment of the attribute is not the same as AFS's. So I'm wondering if we might not be better off if each of these protocols handled this in its own way. You mentioned wanting this exposed to user space because the only way to test it was through NFSD; going over a loopback mount or using pynfs seems like a good way of testing that it does exactly what NFS needs. I'm just wondering if maybe we want to rethink the idea of a common implementation for all filesystems. And one thing not mentioned here is the performance overhead if everybody did this themselves versus it being done by the filesystems. Could you talk a little bit to that?

So you're talking about having the filesystems manage the i_version, or maybe the VFS?

Well, what would the performance overhead be if NFSD did this for itself and AFS did it for itself? Because the semantics aren't the same.

I don't think we can really do that because, again, we have to deal with local writes. NFSD doesn't know if someone has walked in and started doing writes directly to the filesystem.

One thing that may help me understand the context a little better: when I grep for i_version, I see XFS manipulating it. Do other filesystems, Ceph, ext4, and the rest, basically ignore it? It looked like there were some libfs helper routines that mess with it, so at first glance it wasn't clear to me whether the VFS bumps i_version without the filesystem having to do anything, or whether this really is a flag that only three filesystems bump counters on.

If you set SB_I_VERSION in your filesystem, then the VFS will do it for you: whenever it goes to update the ctime, it will take care of the i_version as well.

I looked at that, and I'll go look at it again, but I didn't think there were very many that set it. (There aren't.) And it seems like if it's low-risk and costs nothing, turn it on. But what puzzled me is, if it's really just three filesystems that set it, what happens when NFSD is exporting something like overlayfs?

It manufactures the change attribute from the ctime in that case. That's one of the reasons I'm going to this multigrain timestamp idea. For instance, the XFS folks don't want to put a new field in their on-disk inode, but if we can do this, we can still manufacture a change attribute from the ctime, and with fine-grained timestamps it's good enough.

Having this be a helper and done in the VFS seems fine to me. Three filesystems, okay; what's the right number for something to move up to the VFS? That's fine.
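A sketch of what "manufacture a change attribute from the ctime" can look like, in the spirit of what nfsd does for filesystems that don't maintain i_version (the exact packing below is illustrative, not copied from fs/nfsd):

#include <stdint.h>
#include <time.h>

/*
 * Pack a ctime into one 64-bit value: seconds in the high bits,
 * nanoseconds below. tv_nsec < 10^9 < 2^30, so the fields never
 * collide and any ctime change produces a different value. With
 * coarse-grained timestamps this is too lossy; with fine-grained
 * (multigrain) ctimes it becomes a usable change attribute.
 */
static uint64_t ctime_to_change_attr(const struct timespec *ctime)
{
        return ((uint64_t)ctime->tv_sec << 30) | (uint64_t)ctime->tv_nsec;
}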
The only point I was trying to make, just to clarify, about the statx change cookie being part of the user-space API: users have a difficult time making sense of a value exposed to them that has different meanings for very different things. That's the main problem. If there's suddenly a statx change cookie, I open up the man page and I see one entry for XFS, one for NFS, and one for AFS, and I have to read through all of that and conclude: okay, this means three different things. That's the thing I'm concerned about.

I don't think that's quite the situation. In general, all the local filesystems that use SB_I_VERSION have the same semantics, and those semantics match what NFS uses. The only real odd man out is AFS, so I probably just won't wire up AFS: if we ever were to present this to userland, I just probably wouldn't wire it up for AFS. Or, we have the statx attribute flags: we already have one where the filesystem can claim that its change cookie is monotonic. We could add one that says this is an AFS-style change cookie: monotonic, but not covering metadata changes. Then maybe userland can sort through that, look at it and say: oh, I need metadata changes, I can't use this.

So basically the only thing we really want this for is for NFSD to export? Currently, yes, but eventually I would like to use it more widely.

If you're going to expose it to user space, what's the minimum promise to user space that you need to make? That it will be bumped any time the ctime would change. Is that the minimum? Well, what would user space be using it for: detecting changes to the file? Could it be sufficient to say any time the mtime changes?

No, because if I do a setxattr on the file, for example, the ctime changes but the mtime doesn't. You're essentially arguing that we should adopt AFS semantics, and I would be opposed to that idea: NFS wants this, and we already have semantics in all the other filesystems that do it a different way. AFS had it first, but their semantics don't match what we need for NFS. The other thing I want to use this for, from the NFS client's point of view, is tagging the local cache, which means if someone does a chmod, I have to invalidate the local cache. Which is unfortunate; that part is a bummer, but those kinds of changes are pretty rare.

But actually, if you do these multigrain timestamps, then user space can basically just use the ctime to detect changes and doesn't have to bother with some obscure statx field. Yeah, and in fact I'm liking this better, because it means it works for NFSv3 too. It's also much easier to understand for a user-space application. And in theory most user-space applications should be fine with the change in semantics here, too, I think. So I see myself probably implementing this kernel-wide eventually. I'm probably going to do it piecemeal, per filesystem, to start, but I imagine most kernel filesystems will want to adopt the scheme.
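Here's a small standalone model of that multigrain scheme, using the same queried-flag trick as i_version. The choice of bit 30 of tv_nsec as the flag is my assumption for illustration (tv_nsec stays below 10^9, which is less than 2^30, so a real timestamp never sets that bit), and locking is omitted; real patches would have to handle concurrency.

#include <time.h>

#define CTIME_QUERIED (1L << 30)  /* illustrative: tv_nsec < 10^9 < 2^30 */

struct mg_inode {
        struct timespec ctime;
};

/* getattr path: hand out the ctime and remember that it was seen */
static struct timespec query_ctime(struct mg_inode *ino)
{
        struct timespec ts = ino->ctime;

        ino->ctime.tv_nsec |= CTIME_QUERIED;
        ts.tv_nsec &= ~CTIME_QUERIED;
        return ts;
}

/* update path: pay for a fine-grained stamp only if someone looked */
static void update_ctime(struct mg_inode *ino)
{
        struct timespec now;

        if (ino->ctime.tv_nsec & CTIME_QUERIED)
                clock_gettime(CLOCK_REALTIME, &now);        /* fine-grained */
        else
                clock_gettime(CLOCK_REALTIME_COARSE, &now); /* coarse, cheap */
        ino->ctime = now;
}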
So my point is that if we do this, I don't see such a need to implement the statx change cookie anymore. Yeah, we may not need it after that; its utility would be very limited.

I totally agree with what you're saying, if the question is whether 100 nanoseconds is good enough. But why wouldn't you use mtime instead of ctime, if you're worried about the chmod example? Is mtime good enough? Everybody can explain to user space what mtime and ctime are, but why isn't mtime good enough for 99.9% of cases?

It's fine; it really is just the question of whether 100 nanoseconds is good enough. If I do a write and the new version is the old version plus one, I know exactly what changed; with timestamps alone I don't know that. The timestamps are not enough information.

One other really quick thing: earlier in the slides you said that the kernel only uses second-level granularity. That's true for the old stat system call, but stat64 uses a timespec, and in fact most of the major filesystems are providing sub-second granularity anyway; it's not just XFS. Because, you know, make can compile an awful lot of files in one second. Oh, and the other thing: the reason people like ctime and not mtime is that you can fake mtime using the utimes system call and warp it, whereas you can't warp ctime.

The next session is actually the BPF cross-track session, but they're not on time; I can see on Zoom that they're not on time. I'll go check on the timing, but you can keep going.

I'm kind of coming around to thinking that, yeah, I'm probably going to move most of this stuff to using the multigrain timestamps now. Again, I'm in early days on this, but I think that's probably where we want to go. Unfortunately, I have to fix a bunch of test cases, it seems.

So for internal NFS use, I can totally see you doing the i_version thing, but for user space, I believe the multigrain ctime should be enough. Sounds good to me.
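To illustrate that last point about warping mtime, here's a small userspace demonstration (my example, not from the talk; it uses utimensat(), the modern form of utimes):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/*
 * mtime can be set to an arbitrary value from userland; ctime cannot
 * be set directly and only moves as a side effect of changes, which
 * is why a change attribute wants to be derived from the ctime.
 */
int main(int argc, char **argv)
{
        struct timespec times[2] = {
                { .tv_nsec = UTIME_OMIT },      /* leave atime alone */
                { .tv_sec = 0, .tv_nsec = 0 },  /* warp mtime to 1970 */
        };

        if (argc < 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        if (utimensat(AT_FDCWD, argv[1], times, 0)) {
                perror("utimensat");
                return 1;
        }
        /* stat(1) now shows the warped 1970 mtime, while the ctime was
         * bumped to the current time by this very call. */
        return 0;
}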