My talk is named "fanotify HSM API" and I will explain the name, but what I really want to get to is a problem and a solution that Jan and I have been working on, and I want to give you some background on how we got to that problem. HSM is an acronym for Hierarchical Storage Management, which is a very, very old term for offloading files from disk to tape, and fanotify is how I want to create a modern HSM implementation. I have a wiki page on that. Maybe first a bit about the motivation: essentially, you want to offload files to something like cloud storage. My employer, CTERA Networks, builds cloud gateway solutions, which are caches in front of cloud storage. Customers used to have a local NAS, but they didn't used to have a cloud backup for it. Now it's replaced with a fast local NAS serving many local users, backed by a much larger and much cheaper cloud backend store, a slower tier. The local cache, the local NAS, doesn't have the disk space to hold the whole cloud namespace, just parts of it, so it needs to page files in and out from the cloud. That's the general use case. Windows is an example of an operating system that has an API for this. If you know Windows a bit, if you've seen Google Drive or OneDrive, they show these icons of a file with a cloud, meaning the file is not available locally.
If you access the file, it will be downloaded from the cloud, and the operating system provides some infrastructure for that: a placeholder file is marked with something called a reparse point, which is a persistent marker on the file, and when the file is accessed a driver gets called — that's the Projected File System — and that driver has an upcall to a cloud sync engine, which could be Google Drive, Microsoft OneDrive or whatever. So Windows has the infrastructure for that. Linux doesn't have anything inherent for this, so products that do this sort of thing, like CTERA's and others, use FUSE — a common implementation for gating access to files and getting them from the cloud or from local storage. We use FUSE, and that comes with all the problems we've heard about in the two sessions before me. I hope those problems will be solved; BPF for faster readdir and the like can only help, because honestly I don't think we are going to be able to get rid of FUSE for all of our use cases. But what I'm going to talk about here is not how to get rid of FUSE, but a different alternative for the same use case, using fanotify. DMAPI is an API that Unix systems used to have; Linux only had it in XFS for a while. It's an old API from the tape ages — not enough for modern-day use, for offloading to the cloud — but remnants of it exist; punch hole, for example, is a remnant of DMAPI. In 2010 the DMAPI hooks, the callbacks to user space, were removed from XFS, with this comment: if we'd ever get HSM support in mainline, at least the namespace events could be done much more easily, say in the VFS, instead of in individual file systems. So this is what I'm trying to do now. This was like a déjà vu for me, even the DMAPI stuff, but what I was trying to remember is — if I remember correctly, there were basically three states a file could have.
The file is present locally; the file is not present locally but you know its name, or maybe you only know its name and that it's a file; and in between, you had enough to satisfy a stat request — you know its creation time, its file size and its timestamps. I didn't study the DMAPI standard. I think that's what Windows does, the Windows API you referred to. Yeah, but for our solution we use placeholder files that are just sparse files. First, when you access a directory that wasn't available locally, the directory is filled with sparse files that have all the metadata but none of the data, and when you access the data, the data is downloaded. So basically we have two states; I think Windows had three, but it doesn't matter. You have two states: you know the name of the file and it's sparse, or you have its data and everything. There are different implementations with different states, but that's the basic idea. So here is an implementation of an HSM engine using upstream fanotify. It can be done, because fanotify was introduced for the anti-virus use case, so it knows how to intercept access to files in order to scan them for viruses. You can intercept the open permission event, and if the file is a sparse placeholder, you can use that open to fill in the content from the cloud. And for evicting the content of a file I have a POC; it's pretty simple: take an exclusive write lease on the file and punch a hole. Everything works. I mean, it's a very naive HSM, but it works. It has many limitations — for example, you'd have to fill the entire movie that you want to watch at open time, so it's not really practical for modern use — but it works. The other part of the DMAPI, or HSM, API is monitoring which files were modified — which files are dirty and need to be uploaded to the cloud.
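The naive fill/evict cycle described above — a placeholder is just a sparse file, and eviction is a punch-hole over the whole file — can be sketched roughly like this. The helper names are mine, purely illustrative, and error handling is trimmed for brevity:

```c
/* Sketch of the naive HSM placeholder cycle from the talk: a file with
 * no data blocks is a placeholder; eviction punches a hole over the
 * whole file, keeping size and metadata. Helper names are illustrative. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>

/* A file is a "placeholder" if it has no data at all: SEEK_DATA from
 * offset 0 finds nothing and fails with ENXIO. */
static int is_placeholder(int fd)
{
        return lseek(fd, 0, SEEK_DATA) < 0 && errno == ENXIO;
}

/* Evict: drop all data blocks but keep the file size.  Falls back to
 * truncate-and-extend if the filesystem lacks punch-hole support. */
static int evict(int fd)
{
        struct stat st;

        if (fstat(fd, &st))
                return -1;
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      0, st.st_size) == 0)
                return 0;
        if (errno != EOPNOTSUPP)
                return -1;
        return ftruncate(fd, 0) || ftruncate(fd, st.st_size);
}

int demo(const char *path)
{
        char buf[4096];
        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return -1;
        memset(buf, 'x', sizeof(buf));
        /* "fill" the file: now it has data, so it is not a placeholder */
        if (pwrite(fd, buf, sizeof(buf), 0) != sizeof(buf) || fsync(fd))
                return -1;
        if (is_placeholder(fd))
                return -1;
        /* evict: same size and metadata, but no data anywhere */
        if (evict(fd) || !is_placeholder(fd))
                return -1;
        close(fd);
        return 0;
}
```

A real engine would additionally take an exclusive write lease (`F_SETLEASE`) before evicting, as the talk mentions, so concurrent opens are fenced off; that part is omitted here.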
I guess Windows has that internally with NTFS anyway, but with fanotify, at least since 4.20 or 5.1, you can watch an entire file system for modification events, and this is important for our use case because we typically deal with many millions of files. Even only in the cache, the fast tier, there could be billions of files, and of course the slower tier may have a larger namespace, so it's important to have some large-scale way to monitor for changes. So, toward facilitating a modern HSM, what's in this slide is POC patches. They are not upstream yet, although the last one is. I've posted them, or maybe just posted the link, but they're simple — not very controversial, small changes to fanotify that make it usable for a good HSM implementation. The first one is a lookup permission event, which allows you to populate a directory on demand when it's not available locally. "Report access range" just augments the access permission event with range information, so when you're watching a movie file from the cloud you get a notification for access to a specific range and you can fill just that range. Very simple. The last one, evictable marks, I'm not going to talk about right now, but it's already upstream; it has to do with scaling fanotify to deal with a lot of files. And the one before last, crash-safe change tracking — that's something I've been working on for a long time now. I've had several solutions, and I've talked about them at past conferences; I had overlay snapshots, and then a different version of overlay snapshots using fsnotify as the hooks. But essentially it boils down to this: in addition to the open permission event and the access permission event, you need pre-create and pre-rename — pre-namespace-change — events.
You need pre-create, pre-delete — these sorts of blocking events — in order to track the changes in a crash-consistent way, whether in a database or whatever. What I do internally in our product is store the markers saying "this directory has been changed" inside the same file system, so I get a crash-consistent state between the file system and the change markers. I just use directories: I do a mkdir of the ID of the changed directory as an indication that something has changed in it. So I get crash-consistent safety, at least on XFS, for that part for free. This is a demo, which I don't have time to show, utilizing the extensions I've just presented; I want this slide just to show an example of what I did. I looked for a simple upstream FUSE implementation that falls under the category of HSM, and I picked one: httpdirfs. It's a read-only HSM. It takes a URL of an HTTP site, and as you access it, it lazily downloads the index listing and lazily fills in the files, even ranges of files. So if you run the command on the second line, you mount the FUSE file system over the kernel.org repository, and then you extract the head of the tarball of this firmware and list the beginning of the tarball's contents. It lazily populates the directories on the path from the website into the local cache directory, then downloads just the first megabyte of the tarball and uses the local cache to open it. I took this as is, and it works — if you run this example with upstream httpdirfs, it does everything lazily. I just added a --fanotify flag and replaced the FUSE implementation with a drop-in replacement built on fanotify hooks. When you pass --fanotify, /www is no longer a FUSE mount; it's a bind mount of the /cache directory, but the bind mount at /www has a fanotify mark on it.
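The mkdir-based change journal described above — a marker directory named after the changed directory's ID, created in the same file system so marker and change are crash-consistent together — might look roughly like this. The marker layout and names are illustrative, not the actual product format:

```c
/* Sketch of the in-filesystem change journal from the talk: recording
 * "directory <id> changed" is a mkdir of a marker named after the
 * directory's inode number, inside the same filesystem, so the marker
 * and the change reach disk crash-consistently together.
 * The layout is illustrative, not a real product format. */
#define _GNU_SOURCE
#include <stdio.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* markers_root must live on the same filesystem as the watched tree */
int record_dir_change(const char *markers_root, const char *changed_dir)
{
        struct stat st;
        char marker[4096];

        if (stat(changed_dir, &st))
                return -1;
        snprintf(marker, sizeof(marker), "%s/%llu", markers_root,
                 (unsigned long long)st.st_ino);
        /* mkdir is atomic and journaled with the fs: EEXIST just means
         * the change was already recorded in this tracking period. */
        if (mkdir(marker, 0700) && errno != EEXIST)
                return -1;
        return 0;
}
```

At the end of a tracking period the engine would walk the markers directory, upload the dirty directories, and remove the markers — and because mkdir is idempotent here, recording the same change twice costs nothing.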
So if you access through /www — ls /www — it intercepts the access permission event or the lookup permission event and fills the directory. So you get the same behavior, but without FUSE, for this use case. Anything to say here? No, /www is just a bind mount to /cache, and /cache starts out empty: if you do ls on it, you see nothing. You do ls on /www, suddenly you see something, and then you do ls on /cache and you see the same thing. If we have time, I can show the demo. But I wanted to get not to this slide, which is complex, but to the next one, which is more complex. I want to try to get you to understand the problem, in order to be able to sell the solution. So this is not so complicated — maybe I made it complicated — but this is fanotify today. The events happen in this order in time: this happens, then this, then that. This is one process that's trying to access a file — well, it's copy_file_range, never mind, some access to a file. And this is another process that's doing freeze-thaw for some reason; maybe it's an LVM snapshot, I don't know. The first process generates an access permission event — a blocking access permission event — because the file was accessed. Then the anti-virus agent, or my HSM implementation, intercepts the access permission event, because the source of the copy_file_range is a placeholder; it doesn't have data, for example. So it intercepts the call, and then it needs to fill the file — it needs to fill the source of the copy_file_range — so it needs to write into the file. So it goes into a write system call, and here we seem to have a loop, right? Because this write is going to generate another permission event. So inherently it looks like we have a loop, but that was already dealt with when fanotify was first merged, because it's the same loop an anti-virus engine has when it wants to scan a file.
It needs to read the content of the file in order to scan it. This is solved by fanotify providing a special file descriptor that has the FMODE_NONOTIFY flag set, so the listener gets special privileges to access the file without generating events. That's the basic loop, and it's easily prevented. But I'm showing another deadlock here, a freeze-protection deadlock, that I think is currently possible upstream. What happens — and this is why I use copy_file_range and not just read — is that copy_file_range takes freeze protection before it starts. Then the process that's handling the event — maybe an anti-virus scanner, maybe my HSM — wants to make a change; say the anti-virus scanner wants to move the file to quarantine, or write a log file, or whatever. That change is also going to take freeze protection, and if it's on the same file system — like when you're moving the file to quarantine — this thread wants to take freeze protection too. But if this guy comes in between and starts a freeze, then this thread is blocked by the freezer — it cannot start the write — while the first thread is blocking the freezer from completing, so you have a deadlock. Sorry if that wasn't clear. It's a deadlock that I think is currently possible; it's pretty rare, maybe too rare, which is why nobody has noticed it. So I wanted to solve this deadlock, because it's probably more common in my HSM example than with anti-virus scanners. The way I approached it: currently the access events are hooked from within the LSM security hooks — the LSM has a hook, security_file_permission or something like that, and fsnotify hooks into that. So first of all, I added an explicit fsnotify hook before taking freeze protection.
Okay — that hook is not inside the LSM, I added another one, and I annotated all the fsnotify hooks, including the ones within the LSM, according to whether or not they are called with freeze protection taken. The ones called without freeze protection are reported with the FAN_PRE_VFS flag, and then the implementation, the engine, knows whether it's safe to do a write or not, and has the option, for example, to just block the access instead of filling the file. But in our case, because there is a hook before copy_file_range starts, the implementation gets an opportunity to fill the file, and then the second hook will most likely be fine. That's the solution I came up with for this problem. Maybe there are other solutions, I don't know, but it wasn't so hard to do. Any questions before I go to the next slide? I guess what bothers me is the example I was thinking of — Apple, I think, and Windows: their Explorer, their GUI, had to be aware of these kinds of things. The app really needed to know whether the file was offline and all that. They exposed this because otherwise you have stupid things that read icons, and the first 50 bytes get read for every single file in the directory to populate some pretty little thumbnail, without knowing the consequences. Okay, you're talking about the offline bit that gives a hint to applications: don't poke into this file unless you really want to read its content, because it has consequences. Well, the HSM solution is independent of that. We can, for example, talk about exposing an offline bit with statx again, but it's independent of that. Well, the reason I was asking: would it reduce the chance that this would be an issue? Because literally on a Mac, every single file is going to be read. No, that's fine.
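Since the "called before freeze protection" annotation exists only in the POC patches, here is a user-space simulation of the decision logic it enables on the listener side. The constant and the event struct are stand-ins of my own, not a real kernel API:

```c
/* Simulation of the listener-side logic enabled by the proposed
 * "hook ran before freeze protection" annotation.  FAKE_FAN_PRE_VFS
 * and struct fake_event are stand-ins for the POC API, not upstream. */
#include <stdbool.h>

#define FAKE_FAN_PRE_VFS 0x1    /* hook ran before sb_start_write() */

struct fake_event {
        unsigned int flags;
        bool file_is_placeholder;
};

enum verdict { ALLOW, DENY, FILL_THEN_ALLOW };

/* If the event fired before the kernel took freeze protection, the
 * listener may safely write (fill the placeholder from the cloud).
 * Otherwise writing could deadlock against a concurrent freeze, so
 * the safe fallback is to deny the access instead of filling. */
enum verdict handle_event(const struct fake_event *ev)
{
        if (!ev->file_is_placeholder)
                return ALLOW;
        if (ev->flags & FAKE_FAN_PRE_VFS)
                return FILL_THEN_ALLOW;
        return DENY;
}
```

This mirrors what the talk describes: because copy_file_range now has an early, pre-freeze-protection hook, the common case hits the FILL_THEN_ALLOW branch, and the later in-VFS hook finds the file already populated.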
Yeah, so, okay, assuming you got the last slide... What does the center box represent? Okay, I didn't say what it is, but on the previous slide I just said somebody is doing an LVM snapshot, somebody is doing freeze-thaw, and if you cannot complete the freeze, you will not get to the thaw, so you're in a deadlock. That's what it stands for. But actually it's not arbitrary, because my HSM does do freeze-thaw — at least my implementation does — because of the way it collects changes: in batches, in periods. What changed between yesterday and today? And to be crash-safe — to be able to say that everything I've recorded, everything that's on disk, is in my database — I do a freeze-thaw between the periods. So it's actually part of the HSM that's creating this deadlock. Okay, so the use case here has to do with change tracking. It's similar to before, but instead of looking at the access permission event, I'm looking at a pre-create event. Pre-create doesn't exist yet; it's just part of the patches that I posted. But if we merge a pre-create event, we would end up with a problem like this. The process on the right is just doing a mkdir — I just picked mkdir, it doesn't matter what. The HSM's change tracker gets a pre-create event and needs to record the fact that something has changed. Now, my first, naive implementation used the LSM hooks — there are LSM hooks like security_inode_mkdir — and just as I hooked into access and open, I hooked into mkdir and generated a create permission event (I didn't call it pre-create, I called it create permission). But then the HSM does something when it gets the event — in my example it creates a directory, but it could do something else — and besides the loop we've talked about before, creating a directory will create another event.
This is solved very similarly, by using an O_PATH file descriptor that has the FMODE_NONOTIFY flag, so the simple loop is solved. But the other problem is the freeze deadlock again. To solve it — it's the same deadlock as before — I did the same thing, except I didn't need the pre-VFS flags; I just moved the hook, which is a new hook, to before taking freeze protection. Let's get to that. First of all, I was avoiding the obvious deadlock: I did not introduce the red hook there, I just introduced an earlier hook before taking freeze protection, so the HSM is always free to write to the file system. An event happened — "I'm going to change the file system, please record the change before I make it." That works to prevent the deadlock. But then I get to the race part of the problem. The race is that I get the event before the modification is made. I record the event; say the engine looks at the file, sees it has changed, backs it up to the cloud — but the modification hasn't actually happened yet. Now the modification happens, and it's not recorded, because I've already done the freeze, closed everything and declared everything in sync, and the rest of the change happens invisibly. You end up with a change in the file system that's not recorded. So there's a trade-off between two problems, the deadlock and the race, and the way Jan proposed to solve it is to wrap the file system modification operations in a sleepable RCU read-side section. The way it's implemented is inside mnt_want_write — I created variants of it, of mnt_want_write and file_start_write — which also embed the start of an SRCU (sleepable RCU) read-side section that covers the entire modification, from before it starts until after it's done.
And then, for example, in my HSM, either I don't do freeze-thaw at all, because I don't really have to, or I do freeze-thaw, but before that I run the SB write barrier, which is a synchronize_srcu. And then I'm able to maintain a view of a change-tracking period. There are overlapping periods, right? You have all the changes from this time to that time, and all the changes from this time to that time, where the overlap — the window in which the SB write barrier runs — means that all the changes being made from here to here are recorded in both the previous period and the current period. So you're always in a state where, if anything is dirty, you may get a false positive — things reported dirty that aren't — but you cannot miss changes. Looking at your diagram, you still actually have the deadlock, don't you? Because in the third, right-hand box, you've got the SRCU start-write before that, and if that happens after the barrier, haven't we still got your deadlock? No — the SB write barrier doesn't stop that one from proceeding. It does not. It just waits for all the ones that started before it to complete. That's the difference from freeze protection: the sleepable RCU doesn't stop new read-side sections from beginning; it only waits for the ones already started to complete. So the SRCU start-write is not a blocking operation. Basically, the way RCU works, and SRCU as well, is that it just increments a counter, and that's all; the barrier just knows to wait for everything that started before it was called, while new read sections can still begin. And the red arrows do not exist — none of this is upstream, right? This is a plan. The red arrows are the first, bad plan; the blue arrows are the plan that's deadlock-free; and the yellow ones are the attempt to solve the race.
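The barrier semantics Jan describes — wait only for writers that started before the barrier, never block new ones — can be emulated in user space with a classic two-epoch counter scheme. This is a sketch of the idea, not the kernel implementation; all `wb_*` names are mine:

```c
/* User-space emulation of the proposed SRCU-based write barrier:
 * writers enter the current epoch; the barrier flips the epoch and
 * waits only for writers of the old epoch, so unlike freeze it never
 * blocks new writers.  Names (wb_*) are illustrative, not kernel API. */
#include <pthread.h>
#include <sched.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int epoch;            /* current epoch, 0 or 1 */
static int writers[2];       /* in-flight writers per epoch */

/* analogue of the SRCU-augmented file_start_write()/mnt_want_write() */
static int wb_start_write(void)
{
        pthread_mutex_lock(&lock);
        int e = epoch;
        writers[e]++;
        pthread_mutex_unlock(&lock);
        return e;
}

static void wb_end_write(int e)
{
        pthread_mutex_lock(&lock);
        if (--writers[e] == 0)
                pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
}

/* analogue of the SB write barrier (synchronize_srcu): wait for all
 * writes that started before the barrier; new ones proceed freely */
static void wb_write_barrier(void)
{
        pthread_mutex_lock(&lock);
        int old = epoch;
        epoch = !epoch;                 /* new writers land in new epoch */
        while (writers[old] > 0)
                pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
}

static void *barrier_thread(void *done)
{
        wb_write_barrier();
        *(int *)done = 1;
        return NULL;
}

int demo_barrier(void)
{
        int done = 0, e0;
        pthread_t t;

        int old_writer = wb_start_write();      /* pre-barrier writer */
        pthread_mutex_lock(&lock);
        e0 = epoch;
        pthread_mutex_unlock(&lock);
        pthread_create(&t, NULL, barrier_thread, &done);
        for (;;) {                      /* wait until the epoch flipped */
                pthread_mutex_lock(&lock);
                int flipped = (epoch != e0);
                pthread_mutex_unlock(&lock);
                if (flipped)
                        break;
                sched_yield();
        }
        /* a writer starting after the barrier began is not blocked,
         * and does not block the barrier: */
        int new_writer = wb_start_write();

        wb_end_write(old_writer);       /* barrier only waited for this */
        pthread_join(t, NULL);

        int ok = done && writers[epoch] == 1;   /* new writer still in */
        wb_end_write(new_writer);
        return ok;
}
```

The overlap in the change-tracking periods falls out of the same structure: anything recorded between flipping the epoch and the barrier completing simply shows up in both periods, which gives false positives but never missed changes.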
I told you up front that it was going to be challenging. I guess I've shown what I wanted to show, and we're out of time. We could also move the fsinfo talk to the VFS session tomorrow, or start talking about mount namespace change notifications now, since we're on the subject of notifications. What do you say? fsinfo in the empty slot of the VFS mini-conference tomorrow, and we start — yeah, Christian, talk to you. Yes. I was proposing, since we now need a slot about fsinfo and mount-change notifications, which is a bit much for a single session — maybe we improvise and try to do mount notifications now and fsinfo tomorrow. Nicholas, do you want to lead this, or maybe I can start with the fanotify mount events; maybe it's related. Okay, anyway, I didn't get any more questions, though I didn't get to one part. First of all, does anyone want to shoot down the SB write barrier before we continue? No? Okay, Joseph, yes. So, I don't love it. I assume you've already looked at whether we could do this with the existing freeze-thaw mechanism, at another level perhaps? Yeah, so we were looking into how to do this, but the current freeze-thaw has exactly the problem that it's a relatively expensive operation: before you can declare the file system frozen, you have to finish all the writes and flush all the dirty data and so on. This is a much cheaper operation, because it doesn't have to flush any dirty data — it really only asks "have all the writes that started before the moment I asked finished?". So it is much less disruptive to the rest of the system, while freeze is very disruptive to everything running on the system. This is disruptive essentially only for the one calling the barrier.
Yes — because if we're not speaking about HSM, which is disruptive anyway, but about other use cases of persistent file modification tracking, you ultimately want unprivileged applications to be able to track changes too, possibly not on the whole file system but on some sub-hierarchy, some subdirectory. And then you don't want these applications to be able to hog your file system with, essentially, syncs and the like. So we wanted some lightweight way for these applications to wait until all the writes to the part of the file system they're interested in have finished, without imposing a performance penalty on everybody else. I forgot to say: if I have the SB write barrier, I don't need to do freeze-thaw at all. It's enough to do the SB write barrier plus syncfs. syncfs is sort of like this, because it starts at a point and flushes all the inodes that were dirty up to that point, so it's a complementary API. And I was hoping that Jeff would say "yeah, that sounds interesting, I could use that" — you forgot, huh? — because there was a problem discussed in the i_version discussion: when does i_version need to be bumped, before the operation or after the operation? It's not clear. But conceptually it's the same problem, and if you introduce this concept, maybe it can solve other problems too. In this respect, Jan was hoping we could introduce this to the VFS and just have the SB write barrier and use it. One problem is that the test bots found a regression for small writes. Small writes do not take locks, and there's a memory barrier; it was visible on tmpfs, because the additional cost of touching the extra cache line for the SRCU was actually visible on tmpfs writes. It was not terrible, but it was something like 20% — observable.
I fixed the regression for my use case: for writes, there's an SRCU context that gets passed down to the fsnotify hook, and you take it conditionally, depending on whether you're handling the event or not. But it's not a generic solution. Then we thought maybe we could use the i_version "observed" bit to decide whether or not to take the SRCU. Yeah, so the question is whether you want, similar to freeze protection, to grab the SRCU unconditionally, or whether we want to somehow hide it and grab it only when someone is actually interested, to reduce the overhead for fast file systems like tmpfs that don't really care. I don't know how to generalize this, but let's move on. For NFS it might be harder: if you want to use it with NFS, then when exporting you'd have to enable it — "I now need the SRCU grabbed for the file system I'm exporting" — so it would have to happen at that moment.