All right, so it's not a lot of slides. This is going to recap just briefly the history, because I'd like this to be more of a dialogue. The original motivation is to get S3 and S4 suspend working for file systems, something that has actually been broken for years. It might come as a surprise to some folks that file system freezing on suspend is broken, but it is pretty broken, and has been for years. The problem really stems from the fact that we don't have a unified way to automatically freeze file systems when we're going to suspend or hibernate. When you're doing a lot of IO, what ends up happening is, essentially, you just get a hang, and that, obviously, is not a pleasant experience for users trying to come back from resume.

So this is an eight-year-old problem. Jiri Kosina had described this issue with the kthread freezer and the semantics we use in the kernel about eight years ago; I think that was the kernel summit in South Korea or something like that, I forget which one. The kthread freezer APIs are basically just used sloppily, and they came out of the idea of trying to help file systems, with the goal of making suspend much easier.

A while ago, maybe about three years ago, we discussed some of these patches, and a few issues came up. One was: all right, cgroup freezing is broken. Is it still broken, or is anyone here familiar with whether that's still the case? No, it's fixed. What did you do? Oh, fantastic, thank you, that's a great help. And it took how many years to get that fixed? Oh, wow. OK, look at that. Great. So, freezing races on unmount: is that still an issue, or are you folks familiar with that? No? Anyone? All right, well, I guess we'll have to keep that in mind.

Ordering changes are maybe the more complex thing to address here, and they imply that in the future we may need a graph to keep the ordering of the superblocks. Al Viro regretted and lamented the fact that he had implemented support for an ioctl called LOOP_CHANGE_FD so that Fedora could support live installations and basically allow you to immediately jump over to the installed system. That breaks the ordering, and as such, if you do suspend such a system, the assumption we're making when iterating through supers would be broken. So we just have to be aware that suspend would be broken if you freshly install a system and use that feature. I'm not sure if Fedora will migrate away from that, or if other distributions support the feature; just keep it in mind if we get this merged. RAID is another example where we have different devices and maybe the ordering is not exactly the same.

The concept here is essentially that we want to iterate over the superblocks, and the assumption is that we can iterate backwards to do the freeze for suspend. So long as the topology is simple, things should work, and in most cases for users on laptops or mobile devices it should, in theory, work; that's where people are doing it. I don't think folks running RAID are actually doing S3 or S4 suspend. Anyone? Yeah, that's what I suspected. So long as we keep this in mind, we can move forward. Long term, though, we do need a graph. If folks are interested in helping to implement the proper graph in C, I do have some initial template code for that.
I just haven't looked at it for a while. You are? Yeah, yes? Oh, just at the bottom, you've got to press it all the way.

This might have been better with the block people in the room as well, because some of the things you talk about touch on the block layer too, not just the file systems. Well, the ordering is on the superblocks, so it is technically self-contained. We do have a superblock for the block device cache, but that's a bit different; we don't deal with it when iterating supers. We're only supporting this when we have a backing device. Yeah, but it's really a combination of block devices and superblocks together, because with stuff like loopback devices you really create a dependency between the two. Oh, in that sense, yes, and it is a good point that that dependency does exist, and that there are complexities in block device topologies.

So in the end, I agree: we are interested in the partial order of the superblocks, so that we can walk it from the top and freeze the file systems one by one without creating deadlocks. We need to freeze the top first so that the file system underneath can still take the IO. It does raise the question of what the future layout for this might or should look like, too. Didn't you just repost my old patch that does the freezing in reverse? Yeah, that's basically what's done. But this raises another question: for now, if we want to, we can dedicate our time towards what this should look like in the future.

Oh, the other thing that was mentioned is that we actually don't have a notification to user space. This was mentioned a while ago. It should be relatively simple to add, to let interested applications know that we are going to do an automatic suspend and freeze so that they can quiesce beforehand. That doesn't exist; I'm not sure if it was ever added, but we could just add it. I'm not sure what user space would prefer. Is Lennart here? Maybe not, he has a talk. But being informed of an impending freeze, or of a freeze happening, could be interesting; the service could be frozen as well, for example. Does anyone have a current hunch on what we should do in terms of user space?

Yeah, so, I don't know that I can answer that question, because the general problem, and certainly Windows applications have seen this, is that you have to give the user application a certain amount of time to quiesce. But then if they don't actually quiesce within some timeout, say 15 or 30 seconds, you simply have to go on without them. So typically the actual notification I don't think is hard; that's just plumbing, you use D-Bus or whatnot. The other thing I want to note here, though, is that there are also network block devices, which very much are a potential issue. That was actually identified eight-plus years ago, and everyone just sort of said, yeah, that's hard, and they all backed away slowly. Yeah, so that is certainly hard. But I believe the notification as such is probably more of a user space problem. I guess it would be perfectly fine if the notification went out on D-Bus or wherever, where applications can listen to it.
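For context, here is a minimal sketch of the freeze-in-reverse idea discussed above. This is not the actual posted patches: the helper names filesystems_freeze()/filesystems_thaw() are assumed, the freeze_super() signature with a kernel freeze holder matches recent kernels but may differ, and the superblock reference counting and locking that real code would need are elided.

```c
/* Sketch only (not the actual posted patches): freeze superblocks in
 * reverse registration order before suspend, thaw in forward order on
 * resume.  Real code needs superblock refcounting, sb_lock handling,
 * skipping superblocks that are dying or have opted out, and error
 * handling; FREEZE_HOLDER_KERNEL follows recent kernels. */
#include <linux/fs.h>
#include <linux/list.h>

extern struct list_head super_blocks;	/* global list in fs/super.c */

static void filesystems_freeze(void)	/* hypothetical helper name */
{
	struct super_block *sb;

	/* Newer superblocks sit towards the tail, so walking in reverse
	 * freezes e.g. a filesystem on a loop device before the
	 * filesystem holding its backing file, which must still be able
	 * to service the writeback the upper freeze generates. */
	list_for_each_entry_reverse(sb, &super_blocks, s_list)
		freeze_super(sb, FREEZE_HOLDER_KERNEL);
}

static void filesystems_thaw(void)	/* hypothetical helper name */
{
	struct super_block *sb;

	list_for_each_entry(sb, &super_blocks, s_list)
		thaw_super(sb, FREEZE_HOLDER_KERNEL);
}
```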
And simply, if someone just echoes to some proc file to suspend, OK, that's going to happen without notification. But if you trigger it from your desktop environment or whatever, by some tooling, then that should be wired up to generate the D-Bus notification, plus all the stuff you mentioned, like timeout handling. But I believe this can all be done there, so for the kernel side I don't think we really need to implement any notification as such.

There is possibly one added complexity, and that's things like FUSE, where you have to stop the kernel bit. Now that you mention it, the checkpoint/restore folks have been doing this for a long time, and they are dealing with specific complex issues like FUSE right now.

So complex dependencies have been a long-term thing in different areas of the kernel. Even at build time we have this, and that's why I had resolved a simple DAG at build time using ELF sections, the whole old linker-tables stuff. That basically allows you to create a DAG: you can, if you want to, no-op certain ELF sections, for instance, and you just do it at link time by sorting the sections; you basically let the linker do it. But that's build time and link time, and this is a dynamic dependency, right? Still, it raises the question: what other areas of the kernel need a DAG, how do we implement it, and is it possible to make a DAG that user space can register bits on?

Implementing the graph itself, classically, is trivial. The question is really gathering the information, gathering the dependencies. I wouldn't say it's difficult; it's just that you have to do it in a lot of places: thinking about all the cases, handling FUSE properly, adding to each file system the dependency on the resources it actually needs. Usually the file system depends on a device, the device may in turn depend on a file system, or we may have file-system-to-file-system dependencies. So this is going to require a lot of research and poking in many places to do it properly. I don't think it's complex, but it involves a lot of places.

This is obviously a beginner question, but when I look through xfstests to see if we have any tests for this, I find a dozen examples where we test freezing. Maybe I'm misunderstanding this, but at a high level they appear to be: OK, we've got this XFS feature, xfs_freeze, and it calls an ioctl to tell the file system to maybe do a snapshot or save its state at this point, so that if something bad happens when we shut down, it's OK. But what I couldn't figure out, and it's really hard to figure out, is: does anybody other than XFS support it? So, basically all the standard block device file systems support freezing. We have support for this in the VFS, and for the simple file systems what's in the VFS is enough. The more advanced file systems, like XFS, ext4 or btrfs, have their own support for freezing. So I would say every file system now has freezing. The thing is that what we test in fstests is the freezing of an individual file system. This is more about the fact that we need to make the kernel freeze the file systems when it is about to suspend, and that's what's not happening currently.
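To illustrate the point above that the graph itself is the easy part, here is a tiny, self-contained topological sort (Kahn's algorithm) over a made-up freeze dependency graph. None of the names or edges come from the kernel; as said, the hard part is collecting the edges (loop devices, FUSE servers, and so on) in the first place.

```c
/* Illustration only: an edge A -> B means "A must be frozen before B",
 * for example a filesystem on a loop device before the filesystem that
 * hosts the backing file.  Thaw order is the reverse of freeze order. */
#include <stdio.h>

#define MAX_NODES 8

static const char *name[MAX_NODES] = { "fs-on-loop0", "root-fs", "data-fs" };
static int nnodes = 3;
static int adj[MAX_NODES][MAX_NODES];	/* adj[a][b]: freeze a before b */

int main(void)
{
	int indeg[MAX_NODES] = { 0 };
	int order[MAX_NODES], queue[MAX_NODES], qh = 0, qt = 0, done = 0;

	adj[0][1] = 1;	/* fs-on-loop0 before root-fs (backing file lives there) */

	for (int a = 0; a < nnodes; a++)
		for (int b = 0; b < nnodes; b++)
			if (adj[a][b])
				indeg[b]++;

	for (int n = 0; n < nnodes; n++)
		if (indeg[n] == 0)
			queue[qt++] = n;

	while (qh < qt) {
		int n = queue[qh++];

		order[done++] = n;
		for (int b = 0; b < nnodes; b++)
			if (adj[n][b] && --indeg[b] == 0)
				queue[qt++] = b;
	}

	if (done != nnodes) {
		fprintf(stderr, "cycle in freeze dependencies\n");
		return 1;
	}

	printf("freeze order:");
	for (int i = 0; i < done; i++)
		printf(" %s", name[order[i]]);
	printf("\n");
	return 0;
}
```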
Currently we just don't freeze the file systems; we just let them live. Just grepping the tree, right, I think I found very few places where the word freeze even shows up. So what's the entry point? I mean, there's an ioctl for XFS, but I think only a few file systems expose that ioctl. No, no; many file systems support it, really. The ioctl is the real entry point, and the file system is not obliged to implement the freeze_fs method. There is a superblock method, freeze_fs, but actually a lot of file systems don't even need it, because we have the freeze_super() function, which takes care of writing back all the data, blocking all the writes and so on. The file system only needs something more if it has to do things like stopping its internal threads that might be writing to the journal. If the file system doesn't need any of that fancy stuff, then the VFS does it all for it. Yeah. It's really only for file systems that want to do something safer than just letting the VFS write it all out, like, for example, actually committing a transaction, so that in case the laptop never comes back from suspend the data is more likely to be safe. But strictly speaking that's not necessary, and so you can freeze, say, a VFAT file system even though it doesn't have freeze anywhere in its code. Right?

So with network file systems, presumably they want to do things like returning leases, if that can be done quickly; I was just thinking of data. That's actually a really good point, because I was dealing with a server resource-constraint and flow-control thing recently where those leases are somewhat expensive for the server to be tracking, and if we know we're about to freeze, releasing all the leases is much cleaner. Whether the network file systems expose the freeze ioctl, better look; but for doing this, the freeze_fs superblock method is the place you should be doing it in. Yeah.

OK, but maybe we can come back to the earlier question; Lennart's here, right? Hey, so one of the things we were discussing earlier that we didn't have a clear solution for was a user space notification for the fact that freezing of file systems is going to happen, to allow applications some time to quiesce. Yeah, so one of the things that occurs to me, given what I know, is what I use on my laptop: I have the hacky RTC timer thing to do hybrid suspend, which suspends to RAM for a bit, and then after five minutes, if it's still suspended, it goes into hibernation. So the question would be, would that be a good method here? That way we don't suspend to RAM immediately; we use the RTC timer to allow applications to quiesce for, I don't know, three minutes or some user-configurable setting, and then we issue the suspend. Then user space gets at least a notification, but the question would be what notification we should send to user space, and how. So actually we have a lot of infrastructure for things like that. Before we go to suspend, people can, in user space, take something, what do you call it, an inhibitor lock, that basically blocks suspend until they've finished, and it comes with a timeout; if they don't finish by some time, we go to suspend anyway. But that's entirely a user space concept. Also, the thing that you were just describing, the hybrid one, we call that suspend-to-hibernate. No, suspend-then-hibernate.
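As a reference for the entry point mentioned above: from user space, freezing goes through the generic FIFREEZE and FITHAW ioctls on the mount point, which is what fsfreeze(8) and xfs_freeze use. A minimal example, requiring CAP_SYS_ADMIN:

```c
/* Freeze or thaw a mounted file system via the generic FIFREEZE/FITHAW
 * ioctls.  The fd is an open handle on the mount point. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s freeze|thaw <mountpoint>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[2], O_RDONLY | O_DIRECTORY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	unsigned long req = strcmp(argv[1], "freeze") == 0 ? FIFREEZE : FITHAW;
	if (ioctl(fd, req, 0) < 0) {
		perror("ioctl");
		close(fd);
		return 1;
	}

	close(fd);
	return 0;
}
```

The point of the patches discussed here is that the kernel would do the equivalent of this automatically on suspend, without user space having to call it.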
That's actually implemented too; systemd has logic for that, and it can actually look at the battery and then make different decisions. But I mean, we never call the freeze from user space directly, right? That's the thing: given that user space decides it does want to go into S3 or S4, the kernel will do some work prior to that to freeze the file systems, so that we stop IO in flight. But it was requested, maybe years ago, that prior to doing that it might make sense to notify user space that this event is about to happen, to allow applications to gracefully slow down a bit or do something. Oh, it is? OK, so that problem is taken care of, then. Well, it is taken care of if applications actually want these notifications, of course, but we have them: they get notified early that we are going to go into suspend, and then they can either do something or not. And if they ask nicely, they get a couple of seconds before we continue, and then they have to report back to us that they have done it.

By the way, regarding freezing, something we are thinking about that might be relevant in this context: we currently have this problem with the thing you mentioned, suspend-then-hibernate, that when we come back from suspend and then decide to go on to hibernate, all of user space starts running again, right? For a brief moment of time. That's just stupid, right? Conceptually we are still asleep, except that we are not. So we actually want to use the cgroup freezer for this, so that we can freeze all of user space, most of the cgroup tree basically, except for the one little thing that checks what the battery status is and then goes either back to suspend or to hibernate. Is that somehow relevant? This thing is probably going to be a pinned process that runs with very little resources, and it's supposed to be the only user space process actually running at that moment.

So I believe that from the kernel's point of view, and for the file systems we are discussing here, we have to unfreeze anyway, yeah? Because your process will possibly need to access the file system and so on. So yes, we are discussing here that we need to freeze file systems before suspending so that the on-disk state is consistent, because that gives better behavior when you then decide to power off the machine instead of actually resuming. So from the kernel's point of view I don't think there will actually be a difference, at least for the file systems: we need to unfreeze the file systems for your application to be able to check the battery and decide, and then we can freeze them again when going to hibernate. Some people suggested that we shouldn't run this binary off the root file system but off a memfd or something like that. I never wanted to do that, so if you're basically telling me now that I don't have to, then I'm happy. Yeah, I guess you don't have to; we could make it work even without that, and I think it would even be simpler for us. It would be much simpler for me too, because doing that would mean we would have to compile static binaries and such, and I'd rather not do that. All right, now it seems like the user space side is almost solved. So I guess now just review the patches.
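The infrastructure described above is logind's Inhibit() D-Bus call; a delay-mode lock gives the holder a bounded window (InhibitDelayMaxSec) before suspend proceeds. Here is a rough sd-bus sketch, assuming libsystemd is available; the who/why strings are made up for illustration.

```c
/* Take a logind "delay" inhibitor so a daemon gets time to quiesce
 * before suspend.  Build against libsystemd (pkg-config libsystemd). */
#include <stdio.h>
#include <string.h>
#include <systemd/sd-bus.h>
#include <unistd.h>

int main(void)
{
	sd_bus *bus = NULL;
	sd_bus_error error = SD_BUS_ERROR_NULL;
	sd_bus_message *reply = NULL;
	int fd = -1, r;

	r = sd_bus_default_system(&bus);
	if (r < 0)
		goto out;

	/* "sleep"/"delay": suspend is held back, up to InhibitDelayMaxSec,
	 * until every holder closes its lock fd. */
	r = sd_bus_call_method(bus,
			       "org.freedesktop.login1",
			       "/org/freedesktop/login1",
			       "org.freedesktop.login1.Manager",
			       "Inhibit", &error, &reply,
			       "ssss", "sleep", "mydaemon",
			       "flush state before suspend", "delay");
	if (r < 0)
		goto out;

	r = sd_bus_message_read(reply, "h", &fd);
	if (r < 0)
		goto out;
	fd = dup(fd);	/* the original fd is owned by the reply message */

	/* ... subscribe to PrepareForSleep and quiesce when it fires ... */

	close(fd);	/* dropping the lock lets the suspend continue */
out:
	if (r < 0)
		fprintf(stderr, "inhibit failed: %s\n",
			error.message ? error.message : strerror(-r));
	sd_bus_error_free(&error);
	sd_bus_message_unref(reply);
	sd_bus_unref(bus);
	return r < 0;
}
```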
Chime in or just provide a comment; I don't think there's a blocker, I think there's a way out. Just review the patches if you think this is useful or important. Other than that, what I really wanted to talk about is: what's next? It seems like after eight years we might actually be merging some of this stuff soon. That raises the question of what to do for the other subsystems. Remember, this is the kthread freezer API stuff; it's basically just hacks in the kernel. It checks "should we try to freeze", and then there's a whole bunch of flags, like the freezable workqueue stuff, that do the same thing. There are ways to remove all of this. It's all done with Coccinelle: I'm removing this stuff with Coccinelle, one file system at a time. Over three years it's proven to work, and I've manually reviewed the output of the patches; they seem sensible. But it raises the question: once we're done with file systems, should we just go ahead and remove this one subsystem at a time? Is there any concern? Are other subsystems using the kthread freezer APIs in an incorrect way at this point in time, or for things that perhaps we didn't think about? Remember, the kthread freezer API was introduced to allow file systems to stop IO in flight; at this point, the question is whether other subsystems are using it for other things we didn't think about.

Also, cgroup v2 does have a new freezer API, but the cgroup v1 one still exists, and that, as far as I understand, uses the old freezer stuff. I don't remember; I wasn't following it very closely. My understanding is there's a reason why they didn't want to change the old way it was done, and that's why only cgroup v2 has the new one. I suspect that if it's a question of removing everything, there will need to be a discussion about why, and whether we can change over to it. Can you compile a kernel without cgroup v1, so that you can say only cgroup v2 is supported?

Network file systems like NFS or AFS have the same issues. And possibly also knfsd and ksmbd; I don't know whether we're interested in freezing those, because normally those will be on a server. Yeah, fair enough. In that case, those need to deal with inbound IO that's already in progress; if it can't be cancelled, we just shut the TCP connection, if you've got a TCP connection. Oh, yeah. So in the example of SMB, there are persistent handles, so if you took the system down and reconnected within a reasonable amount of time, the state would be preserved and you won't lose any data. But I think the risk is that you want to reject certain types of incoming requests, maybe reject an open; there's no point in accepting any new open requests if you're about to shut down, right? Even FUSE just returns not supported, right? So if you tried to freeze such a file system today, you'd just get an error. Yeah, so maybe the network file systems should really be handling the freeze_fs callback and doing the work to shut things down there, because that's basically the notification for the file system. freeze_fs gets called when the VFS has already blocked all the writes, blocked all the page faults for the file system, and flushed all the dirty data; then it's time for the file system to clean up everything in preparation for the freeze. And you get the unfreeze_fs call when you are coming back. So I believe these two callbacks should be used.
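For reference, the two callbacks just described are the .freeze_fs and .unfreeze_fs members of struct super_operations. A hedged sketch of how a network file system might use them; everything named examplefs_* is a hypothetical stand-in for protocol work such as returning and reclaiming leases.

```c
/* The .freeze_fs/.unfreeze_fs hooks are real; the examplefs_* helpers
 * are made-up placeholders. */
#include <linux/fs.h>

static int examplefs_return_leases(struct super_block *sb)  { return 0; }
static int examplefs_reclaim_leases(struct super_block *sb) { return 0; }

static int examplefs_freeze_fs(struct super_block *sb)
{
	/* Called from freeze_super() after the VFS has blocked new
	 * writes and page faults and flushed dirty data: do the
	 * protocol-level quiescing here (return leases, stop background
	 * workers that would issue new requests, and so on). */
	return examplefs_return_leases(sb);
}

static int examplefs_unfreeze_fs(struct super_block *sb)
{
	/* Called on thaw/resume: undo whatever ->freeze_fs() did. */
	return examplefs_reclaim_leases(sb);
}

static const struct super_operations examplefs_super_ops = {
	/* ... statfs, evict_inode and friends elided ... */
	.freeze_fs	= examplefs_freeze_fs,
	.unfreeze_fs	= examplefs_unfreeze_fs,
};
```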
But if something isn't working there for you, then we can certainly talk. Is the process already frozen by the time the freeze callback is called, or is the process still running? And what do the checkpoint/restore folks want to do? That's independent. So freezing for FUSE is the same ordering problem, but complicated by the fact that user space is involved. The other thing FUSE could do is make the wait for requests to be replied to freezable. But the problem with that is that the operation could be holding a VFS lock, a mutex or anything, and so even if we make that wait freezable, if the operation is holding something, then something else might be waiting on that lock, and that won't be freezable, so it will block freezing. So it's a difficult issue for FUSE. Look at it from an Android phone or a Chromebook, which you want to freeze, or even a laptop. On a laptop you typically don't have NFS; you might, I do, but most people in the world don't. So in terms of priority, FUSE is probably higher than NFS, but it needs to be solved anyway. So I guess for the other subsystems we'll just take it one at a time, and for network file systems it seems it's going to take some time too. But that's why we have a flag for the superblock: if you don't want to support this for now, or you're not sure yet, just don't add the flag. All right, we're going to move on to the next topic, right? So.