The purpose of this session is to talk about next steps for XFS online repair. The first part of the repair bits is out for review, on Dave Chinner's laptop right now, and that means it's time to start talking about the other missing pieces. Which means I'm actually going to talk more about user space today than I normally would. The state of things in user space right now is that we have a user space driver program that controls how fast we engage the online repair mechanism, and that's about it. We don't really have anything set up right now for notifying user space that we've found something weird in the file system. We don't really have any daemon or anything like that monitoring such notifications to actually issue repair requests, other than the general "go run the xfs_scrub tool without -n and it will go and fix everything." So there are pieces where we don't really have any good infrastructure in the kernel to actually handle that sort of thing, figure out what's going on, and dispatch all of it. There's also a bunch of interesting questions that have come up this afternoon about letting user space mount file systems without privileges. Earlier, Lennart, I think, was talking about some kind of system daemon that would run this before mounting a file system at the request of some random container that is not itself root in the root namespace or anything like that. So this seems like a good time to actually talk about things like that. The last bit that I wanted to cover today is the actual scrub service itself. Right now, as I mentioned, there's a driver program that you can run from the command line called xfs_scrub. It does all the work of opening the block devices and the root directory, issuing ioctls, and so on.
But the real use case for this is not running xfs_scrub from the CLI. It's really meant to run as a background systemd service that fires up periodically, once a month or whatever you configure it to be, and then runs in the background. There are a few weird problems with this approach. One is that I have no idea what I'm doing when I'm writing a systemd service definition, and I have no idea if what I've written is actually useful, safe, and non-crazy. I mean, systemd-analyze says it's fine, so I guess that's good enough for now. But the entire thing is basically not that far off from the old "insert cron job, run it in the background, and hope the system administrator is actually watching the system logs."

So what I'm interested in is setting up some kind of notification system so that we could actually respond to events dynamically: having some means for other applications to do things like invoke a background scrub. I could envision that you might want to have a container host supervisor process that would say, hey, it's pretty quiet on the system right now, why don't we go kick off a background scrub of all the mounted file systems? I don't know how this fits in with the unprivileged container mounts of XFS use case, but it's conceivable you could run an online fsck on the file system before you actually present it to containers, or even run it while the containers are running, if they're long-lived containers. So basically I have a whole bunch of questions, like: does anyone on the user space side of things care about these things? And do they have any particular questions, wants, demands, or inquiries about any of this? Because I can totally keep going with my kernel-colored glasses on, but that doesn't always produce a satisfactory result.
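For reference, the shape of such a background service is roughly a oneshot unit plus a timer. What follows is a hypothetical sketch, not the units xfsprogs actually ships (it ships similar, more carefully worked-out xfs_scrub_all units); in particular, the hardening directives here are illustrative guesses, and whether they are "useful, safe, non-crazy" is exactly the open question from the session:

```ini
# Hypothetical sketch of a monthly background-scrub unit pair.

# xfs_scrub_all.service
[Unit]
Description=Online scrub of all mounted XFS filesystems

[Service]
Type=oneshot
ExecStart=/usr/sbin/xfs_scrub_all
# Illustrative sandboxing; the real units need review by the systemd folks.
ProtectSystem=strict
ProtectHome=read-only
PrivateNetwork=yes

# xfs_scrub_all.timer
[Timer]
OnCalendar=monthly
RandomizedDelaySec=1w
Persistent=true

[Install]
WantedBy=timers.target
```

The timer replaces the old cron-job approach; `Persistent=true` makes up a missed run after the machine was off, and the randomized delay avoids every machine scrubbing at the same moment.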
So that's my prompt, and I'm kind of curious, since I can't actually see who else is in the room after the tea break, whether anybody in the audience has any questions or thoughts about that. We're going to try to bring Lennart back in; there wasn't anything in the session title that mentioned this, but we're going to try to call him back. The other thing I would suggest, while we're waiting for him, is talking to the distro folks, because a lot of the btrfs-related policies around scrubbing and that sort of thing were mostly informed by the Fedora guys, right? So, you know, I have my opinions, you have your opinions; as kernel guys we have our opinions. But user space tends to have very different opinions, and for policies of this sort I tend to rely on them, so the distro guys are for sure a really good avenue for this.

Okay, we have Lennart back in the room. Ted, do you want to go first, and then you can repeat your question to Lennart? Yeah, so I think the other bit of context, when you talked about running online scrub for untrusted file systems: the specific context there was that if the unprivileged, untrusted container admin presents a file system which has deliberately, maliciously constructed metadata, we want to make sure that the file system has been checked to be safe to mount. The hypothesis is that, to the 99th percentile, running offline fsck before the file system is mounted might be good enough for at least some file systems, because a lot of that depends on the quality of the fsck for that particular file system. But because the context was that this file system may have syzbot-inspired malicious metadata, where the checksum checks out but it's designed to provoke a buffer overrun in the kernel, you want to actually run fsck before you mount it. And that's really out of scope for online scrub.
That's sort of a separable issue from the scrubbing, where what we're talking about is whether you're creating a snapshot and then running fsck on the snapshot, which all the file systems can support, or a kernel-level fsck that doesn't require the snapshot, which XFS is on the brink of declaring fully supported. I think that's a separable question, and it's one where I think it is useful to get the distros' and Lennart's opinions about when that is appropriate to do. And let's ignore the security question of, if fsck is running in kernel context, how can we be sure that the kernel online fsck code won't have buffer overruns, because it's C code, and I at least don't know how to write bug-free C code off the top of my head. Darrick, please give a quick overview for Lennart of your project and the questions that you want to ask him.

Right. Yeah, thanks Ted, that was a pretty good summary. So, Lennart, the thing I wanted to talk to you about, ideally in person, is that for the past six years we've been developing online fsck capabilities for XFS, so that we can actually check file systems without having to unmount them. Originally this was so that you could spin up a long-lived system, run it, and then detect latent errors, software bugs, and things like that. But as we have discovered rather recently, a lot more distros than the zero I thought there were will actually let you mount XFS file systems without privilege, and in the last week or so that has clued us in to: hey, maybe we should actually try using some of these repair tools, ideally the online one, in concert with whatever it is that's mounting XFS file systems unprivileged and leaving them mounted.
But that also brings up a whole bunch of interesting issues, like how we contain the driver program in a systemd job (I have one of those), and how I actually interface with other parts of the system, where those other parts are not necessarily defined, or don't even exist, at this point. One thing we've also been building into XFS is the ability for it to notice when it encounters weird-looking metadata, or just outright bad metadata, and set flags. A thing we have not yet done is tie that into the fsnotify events that I think ext4 is now tied into, such that when corruption or lost data or whatever is detected, the kernel can actually send a notification to user space along with some details of what went wrong. Ideally for XFS, since we generally know exactly what went wrong, we could encode that in whatever we send to user space, in the hope that if there's anybody listening, they can actually schedule appropriate actions, whether the appropriate action is running offline fsck after unmounting, for some file systems; or running online fsck, so that we don't incur downtime; or just unmounting the file system and saying, hey, this is bad, it went away, bye. But from kernel land I don't really have a lot of good visibility into what user space really wants to do. This would probably make a reasonably good Plumbers topic, but alas, here I am at LSF, so I was hoping to get your thoughts.
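To make the "is anybody listening, and what would they do" question concrete, here is a minimal sketch, in Python, of the dispatch policy such a listener daemon might implement. Everything in it is invented for illustration (the event fields and the action strings are hypothetical); the kernel-side delivery mechanism under discussion is something like fanotify's FAN_FS_ERROR event class, which ext4 already generates.

```python
from dataclasses import dataclass

@dataclass
class FsErrorEvent:
    """Hypothetical decoded corruption notification from the kernel."""
    fsid: str          # which file system reported the problem
    online_ok: bool    # does this fs support online repair (e.g. XFS)?
    severity: str      # "metadata", "lost-data", or "shutdown"

def choose_action(ev: FsErrorEvent) -> str:
    """Toy policy mirroring the options discussed in the session:
    online repair where supported, unmount plus offline fsck otherwise,
    and give up (unmount, notify the service manager) after a shutdown."""
    if ev.severity == "shutdown":
        return f"unmount {ev.fsid}; stop dependent services"
    if ev.online_ok:
        return f"run xfs_scrub (online) on {ev.fsid}"
    return f"schedule unmount + offline fsck for {ev.fsid}"
```

A real daemon would read events from the kernel, map the affected device to mount points and services, and then execute the chosen action; this only shows the policy core that user space, not the kernel, would own.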
So, along the lines of: hey, what if we told you that there was a way to check XFS file systems, so that we don't have to do this scary thing where we mount some random thing that someone found in the parking lot, and who knows what it does?

So, I mean, that's definitely currently the way udisks and that stuff works, right: they mount whatever you find, and I always found that stupid. There are certain operating systems, like Chrome OS, which do not do this, which just do VFAT in user space and things like that, and I keep telling the GNOME and udisks people that that's probably what they should be doing as well: having some user space implementation of VFAT, and then focusing only on VFAT, because the use case they focus on is USB sticks, and people don't typically put XFS file systems on USB sticks, right? So for that case I think that's the way out. And then there's the other thing, which is what my talk was about: people want to be able to mount arbitrary file system images inside of containers. For that, my story for trust was always something like dm-verity, so that we establish trust before we pass anything to the kernel. As I understand it, at least Ted, for ext4, gives the guarantee that mkfs, sorry, fsck, is sufficient to establish trust for most cases. Is that the same for XFS? Would you say, basically, that if I run xfs_repair on an XFS file system, it is trusted afterwards: that you cannot exploit the kernel, cannot trigger algorithmic denial-of-service situations or anything like this?

Well, that's a good question, and a difficult one, because the instant I say yes, everybody in the world will launch their fuzzer rigs to try to find all the places where things slip through that fsck doesn't catch. In general, though, I think I agree with Ted that it should work like that.
This is what we should be doing: beating on the repair tools until we're generally confident that if a file system passes either online or offline fsck, then the file system is probably good. But it's not an absolute guarantee, because, hey, like Ted, I'm not perfect at writing C code either.

So I get the signal, then, that it's probably fine with ext4 and XFS, and probably not so fine with btrfs. But you mentioned the online file system check a lot. That's not useful for establishing trust, right? Because it means that I first have to mount the thing and then do the online check, and that's not good enough, because by then it might already have exploited the kernel. But I don't understand the semantics of an online file system check, to be honest.

Yeah. The thing I don't know about the container images case is: are we willing to trust that, generally, people won't be injecting malicious container images if they've been signed by whoever the distributor is? Is that the model: if it's signed, it's good?

But Ted suggested it's fine to even mount unsigned stuff as long as we do a file system check: an offline file system check in Ted's case, but in your case you also have the online file system check, and that's not enough for establishing trust, as I understand it, right?

Yeah; my question is, if the image is signed by the distributor, does our interest in running this drop back to where we just want to make sure the image is okay, and that there weren't any bugs in the distributor's software? We're not really looking for malicious images there. I actually do run fsck beforehand.
The model is generally, I think: if user space guarantees, through a signature or whatever, that this is fine to mount, then that is the level of trust that is sufficient for the container use case. It might, of course, still be exploitable; they might have been tricked into signing something that is buggy, and so on; but that's on user space. That's then not the kernel's fault that the image was pushed.

My main point had always been that the kernel can't, or in most cases shouldn't, be responsible for establishing a policy of when to trust an image, because the number of use cases that we have in user space is probably too vast to be considered, and especially not generically across all file systems. So the mechanism that I proposed, and asked Lennart to work on, was: trust is established in user space, through a signature, dm-verity keyed in the kernel, or whatever, and that's good enough for mounting an unprivileged file system. A whole different story, I think, is: can an individual file system, with a block device format, guarantee "I'm safe to be mounted by unprivileged users"? And that question comes in different flavors. Either it comes in the flavor of: an unprivileged process can ask for this image to be mounted on its behalf, injected into its user namespace and mount namespace; or: the container itself, the workload payload, can call the mount tool within the container and mount that XFS file system. And that second, specific option comes with a lot of caveats, because then it means whoever mounted that image also owns the superblock. That's just currently how it works; you could probably change it, but in general, if you own the superblock, you have access to all of the operations that belong to that superblock.
And you could destroy that file system, whatever. I think that's a much higher bar for trust, at least that's how I think about it.

Hey guys, I think we're thinking about this wrong. I don't think fsck actually works for trusting the file system, because a malicious device can, and will, just change the device data underneath you. It can return one set of data while fsck is running and return different data later.

There are at least three strategies to attack that. The first one is, we make a copy first. The second is that we enable fs-verity on the thing. And the third is that the client has already enabled memfd sealing and it's a memfd.

When fs-verity is in use, then yes. But I don't think we can rely on that in general; as an implementer, we cannot rely on that in general. It won't work for users mounting images on their desktop, or a user wanting to mount an image in the cloud. If you're mounting something that's conceptually a network block device, you just can't.

I think you're absolutely correct, and that's not relevant to the security question. Whenever you answer a security question, you have to have a threat model and make assumptions about your environment. And you are absolutely correct that if you pick up a thumb drive off of a parking lot, and it has what purports to be an ext4 or XFS file system, but in fact is a malicious smart device that can return different values at different points in time, nothing will save you. But I think what we're talking about here is: the image may be malicious; we copy that image into a block device that we trust; and then we run fsck on it.

Yeah, but I don't think we can force the user to make that copy a part of their workflow. I can't control the user workflow to that degree.
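The "make a copy first" strategy can be sketched in shell: the point is that fsck must run against a stable copy that the untrusted device can no longer mutate, and that what you later mount is bit-for-bit what you checked. This is a toy sketch: the "image" here is just a scratch file, the paths are invented, and the real repair invocation is shown commented out because the stand-in is not an actual file system.

```shell
set -eu

# Stand-in for an untrusted image (in reality: a device or image you received).
dd if=/dev/urandom of=/tmp/untrusted.img bs=4096 count=16 status=none

# 1. Copy the image into storage we control. After this point, the original
#    device can change its contents all it likes without affecting us.
cp /tmp/untrusted.img /tmp/trusted-copy.img

# 2. Record a digest of the copy, so we can prove the bits we check are the
#    bits we later hand to mount.
before=$(sha256sum < /tmp/trusted-copy.img)

# 3. Run the offline checker against the *copy*, never the original.
#    (Commented out here: the stand-in file is not a real XFS image.)
# xfs_repair -n /tmp/trusted-copy.img

# 4. Verify the copy is unchanged; only then is it reasonable to mount it.
after=$(sha256sum < /tmp/trusted-copy.img)
[ "$before" = "$after" ] && echo "image stable: safe to hand to mount"
```

This is exactly the workflow-control problem raised in the discussion: the copy only helps if a privileged component, not the end user, is the one enforcing steps 1 through 4.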
I think the responsible thing for us to be doing, as file system implementers, is to start taking runtime hardening of our code a bit more seriously. And I wanted to ask Darrick where XFS is in that regard. From what I've seen, XFS does verification on metadata at both read and write time, as I understand it, same as bcachefs, and also fuzz testing. I think we might not be in as bad a shape as we assume. The thing that I think has generally been missing is better code coverage analysis; I don't think anyone is doing that the way that we should.

Just to clarify, it is part of making it part of the workflow. It's not the user that does this; it's the mount daemon I gave a presentation about before. So it's not under the user's control; it's under the control of a privileged process on the system.

Yeah, but as soon as this becomes a generally accepted thing that you can do, people are going to want to use it in more and more contexts, including contexts where just copying the whole image is not an option. People are going to be wanting to mount untrusted images in the cloud very soon.

I think we're getting off the rails here. Darrick was asking user space specifically: what notifications are useful? How do we want to deal with "there's something wrong, and I know there's something wrong": the policy questions that user space needs to answer. I think that's what we're looking for from Lennart.

Yeah, that was the other half of what I wanted to know about: what do we do about notifying user space, and getting them to tell the XFS utilities to actually do something? So, I don't know what the right policy is. I mean, I'm not a storage guy, right? I don't know what storage people would like to have for policies. I can just tell you how I think about disk images.
The way I think systems should be composed these days is that every service is a stack of file system images, and the host is a stack of file system images. So if any of these file system images triggers some kind of failure, we should localize that, and then probably the smartest policy is to just shut down that specific service, right? So yeah, if you give me a notification about some specific file system being bad, we could easily hook it up so that the specific service goes down. Similarly, for example, we handle an OOM event from oomd or something like that, right, where we get the notification that a service is misbehaving and we kill it. So it could also be: oh my God, the file system image that backs the service is bad, so we kill it and create a reportable event. That would make a ton of sense to me, but I don't really know what storage users want, because I'm not really a storage guy.

Yeah, I suspect the people we need to talk to are the people who are actually running these services. So, for example, what do the Kubernetes people want? That's who we should really talk to, because they're the ones where, if we can say the file system on this particular block device has gone inconsistent, and it could be because of a hardware fault or a kernel bug, then the Kubernetes people would want to shut down the jobs that are related to that particular device, possibly giving them a short grace period where the job can tell the world, "I'm going away, goodbye cruel world," that sort of thing. But it's the people who are running and maintaining those systems who will care. We added that support in ext4 specifically for my company's internal Kubernetes-like thing, i.e. Borg, and that's what they wanted to do.
They wanted to shut down services when a file system became inconsistent, and certainly what we did was good enough for that particular use case. Now, is it good enough for Kubernetes? We need to ask the Kubernetes folks.

I mean, I think things are a little bit different here, working for a, cough, well-known database software vendor, where most of our uses of XFS outside of root file systems are really large data partitions, where we would like to be able to perform at least simple repairs on the 100-terabyte data partition to try to keep the VM running, because at any given time, the thing that's running in the VM or in the container or whatever is probably not accessing the entire 100 terabytes of data. So we have some opportunity to step in and fix the file system before the application software really notices. And yeah, if they hit the broken part and the file system shuts down, then we have to kill the container. But we would at least like to try to, you know, grow new engines on the airplane while it's flying, in order to avoid having to do an emergency landing, because in our case, restoring a few hundred terabytes might actually take a noticeable amount of time, to say nothing of the people who have ten times that much data, for whom restoring from backups is going to take a really long time. They would much rather we fixed things than just threw everything away.

So, one thing about the Kubernetes case: to my knowledge, all the container stuff is usually at a higher level, right; they unpack tarballs. And hence, if you get a file system failure, basically the only option is to shut everything down, nothing individualized.
This is different with the model we try to pursue with systemd, where services can be shipped in file system images, so that if one of the file system images is bad, you can immediately narrow it down to a very small set of services that are affected and then shut those down. But my question regarding this online file system check: why involve user space in triggering it at all? Wouldn't it suffice to have a mount option, to declare the policy in advance, that if XFS detects some failure in its metadata, it runs an online file system check? That sounds a lot more robust to me than expecting user space to come back and do this. I mean, what user space is better at than XFS itself is executing actions on something else, such as shutting down the relevant services. But just going back to the XFS file system and saying, "oh, please fix yourself"? That's just bullshit; you can do that yourself with in-kernel policy.

Yeah, I mean, I also don't really mind just writing an XFS daemon that sits around, waits for notifications, and can immediately schedule online fsck, or even waits for mount notifications and does it then. But I thought I should do at least enough casting around for information before I just decided that that's the way we're going to go.

To touch on something that Kent was asking about: we do actually have a fairly comprehensive XFS fuzz test suite, where we use the abilities of the XFS debugger to walk every single field of every metadata object in the entire file system and fuzz them. Part of the reason why the XFS QA test suite takes almost a week to run is that
a lot of that time is spent letting the fuzzing tests walk through every single metadata object to see whether the repair tool actually detects that this thing has been changed underneath us without anybody noticing, and whether it can actually fix it, whether online or offline. I don't know that e2fsck actually has that capability. I mean, I dimly recall that years ago I added some ability to fstests to fuzz metadata blocks and see if e2fsck at least detects obvious corruptions, but I don't think I ever got to the level of precision that XFS has, where we can do things like change a directory entry's inode number by one and see if repair actually notices that. It's past four, so I should probably yield to Kent, but thanks, everybody, for showing up and giving me myriad input.
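The structure of that kind of fuzz loop (walk every field, perturb it, ask the checker whether it notices) can be sketched abstractly. This is a toy illustration, not the actual fstests code: the real suite uses xfs_db to perturb on-disk fields and then runs xfs_repair or xfs_scrub, whereas here the "metadata object" is a dict, the "checker" is a checksum comparison, and all names are invented.

```python
import copy
import hashlib

def checksum(obj: dict) -> str:
    """Digest over every field except the stored checksum itself,
    the way a file system might checksum an on-disk structure."""
    payload = repr(sorted((k, v) for k, v in obj.items() if k != "crc"))
    return hashlib.sha256(payload.encode()).hexdigest()

def check(obj: dict) -> bool:
    """The 'repair tool' under test: does it notice the damage?"""
    return obj.get("crc") == checksum(obj)

def fuzz_every_field(obj: dict) -> list[str]:
    """Walk every field, flip one bit in it (like changing an inode
    number by one), and report any field whose corruption the checker
    fails to detect. An empty list means full detection coverage."""
    missed = []
    for field in obj:
        if field == "crc":
            continue
        mutated = copy.deepcopy(obj)
        mutated[field] ^= 1            # single-bit perturbation
        if check(mutated):             # should be False for every field
            missed.append(field)
    return missed
```

For example, with `obj = {"magic": 0x58465342, "inumber": 128, "nlink": 2}` and `obj["crc"] = checksum(obj)`, `fuzz_every_field(obj)` returns an empty list, since every single-field change breaks the checksum; the week-long real suite does the same walk against actual on-disk metadata and also verifies the subsequent repair.
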