Sorry for the ominous title. It's not really a visionary talk, it's just three brief items, and one of those you should already be familiar with. It's stuff that I've been thinking about. I think the first one I haven't talked about before. (Okay if I stand here?)

In container runtimes we have this issue of using loop devices a lot. With loop devices you can change the media that is attached to them, like images and so on. So you could be trying to mount a loop device, and while you're doing so the image changes, and there is no easy way to detect this. This was a long-standing problem for quite a while, and then Christoph came up with the idea of introducing the concept of a disk sequence number. All disk devices have a monotonically increasing sequence number, a 64-bit integer, which can be used to detect media changes. This also goes for USB sticks that you unplug and plug into your computer again. The sequence number can be queried using the BLKGETDISKSEQ ioctl, so user space has a way of finding out when the media has changed, for example the media attached to a loop device. And nowadays systemd will, in addition to a bunch of other stuff such as /dev/disk/by-uuid, also have /dev/disk/by-diskseq, so you can reference disk devices through these symlinks by their disk sequence number.

So this already eliminates a bunch of races, but not all of them, as far as I understand. You could, for example, still specify a block device, or in the new mount API you could use fsconfig to set the source property, and the source property would be a string, /dev/sda1 or /dev/loop1 or whatever. That loop device gets resolved and you're mounting it, but in between the media attached to the device has changed.
So you'd be mounting the wrong thing, and this is a worry for unprivileged mounts, for example, or for mounts done in a container. The idea that I had pitched off-list to Christoph before was to introduce a new generic property in the new mount API, which I called, as a working title, source disk sequence number, or source diskseq. In addition to the source property, you would query the disk sequence number of the specific device that you want to mount and set it with the new mount API. Once the file system actually comes to looking up the block device that is supposed to be mounted, it would be able to detect: hey, this is no longer the block device that we actually care about, the disk sequence number has changed, go away.

I find this to be fairly uncontroversial, but I wanted input from all of you. There is some work associated with this, because I think it would mean that most block-backed file systems that haven't been ported to the new mount API would have to be ported for this to work. I'm fine with doing this work, I don't mind. I think there are also probably a lot of patches that already exist in this area. Once this is done, we would need a helper in the block layer that would be able to compare against the disk sequence number stored in the block device, and that's about it.

So for old file systems, couldn't you also do an fcntl, which would mean you could query this either way, and that would mean they wouldn't... okay, I know everything needs porting, but...

Sorry, I have really bad hearing, James.

Use a file system control as well as the fsconfig.

And what would the file system control do exactly?
Basically some sort of thing to either query the disk sequence number, and then you have to compare, or you could actually set it: you do the control first, then you do the mount, and it fails if it doesn't match, even with the old mount API.

Is there one specific file system that we know about? Who else needs to be moved to the new mount API?

I haven't looked, to be honest. I just assumed that not all of them have been converted.

Yeah, I'm going to do it next week. I mean, I know we haven't.

It's not a problem. If you don't have time to work on it, I can probably carve out some time and do this thing.

No, I've had requests from other people. The Fedora guys are switching their stuff to use the new mount API, and I'm like, hey... So I'll get it done, I promise. But I'll just go look and see, because if it's relatively straightforward, let's just convert everybody as quickly as possible and just do this.

Yeah. So, anything more? And the other thing you've probably already...
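The proposed check can be pictured with a small toy model. This is not kernel code and not an existing API: the `source_diskseq` parameter, `ToyBlockDevice`, and `toy_mount` are hypothetical names, chosen only to show how passing the expected sequence number along with the source would turn a silent wrong mount into an error.

```python
import itertools

# One global, monotonically increasing counter, like the disk sequence number.
_global_seq = itertools.count(1)


class DiskSeqMismatch(Exception):
    """Raised when the media changed between the query and the mount."""


class ToyBlockDevice:
    """Stand-in for a loop device whose backing media can be swapped."""

    def __init__(self, name):
        self.name = name
        self.media = None
        self.diskseq = 0

    def attach(self, media):
        # Every media change gets a fresh sequence number.
        self.media = media
        self.diskseq = next(_global_seq)
        return self.diskseq


def toy_mount(dev, source_diskseq=None):
    """Model of the lookup step: refuse if the caller's expectation is stale."""
    if source_diskseq is not None and dev.diskseq != source_diskseq:
        raise DiskSeqMismatch(
            f"{dev.name}: expected diskseq {source_diskseq}, found {dev.diskseq}"
        )
    return dev.media
```

In this model, attaching one image, remembering its sequence number, and then swapping the image makes `toy_mount(dev)` return the wrong media without complaint, while `toy_mount(dev, source_diskseq=seq)` raises `DiskSeqMismatch`.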
Oh, sorry. Now I'm suddenly regretting inviting you.

I just wanted to mention that, of course, it's great if we check the diskseq during mount, but at least in theory it would probably be wise to also check it while the thing is mounted. So if you have a Btrfs file system mounted and some day later the media changes, the diskseq changes, ideally Btrfs would just say: okay, this is now invalidated in some way. I think it's most interesting to have it during the actual mount, but to perfectly close the security hole you would probably have to...

Yeah, there has been discussion recently about how it would be really nice, and I think Christoph has actually sent patches, so that the block layer can inform the file system when media has been ejected, so that we can simply shut the file system down completely. Because I think the problem is not just that the media switches out from under you. We want to shut it down as soon as the old media is removed, right?
And I don't even care about what happens when you insert the new media. I want the file system gone when it's ejected.

But we do this anyway. If you eject a block device, it goes down to being a zero-size device. When we rescan it, it comes back as a different device. Now, file systems don't like the sudden change, but it does mean you can't corrupt the device.

Right. I think the issue is that, because the file system isn't informed, we essentially have to check all the time, or people report kernel crashes because the media was ejected: the file system didn't know that the media had been ejected, and it gets an error or dereferences a pointer that was freed behind the file system's back when the media was ejected. So this is why we want a callback, as opposed to it just happening and expecting that we will deal with it.

Okay. So many of you have seen this patch set or have followed it. The inspiring cause for this was the ability to do what you could call upgrading mounts or replacing mounts, and we thought about several ways of doing this. One of the very tricky requirements of this work is that it needs to work with mount propagation, because the way systemd services are usually used on most modern systems is that you have propagation set up between the system services and the host system. If you want to change a mount across all services, then you need to be able to somehow update that mount in all services, and the only way this can work with the current VFS layout is with mount propagation. The problem is, if you update, for example, /usr: you put a first mount on /usr, it propagates into all the services, and all the services see a new, updated /usr. You do this a second time, and now you have a mount beneath that mount. If you do it another time, then you have another mount on top of it, and so on, so you have an endless stack of mounts.
If you think about running a thousand services and you upgraded five times, then each of those services has a stack of five separate mounts in the kernel that are duplicated. If you unmount first and then mount again, you have the problem that you expose the underlying mount point in between, so the service might see old data. So we tried to think about ways to achieve this, and the easiest way implementation-wise, so that you don't run into additional mount complexities, was to allow mounting a mount beneath another mount. There's really not much more to it.

The way that we do it, and I think I briefly talked about this yesterday: if you have an fd open to a specific file, in this case a mount point, you have a dentry and a mount, and that path usually doesn't change. But if you go into the mount code, you call a function called lock_mount(), which acquires and drops the namespace lock and walks up the mount stack. So let's say you have /opt, and on top of /opt you have a new /tmp mount, and on top of that you have an /srv mount, and you have a file descriptor to the /opt mount. What the kernel does in lock_mount() is: okay, there's something mounted on /opt, this is the /tmp mount. Now I'm at the /tmp mount, acquiring the lock again, walking up, getting the /srv mount, and that's my topmost mount. Okay.
I'm done, and then the namespace lock is held. You can't overmount anymore, and then you stack the new mount on top of this mount. Correct me if I'm wrong. So what this does is: it walks up to the topmost mount and then shoves a new mount under the topmost mount. At that point you can unmount the topmost mount and you reveal the underlying mount.

It's basically a way to replace mounts without actually replacing the mounts, because there are several complexities involved in this. If you really were to replace a mount, you would run into the issue that one of the mounts could have child mounts mounted on dentries of its parent mount, and if you shove in a new mount to replace that mount, the child mounts are not guaranteed to have mount points on the new mount anymore. So that really doesn't work. The other thing is, if you call umount afterwards, you are potentially holding the namespace lock for a very long time because of mount propagation: you first propagate a bunch of mounts, and then the umount propagates again. So it's really not nice to do it this way. With the ability to shove a mount beneath an existing mount, you just need to do the mount propagation once, and then it's up to user space to upgrade to the new mount by calling umount, and then mount propagation will reveal it.

There's a... okay, Ted, go ahead.

Yeah. How does this interact with overlayfs?

There shouldn't be a problem. When overlayfs is mounted, it clones the underlying mounts of the lower layers, so it's a private mount stack. They don't appear anywhere, and the overlayfs mount is just a single separate mount.
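The three upgrade strategies compared above can be sketched with a toy model of a single mount point. This is a deliberately simplified illustration, not how the kernel represents mounts: `ToyMountPoint` and the `upgrade_*` helpers are made-up names, and propagation across services is left out entirely.

```python
class ToyMountPoint:
    """One directory with a stack of mounts; the last entry is what users see."""

    def __init__(self, underlying):
        self.underlying = underlying  # what is visible with nothing mounted
        self.stack = []

    def visible(self):
        return self.stack[-1] if self.stack else self.underlying

    def mount_on_top(self, mnt):
        self.stack.append(mnt)

    def mount_beneath(self, mnt):
        # Slide the new mount under the topmost one; visibility is unchanged.
        if not self.stack:
            raise ValueError("nothing mounted here to mount beneath")
        self.stack.insert(len(self.stack) - 1, mnt)

    def umount_top(self):
        return self.stack.pop()


def upgrade_by_stacking(mp, new):
    """Keep mounting on top: correct view, but the stack grows forever."""
    mp.mount_on_top(new)
    return [mp.visible()]


def upgrade_by_remount(mp, new):
    """Unmount, then mount: the underlying directory is exposed in between."""
    seen = []
    mp.umount_top()
    seen.append(mp.visible())
    mp.mount_on_top(new)
    seen.append(mp.visible())
    return seen


def upgrade_beneath(mp, new):
    """Mount beneath, then unmount the top: old view, then new view, depth 1."""
    seen = []
    mp.mount_beneath(new)
    seen.append(mp.visible())
    mp.umount_top()
    seen.append(mp.visible())
    return seen
```

In this model, five upgrades by stacking leave five entries on the stack (times a thousand services in the scenario from the talk), the remount variant briefly shows the underlying directory, and the beneath variant goes straight from the old mount to the new one with a single entry left.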
So it's just a regular mount point. It shouldn't be a problem. You would shove another overlayfs mount under the existing overlayfs mount, unmount the top one, and then you have updated to the underlying overlayfs mount. So this shouldn't be a problem.

The more intricate parts... maybe I can briefly illustrate this. This is just a kernel that has these patches, but I want to illustrate what one of the problems with mount propagation is. You can see there is a single /opt mount. Now, if I mount on top of this /opt mount, let's say, 10 times, how many mounts do we have? We have, like, thousands. If you type findmnt, it's going to be very ugly. The problem really is that, if you have a situation where the parent mount and the child mount that is mounted on top of the parent mount are in the same peer group, then they propagate to each other. Which means that, if you propagate a new mount, the way this internally works in attach_recursive_mnt() is that you first propagate: you mount a copy of the mount on top of the /opt mount, and then you do the actual source mount, mounting it beneath the mount that you just propagated. So then you have a mount stack. And you can see this grows almost exponentially: the more mount points you have, the more mounts you are creating due to mount propagation. I'm really not sure if these semantics were intended or if they just ended up there by accident, but I wanted to avoid this with the patch set I'm working on.

Would it make more sense to have just a swap mount, a mount swapping, for that one?

Yeah, I thought about this. This is the replace mounting, but then you have to have certain restrictions. This is what I tried to say earlier.
This way, by moving a mount beneath another one, the mount that you're mounting beneath can have child mounts, and it doesn't really matter. But if you replace a mount, then the mount that you're replacing and the mount that you're trying to replace it with need to be the same mount, so that all of the child mounts are guaranteed to have mount points on the new mount.

So when you said it's the top mount, it's not necessarily the topmost mount, because there may be yet more mounts on top of it?

Sorry, top is the archaeological top, not child.

Understood.

So you can have an /opt mount, and on that... let's say you have /a/b/c. /a is your mount that is mounted on top of the rootfs file system, and on /a/b there is another mount, a child mount. If I now take /d and want to shove it in, to replace, so to speak, the /a mount, then the mount that is mounted on /b doesn't have a mount point on the new mount anymore. So I would need to get rid of it. And that is potentially problematic if you don't want to do that. You may want to, for example, wait until you have unmounted specific child mounts before you actually upgrade to the new mount.

The top and beneath are referring to a single directory, the stack on a single directory.
I know. And then the children are in the directories underneath. Understood.

So this mechanism is also a little more flexible. The replace logic would require that you always want to replace that mount, but sometimes you might just want to mount it beneath, let the service do its work, and then unmount and upgrade to the new mount.

It just sounds like there are going to be problems somewhere, because you're changing the middle of the mount tree, aren't you?

Yeah, but when you say you've mounted beneath, this is a concept that exists today. I can illustrate this. For example, if you have mount propagation, right? This is a bit slow. Can you hold this for a second? Here you have a tmp mount mounted on an /opt mount in that mount namespace. The root of that mount namespace, /, is a slave mount to the host mount namespace. So here you have an /opt mount, and here you can see that you have the /opt mount, which is now the parent mount of the tmp mount that had been mounted here before. So the mount is beneath the mount that already exists.

Yes, but there can be things that assume they know... they look at, say, /proc/<pid>/fd, and the path changes, because now you've got... I mean, it sounds like a very...

I wouldn't know why this would be a problem because, as I said, this can happen already today. They would need to be dealing with this already. Mounts can appear beneath your current mount. A couple of kernels ago it would be even worse, because you would have shadow mounts. In this case this would look different: think about this second /opt mount being moved one layer to the right, and these two mounts would shadow each other, and you would need to figure out that the lowest mount in this case is the one whose contents you're seeing. You know what I'm saying? In
the mount hash table, you insert based on parent and dentry, right? And then you look up the child mounts of this specific parent mount. And you still can today, Al pointed this out to me, in very specific, I would say pathological, scenarios, have a sequence of mounts that shadow each other. So they are mounted on the same dentry at the same parent; it's not unique. Tucked mounts, or in this case mounting beneath, was a way to get rid of this problem by having a clear parent-child relationship. You always will have a clear parent-child relationship in this scenario. And by the way, what I showed you before, this /opt mount propagation problem in the same namespace, is exactly the same thing: you're tucking mounts beneath the other one. So if this is a problem, that ship sailed 10 years ago. It all already exists. So the thing is, I wanted to avoid this problem. I think we've got something coming from Al. Oh, cool, okay.

If I may. The pathological case is indeed pathological, and it's not a good thing to have. But this sliding a mount under the one we are going to replace does not address another problem: if you have something mounted on subdirectories of that thing, that won't get migrated, only your sequence.

Okay, you've managed to slide the replacement under /usr. Say you've had something on /usr/local. Now you have the new /usr under the old one, and /usr/local mounted on the subdirectory local in the old file system. Now you want to complete the transition, you want to get rid of the old /usr. So you unmount it. If you do a lazy umount, fine, that will work, and that will expose the new one. But what is /usr/local doing? If you don't do it lazily, then it will just say busy. You might try to move /usr/local to the corresponding location on the replacement file system, but I don't think you want that to happen automatically. Your variant
of mounts is cool, but it's not convenient to do with your setup. I mean, you can't set it up beforehand. The natural way to do it would be to migrate /usr/local somehow to the corresponding location on the new one, then slide the whole thing under the old /usr, and then lazily drop the old /usr from that thing. The problem I have with that is basically usability: how inconvenient it would be to do if the thing you are replacing has something mounted on its subdirectories, not on its root.

Yeah, the subdirectories, that's the point that I was trying to make to David before. So basically what I would say in this case is, you just want a lazy umount if you want to upgrade to the underlying mount. Sorry, go on. So basically my answer is, I don't see this as a problem. This is fine.

But if you do that, then you still have that window where /usr/local is not visible.

Yeah, if you have submounts. So ideally you just replace single mounts. Replacing full mount trees might be doable, but I don't know if you really want to go down that route.

Well, you start with mounting the new one somewhere reachable for you, then recursively bind the subtrees mounted on the old one into the corresponding positions on the new one, and then slide the entire thing under the old one you're replacing, and lazily umount what you're replacing.

But you still have the window of opportunity where...

No, you don't. You start with, say, mounting the new replacement for /usr someplace. Fine. Now you bind /usr/local on the replacement's /usr/local, and do the same for the other file systems mounted on top of /usr. /usr and everything mounted on it is not disrupted at all, it stays as is. Then you slide the entire thing under /usr, you move it beneath /usr, and then you lazily umount /usr.
Yeah, so if I understand correctly, this would be possible in the scheme that I have here. This is perfectly fine. And you could even do it, I'm not sure because I haven't thought about this in detail, but I think you could even do it with detached mounts. Currently we have a limitation: you can't put additional mounts on anonymous mounts. So you can't do open_tree() with OPEN_TREE_CLONE and then take another detached mount and mount it on a subdirectory of that detached mount. That currently doesn't work because the check_mnt() check fails. There might be some specific reasons for it; please fill me in if that's the case. But if this is not an inherent problem, then you could even assemble it without having it visible in the file system at all at first.

There's a bit of a problem with it. What if you have a detached tree, and you mount something right on top of its root, and then you try to slide that stack of two mounts, one on top of another, beneath something? As far as I can tell, the code you've posted will end up with precisely that pathological case, the shadow mounts.

Yes. So in the new version that I posted, what I'm doing is, once I've acquired the namespace lock...

Okay, I'll need to check it. Haven't seen it, sorry.
No, no worries at all. I have path_overmounted(), which is a lookup_mnt() under the RCU lock, and if it detects that something has been mounted on the source, it rejects the mount. It says: go away, your source has been overmounted in the meantime. I think that's fine. This should be such a rare occasion. We could probably also try and make it so that it walks up to the topmost mount for the source mount, but I'm not sure that's actually worth it. How often will you end up in a scenario where your source mount gets overmounted?

Okay, that's probably best taken to email. We've taken a bunch of time already.

So one thing that I tried to block with this... Can I trouble you once more? Thank you so much. It's the case that I showed, where you have this mount explosion thingy because of mount propagation. Thanks to Al I actually thought about this a bit deeper, and if you could check this it would be quite interesting. If the parent mount that you're mounting on top of and the mount that you're mounting beneath are in the same peer group, so the parent propagates to the child mount on top, then we refuse to move a mount beneath, exactly to avoid this mount explosion thingy that we have in the current case.

Might be, but frankly it sounds like "Doctor, it hurts when I do it."

It sounds like what?
"Doctor, it hurts when I do it." With the canonical answer: "Don't do it, then."

Yeah, I know, but I find it really weird when you request a mount on top of something, the current semantics being mount on top of something, and then instead of one mount you get two. And it is completely meaningless, in my opinion, to first propagate a copy of the source mount on top of it, then take the source mount, mount it on top of the target mount, and then remount the mount that you just propagated on top of the mount that you just mounted.

Actually, quite often the bind mount of a directory on top of itself is followed by making it private, precisely to get it out of the propagation uncertainty, to get yourself a room where you can work.

Yeah, so the only thing that I really changed is that, when the parent propagates to the child mount, you get EINVAL. That's literally all. Which is a friendly reminder for the user: make the mount that you're trying to mount beneath private. My thing is, why should we repeat the same complexities that we already have in the mount propagation code?
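The refusal rule discussed here can be pictured with a toy check. This is a strong simplification and not the kernel's logic: it only compares peer group IDs, ignores slave relationships and the mount point position within the mount, and the names `ToyMount`, `may_mount_beneath`, and `make_private` are made up for the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyMount:
    name: str
    # Mounts with the same ID propagate to each other; None models a private mount.
    peer_group: Optional[int] = None


def may_mount_beneath(parent: ToyMount, top: ToyMount) -> bool:
    """Refuse when the parent would propagate to the mount sitting on top of it."""
    same_group = parent.peer_group is not None and parent.peer_group == top.peer_group
    return not same_group


def make_private(mnt: ToyMount) -> None:
    """Model of making a mount private: it leaves its peer group."""
    mnt.peer_group = None
```

In this model a shared parent and a top mount in the same peer group are refused, and making the top mount private first lets the same request through, which matches the "friendly reminder" described in the talk.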
Yeah. Consistency, maybe. But anyway, I have no strong preferences at the moment. I need to think about it.

Okay, yes, that sounds awesome. What I wanted to say here, for example: the way mount propagation works is also that, if you propagate a mount, the new mount point needs to be a subdirectory of the root of the mount that you're trying to propagate to. I know this is really a mouthful. But, for example, here the /srv dentry isn't a subdirectory of the /opt dentry. So in this case you are able to mount beneath, even though the parent propagates to the child, because the /srv mount wouldn't be propagated on top of the /opt mount before you actually mount on top of the /opt mount. I'm really sorry. Okay.

One thing we'll definitely need to remember is that there's no way that thing can go in without documentation, because trying to reconstruct it from the code will be unbearably hard.

I kid you not, sorry, no joke, I have a file that is 1600 lines long that explains all of the corner cases for myself. The commit message is extremely long, and I've added comments to all of the helpers. Specifically, the "parent doesn't propagate to child" and "parent doesn't propagate to the source mount that you're trying to mount" check also has a long comment trying to explain why that is blocked. And yes, I also want to have documentation for move_mount(), but I have to say, someone had man pages for certain system calls, but they have not made it in yet.

Useful advice: don't try to filter out obscenities, it takes too much time.

Sorry, what?

When you are posting your notes of that sort, don't try to filter out obscenities, in my experience.
It takes way too much time. Nothing gets done.

Okay, I'll keep that in mind. But this is a general problem for the mount API, to be honest. I've been a strong proponent of it, in the sense that I tried to push this into various user space projects. We now have util-linux completely converted to the new mount API, which also surfaced some smaller bugs. systemd uses the new mount API now almost exclusively. And the biggest problem is that, while the mount_setattr() system call that I added is extensively documented, open_tree(), move_mount(), fsmount(), fsopen(), and fsconfig() aren't really documented. So I spend a lot of time explaining to people how it works.

No, it's not your fault.

They just need to be merged. Once we have that merged, that'll be a good thing. One of the nicest features that we have, and that he's going to talk about tomorrow, is the ability to actually cleanly mount into mount namespaces, which is something that, even though I've worked a lot with the new mount API, I only figured out a couple of months ago: this actually works. So that's pretty good. I think this opens up a lot of possibilities for us that are quite awesome. And I didn't get to my third point, but it's also not that important, to be honest.

Oh, right.