Hello everyone, and welcome to the session. My name is Christian Brauner. I work at Canonical, and I'm a maintainer of the LXD, LXC, LXCFS and related projects together with Stéphane; we are the project leaders for all of these projects. I also work a lot on the upstream kernel and maintain various pieces in there. I have a pretty solid focus on containers, specifically unprivileged containers, both in the kernel and in user space, and in recent years also on the VFS and process management.

Today I want to talk about the new mount API that has made it into the kernel over the last two years. I find it pretty exciting: it has the potential to make a lot of the things we currently have to do in a clunky way in user space with containers, because of the old mount API, much simpler, and it also has great potential for extensibility, so newer features can be implemented in the future.

The plan is roughly this. First I want to introduce, in a rough sense, the concept of a mount and the different aspects it involves. Then I want to go over the old mount API and mention a few of its limitations, and then introduce the new mount API step by step by looking at the various syscalls involved. I also hope to be quite demo-heavy, but I need to see how we do on time. I'm not a fan of re-recording the same talk over and over again, so you get the real-life experience here.

So, switching to the next slide: mounting. Mounting comes in different flavors in the Linux kernel. The thing most people associate with mounts is mounting a proper file system, such as an external hard disk or a USB stick. That usually involves the creation of what the kernel calls a superblock, which is the kernel's representation of this new file system object.
But the Linux kernel also allows you to create what we call bind mounts. That means you can, for example, take an existing directory or file located on your file system and make it available at a different location. You can also bind mount an already existing mount point somewhere else, so this is quite flexible. Bind mounts are very important for containers. For example, when we share devices between the host and the container that are necessary for the container to work, we usually bind mount /dev/null and /dev/zero into the container from the host's /dev. So we actually bind mount single files into the container.

Bind mounts and the superblock can have different properties. A bind mount, for example, can be made read-only while the superblock is actually read-write. The other way around doesn't work: if the superblock is read-only, all bind mounts of that superblock become read-only. If the superblock sets a property, such as read-only, that can also be set on a bind mount, then usually the more restrictive property of the superblock overrides the bind mount properties, which is the sane choice. The superblock is essentially the representation of the hard disk or USB device for your operating system, and if you turn that read-only, you want every mount where it is visible to be read-only too. So this is quite consistent.

Internally, mounting is centered around a struct called vfsmount, which holds all of the necessary information associated with a mount, including various mount options such as the mount flags read-only, nosuid, and so on.
There are also mount namespaces. What people usually associate with mount namespaces is that you get your own private mount table, so that mounts and umounts in one mount namespace don't propagate to, and so don't show up in, another mount namespace. This is quite important for the container use case: if you run another init system in your container, it might mount or unmount various file systems, a tmpfs on /tmp for example, and you don't want that to affect the host. This is what mount namespaces are for.

Then you also have mount propagation, which we won't go into in any great detail today. Both of these things introduce quite a bit of complexity into the kernel, especially mount propagation, which you can think of as a bunch of tunnels between different mount namespaces. This is almost always how I envision it. It is out of scope for this talk for reasons of time, but you can of course ask questions after the session.

So, the old mount API. As you can see, or as most people probably know, the old mount API is based on a single syscall. I'm ignoring the umount syscalls that exist to unmount an already mounted file system; they are not relevant here, because for this talk we are only concerned with creating new mounts. It is a single syscall, and it is used for a variety of operations. You can create a new superblock by mounting, for example, a real disk device or a USB stick, like I mentioned before. It can also be used to create a bind mount of a directory, a file, or another existing mount. And it can be used to remount a superblock, so a whole file system, or to change the mount options of a mount point.
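To make the multiplexing concrete, here is a minimal sketch of three different operations that all go through the same mount(2) call. The function names and their parameters are hypothetical and chosen only for illustration; the flags are the standard ones from sys/mount.h. Running any of these for real requires CAP_SYS_ADMIN.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/* Operation 1: create a new superblock by mounting a block device.
 * Hypothetical helper; dev, target and fstype are supplied by the caller. */
int mount_device(const char *dev, const char *target, const char *fstype)
{
	assert(dev != NULL && target != NULL && fstype != NULL);
	return mount(dev, target, fstype, MS_NOSUID, NULL);
}

/* Operation 2: create a bind mount of an existing directory, file or mount.
 * The file system type is ignored for bind mounts, so it is NULL here. */
int bind_mount(const char *source, const char *target)
{
	assert(source != NULL && target != NULL);
	return mount(source, target, NULL, MS_BIND, NULL);
}

/* Operation 3: change the options of an existing bind mount, here by
 * remounting it read-only. Only the flags tell the kernel that this is
 * a remount rather than a new mount. */
int remount_bind_readonly(const char *target)
{
	assert(target != NULL);
	return mount(NULL, target, NULL, MS_BIND | MS_REMOUNT | MS_RDONLY, NULL);
}
```

All three helpers return 0 on success and -1 with errno set on failure, exactly as mount(2) does. The only thing distinguishing one operation from another is the combination of flags and which arguments are left NULL.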
So the old mount syscall is really just a multiplexer, and multiplexers usually aren't a great idea, because it means the syscall overloads different operations that would be better represented as separate syscalls. But it is what it is. The mount syscall has been around for a long time and people have lived with it for a long time, but obviously that doesn't mean we can't come up with something better if we need to redesign the mount API. And there were actually several reasons for that.

The API can be quite difficult to use. It's full of quirks and legacy behavior; that's just what happens over time. One quirk I want to point out is that making a read-only mount of an existing file system, bind mount, directory and so on is quite cumbersome. You cannot do it in one step. I'm not sure if you can see my mouse, but look at the first dash here. Creating a new mount for an ext4 file system is not a problem: you create a new ext4 mount for /dev/sda at the location /mnt with the MS_NOSUID flag. Below that, we create a new bind mount from /mnt to /tmp. Let's assume we wanted to make this read-only. It is of no use to specify MS_BIND in conjunction with MS_RDONLY: you would still end up with a read-write mount point, which is super unintuitive, right? To actually make the mount read-only, you then need to remount, specifying MS_BIND, MS_REMOUNT and MS_RDONLY on the same mount point, and only then do you end up with a read-only bind mount.

There is another weird quirk when you specify MS_BIND together with MS_REC, which means "make me a recursive bind mount, copying the whole mount tree at the location I specified", and then specify the MS_RDONLY flag for that.
Only the uppermost mount will be turned read-only. All of the other mounts will not be turned read-only. In my example right here, and I'm sorry for that, the first example should have read MS_BIND, MS_REC. The fact that you only turn the uppermost mount of the mount tree you clone read-only, and not every mount underneath it, is quite a problem and has led to CVEs in user space. But there is no nice way to fix this, because changing that behavior would break user space. The only option would be to introduce a new flag, for example something like MS_REC_RDONLY, but then you end up with a flag that combines two semantically different things: making something read-only and the recursive property. That's just weird. So this is one obvious limitation of the old mount API. And you're also getting into territory where you're slowly running out of flags, I think.

Also, as you might have noticed from the arguments, the API is purely path-based. Source and target don't need to be absolute paths, but they do need to be paths. It is not file descriptor-based, which is problematic if you want to do delegated mounting or, for example, share a file descriptor with another task. None of that works with the old API, and in the new mount API it actually does work. So you see there is a lot of room for improvement. But as I said, the mount syscall has served us well for a long time, and there is no reason to force everyone to use the new mount API. Maybe I can convince some people that it's actually worth it, though.

So let's switch to the new mount API. The obvious cool thing is that instead of being path-based, which I just criticized seconds ago, the new mount API is file descriptor-based. In fact, you can use the new mount API without using any paths at all, which is obviously always excellent for security.
And instead of having a single syscall that does everything at once, the new mount API is split into multiple syscalls. It separates superblock creation and modification from bind mount creation, and making a mount visible in the file system is a separate operation as well. That is great, because now you can have what are called anonymous or detached mounts. Since it's fd-based, it is possible to create a detached mount of, for example, an existing directory, and then not attach it to the file system. There is then no representation in the file system for the mount you just created. It is not reachable by traversing the file system, from a terminal for example. So you can have mounts that are private to a process, and the process can still use the fd to traverse the new mount and open and create files and so on. That's a nice property that the new mount API supports.

So let's look at the individual syscalls that comprise this new mount API. There will likely be another system call in the future, because it's not fully complete, to some extent.

The fsopen syscall creates a new file system context. The file system context is basically just in-kernel state, if you want to think of it like that, and it can be configured later on with additional syscalls. Let's look at a bit of source code while we're doing this; this is something I wanted to do. I'm going to stop sharing this now and share another window with you where we are in the terminal. I hope I made this big enough.

So let's look at fsopen. This is where the fsopen syscall lives, and this is where you create a new file system context in the kernel. We won't get into too many details. You need to be ns_capable with respect to the owning user namespace of your current mount namespace.
That means you need to have the CAP_SYS_ADMIN capability in the user namespace that owns your mount namespace. This is a restriction. It supports the FSOPEN_CLOEXEC flag, which makes the file descriptor it returns close-on-exec, and it is always good advice to use that by default. And it returns a file descriptor for the file system context it just created. So if you wanted to create an ext4 file system, you would call fsopen with, for example, ext4 and FSOPEN_CLOEXEC. That would be the system call to create a new file system context. The file system context by itself has no meaning. It's not a mount, and you can't do anything with it yet. It's basically just a representation of in-kernel state, of the in-kernel fs context. That is essentially what this system call is all about: it gives you a handle on in-kernel state.

Let's switch back to the presentation. So this fsopen context is an anonymous inode file descriptor, like most file descriptors in the mount API that do not represent actual file system objects, so files or directories in the file system. Anonymous inode file descriptors are basically a bunch of file descriptors that all share the same inode, because they don't need a full inode of their own. Think of them as representing some form of in-kernel state or an in-kernel object. In our case it's a context that is being kept track of. So the kernel now has a context for an ext4 file system and is waiting for you to do something with it.

Closely related to the fsopen system call is the fspick system call, which lets you create a file descriptor for an already existing superblock. So this must be a file system that already exists, and fspick essentially gives you a new fs context fd for it. It follows the openat pattern of opening a path, meaning you have a directory file descriptor as the first argument and a path argument, and they are interpreted in relation to each other.
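As a brief aside on the fsopen step described above, here is a minimal sketch of what that call looks like from C when issued through the raw syscall interface. The wrapper name open_fs_context is hypothetical. The fallback syscall number 430 and the FSOPEN_CLOEXEC value are taken from the kernel UAPI headers and apply to most, but not all, architectures; they are only used when the installed headers do not provide them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers (assumption: a common architecture such as
 * x86_64 or arm64; alpha, for example, uses a different number). */
#ifndef SYS_fsopen
#define SYS_fsopen 430
#endif
#ifndef FSOPEN_CLOEXEC
#define FSOPEN_CLOEXEC 0x00000001
#endif

/* Hypothetical wrapper: ask the kernel for a new file system context for
 * the given file system type, for example "ext4". Returns a close-on-exec
 * fs context file descriptor, or -1 with errno set. The caller needs
 * CAP_SYS_ADMIN in the user namespace owning its mount namespace. */
int open_fs_context(const char *fs_name)
{
	assert(fs_name != NULL);
	return (int)syscall(SYS_fsopen, fs_name, FSOPEN_CLOEXEC);
}
```

The returned descriptor is only a handle on in-kernel state. Nothing is mounted at this point, and an unprivileged caller would typically see the call fail with EPERM.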
So if the file descriptor refers to a directory and you specify a non-absolute path, then the path is resolved relative to the directory that the file descriptor refers to. If it's an absolute path, the file descriptor is ignored. And if the special AT_FDCWD value is passed as the directory file descriptor, then the path is taken to be relative to the current working directory of the calling process.

fspick also supports a range of flags. FSPICK_CLOEXEC makes the returned file descriptor close-on-exec. FSPICK_SYMLINK_NOFOLLOW specifies that you don't want trailing symlinks to be followed. FSPICK_NO_AUTOMOUNT specifies that you don't want to trigger automounts during path lookup, because the default is that automounts are triggered when you look up paths. And FSPICK_EMPTY_PATH specifies that the operation, creating an fs context for an existing superblock, is performed directly on the directory file descriptor, so the directory file descriptor must refer to the root mount of a file system. Both system calls give you back an fs context file descriptor for in-kernel state.

Okay, now you have this fd; that's great. This file descriptor can now be used to configure the fs context that it refers to. fsconfig allows you to set various file system-specific options. You have the fs_fd argument, which takes the file descriptor that the fsopen and fspick system calls give you. The second argument, the unsigned int cmd, takes one of the commands you see here on the screen, and based on the command you're passing, the key, value and aux arguments are used; which ones are used depends on the command. So, for example, if you want to set a flag on the fs context object in the kernel, then you pass the value through the key
argument: you specify FSCONFIG_SET_FLAG as the command and pass the actual flag you want to set in the key argument. You can set a string parameter, for example the source of a mount, via FSCONFIG_SET_STRING: if you want to set the source, you specify "source" in the key argument and the path in the value argument. FSCONFIG_SET_BINARY lets you set binary parameters, passing the blob in the value argument and its size in the aux argument. FSCONFIG_SET_PATH and FSCONFIG_SET_PATH_EMPTY are basically what you would expect; they give you openat semantics even with this syscall. If you specify FSCONFIG_SET_PATH, you can pass a directory file descriptor through the aux argument, and the value argument is interpreted as a path relative to it. And if you specify FSCONFIG_SET_PATH_EMPTY, the lookup is performed directly on the file descriptor passed in the aux argument.

The last ones are the really interesting ones for our talk today, namely FSCONFIG_CMD_CREATE and FSCONFIG_CMD_RECONFIGURE. FSCONFIG_CMD_CREATE finalizes the creation of an fs context. After this you cannot configure it anymore with the fsconfig system call, and the superblock is actually created in the kernel. This requires the caller to be CAP_SYS_ADMIN in the current user namespace if the file system is mountable inside user namespaces, proc for example, or CAP_SYS_ADMIN in the initial user namespace if it is not. FSCONFIG_CMD_RECONFIGURE finalizes the reconfiguration of a superblock, so of an fs context file descriptor that you got via the fspick system call we talked about before. So fsopen, fsconfig and fspick are concerned with creating and configuring superblocks for file systems, essentially, and
finally the fsmount system call. This is what we have been waiting for, because this is the point where we turn an fs context file descriptor into an actual mount. In order to get here we need to have called fsopen and fsconfig, and we must have called fsconfig with FSCONFIG_CMD_CREATE. These are the three steps; we will see this later in the demo. At that point you have basically said: I'm done configuring this file system context, and I'm now ready to turn it into a usable mount file descriptor.

So you pass in the fs context file descriptor you received, and in the flags argument you can specify FSMOUNT_CLOEXEC, again so that the file descriptor you get from fsmount is close-on-exec. Then in the ms_flags argument you can set properties on the mount you are now creating. For example, you can turn the mount read-only, as you see here, with MOUNT_ATTR_RDONLY. There is MOUNT_ATTR_NOSUID for no setuid binaries on this mount, MOUNT_ATTR_NODEV for no devices on this mount, MOUNT_ATTR_NOEXEC for no execution of binaries on this mount, and you can also set which access time options you want. And then fsmount gives you a file descriptor back.

Let's take the time to quickly look at this system call in the source code as well. The other ones are interesting too, but it would take way too much time to go through them all. Give me a second. For fsmount we need to switch to a file called namespace.c. There we are. So this is the fsmount syscall. It has the same restriction, may_mount: it checks that you are CAP_SYS_ADMIN in the user namespace of your current mount namespace. Then it checks that only valid flags are passed, and so on. And here the magic essentially happens. It takes this fs context file descriptor and verifies that it's actually an fs context file descriptor. You see
it right here: this is fscontext_fops. It verifies that the file operations belonging to this file are of the fs context type, so any other file descriptor will be rejected. Then it verifies that you're in the correct phase, essentially that you have finished the fsconfig calls. Then it creates a new mount and allocates a new mount namespace, you see it right here, and sets this all up. It opens the path for this new mount, then it allocates a new file descriptor and returns it to you. So this whole system call turns the fs context file descriptor into an actual mount, and from this point on you can do something with it: you can open files and so on.

This is a point to stress, so let me switch back to the slides real quick. There we go. This is important to notice: after this system call returns your file descriptor, you can interact with the file system you just mounted. You can open files in there by passing the fd to the openat system call along with a path, you can create new files in there, and you can operate on this thing without it having any representation in the file system. That is great, and it's something you couldn't easily do with the old mount API. With the old mount API you would have to mount it, get a file descriptor for it, and then unmount it. You get a sort of similar thing, but it still has to be visible in the file system at some point. With the new mount API, when you call fsmount, there is no representation of the thing in the file system yet.

A very cool system call which I want to touch on briefly is the open_tree system call. This is the system call with which you create actual bind mounts, something which I said was muddled together with creating a new superblock
for a file system in the old mount API. It works the same way as fspick: you can pass a file descriptor argument that refers to a directory, and you can pass a path alongside it. If it's a relative path, it is resolved relative to the directory file descriptor you passed in. And if you specify AT_EMPTY_PATH, you operate directly on the directory file descriptor.

It takes a couple of flags. OPEN_TREE_CLOEXEC obviously just makes the file descriptor returned from open_tree close-on-exec. If you don't specify OPEN_TREE_CLONE and the path is an existing mount, then no copy is created, but you get a file descriptor for this mount and you can reconfigure it. So this is basically remounting a bind mount. And OPEN_TREE_CLONE, this is really cool: it clones a mount and creates a new detached mount for you. The directory or file you're specifying doesn't need to be a mount itself, so it basically turns directories or files into mounts. This is what open_tree does, which is amazing. And if you specify AT_RECURSIVE alongside OPEN_TREE_CLONE, it copies the whole mount tree, so every mount under the given directory will be part of what the file descriptor that open_tree gives back to you refers to.

So this is really great. This is how you create bind mounts now, and detached bind mounts at that. It's the same principle as with the fsmount system call: the result isn't attached to the file system. So, for example, if you call open_tree on an existing directory with OPEN_TREE_CLONE and AT_RECURSIVE and then somebody unmounts the file system, you still have your own private mount. You can change flags on it, you can create new files and so on, which is pretty
amazing. Sorry, you can't change flags; that is something I'm working on. But you can create files and traverse the mount you just created. I'm making heavy use of this in LXD and LXC already, and it works pretty well.

Last but not least, and it looks scary and complicated but it's actually not that complicated, is the move_mount system call. This is the system call that takes a mount file descriptor, obtained either from fsmount or from open_tree, and attaches it to the file system, so it makes it visible in the file system. You have a variety of options that you can specify. from_dfd and from_pathname follow the same principle as with fspick; this is the source. to_dfd and to_pathname are the target. The resolution rules are exactly the same; it's just that source and target can both be passed in this style. And you have a bunch of MOVE_MOUNT flags that regulate how these two directory file descriptors and these two paths are supposed to be interpreted: whether symlinks should be followed, whether automounts should be triggered, and whether to operate directly on the file descriptor itself without resolving a path, which is the empty path flag. These are duplicated, obviously, so MOVE_MOUNT_F_SYMLINKS applies to from_dfd and MOVE_MOUNT_T_SYMLINKS applies to to_dfd.

So this is pretty great. It lets you do a variety of things. You can move mounts from one place to another, and you can attach mount file descriptors obtained from fsmount and open_tree, which have no representation in the file system yet, to specific paths in the file system. And that is basically it. I'm trying to be a bit faster here because I want to have sufficient time for some demos of how this works, because it's pretty theoretical and what
you really want is to program with this stuff. So I'm going to set up the demos, then I'll be right back and we'll continue. That's at least what I suggest. See you in a little bit.

So let's do some demos. First of all, I'm going to start presenting here and share my terminal with you. Hopefully this is large enough for everyone to see; I've zoomed in quite a bit. First I need to go into the demo I wanted to do, and this is probably the first thing we want to do. Let's just wipe this and start from scratch.

I prepared a little header that defines all of the missing syscalls and syscall numbers. For new syscalls the numbers are mostly the same on all architectures, apart from a few odd ones such as alpha, MIPS and ia64. I've taken care to define all of the syscalls we care about: open_tree, move_mount, fsopen, fsconfig, fsmount and fspick. Then there are the mount attributes we can specify with fsmount and the various flags for the individual syscalls, and then simple wrappers for all of the system calls themselves: fsopen, fsmount, fsconfig, move_mount, fspick and open_tree. There is also a little die helper, so that we can easily exit with an error message when an unrecoverable error happens. This is pretty straightforward and pretty simple. We're using C for this because it's close to the operating system and fairly easy to read.

So, the first example: you want to mount a real file system. It's pretty straightforward. What you would usually do is call mount with, in this case, /dev/loop10 I think, then /mnt, then ext4, then MS_RDONLY, and zero. This is the mount syscall that we essentially want to translate. We have a real ext4 file system that I set up already on /dev/loop10, so I mount it on a loop device
and what we want to do is mount this. Usually what you would do is ret = mount(...), and if ret is smaller than zero, we die with "failed to mount ext4 file system". I prepared my header correctly, good. So with the standard mount syscall, I run this, and we see we have mounted an ext4 file system at /mnt. I can umount this again, and now we want to translate this into the new mount API.

First we want to create an fs context, like I mentioned before: fsopen with ext4, and we make the file descriptor close-on-exec. If that fails, we die with "failed to create ext4 context". Then we need to configure the context we just created by setting the source: fsconfig with FSCONFIG_SET_STRING, the key "source", the value, which as I pointed out before is /dev/loop10, and no flags. If we get an unexpected return value, we just die with errno; to be a bit quicker, let's just say we failed at fsconfig.

Now let's say we are finished configuring our file system context. We want to finalize it and actually create the superblock, and to do this we call fsconfig with FSCONFIG_CMD_CREATE. Let's call this error message "create" and the previous one "set source". And then we finally create the actual mount. Sorry, I was telling you nonsense a moment ago: fsconfig is where we cause the kernel to allocate the superblock, and now with fsmount we are actually creating a mount. If the fd is smaller than zero, we die with errno and "fsmount".

Now, finally, we have a detached mount. Right now it still isn't visible anywhere in the file system. If we were to compile this, run it with sudo, and then run findmnt, you would see that it's not mounted anywhere. So we're missing a crucial step, and this is ret = move_mount. In this case we pass the mount fd as the source, and for the target we pass an invalid file descriptor value, because the target is not based off a file descriptor in this case, so we're not
moving based off a file descriptor there; we want to move to /mnt. And we specify MOVE_MOUNT_F_EMPTY_PATH for the source, so we're operating directly on the mount fd. If ret is smaller than zero, we die with errno and "move_mount". Now compile, run, and we'll see. Cool, we just created an ext4 mount. So this is what it looks like to mount a file system with the new mount API. It's a lot more syscalls, but it's cleaner and clearer, and usually you configure a lot more when you set up a superblock. If you don't want to fuss with all of that, you can still use the old mount system call, but this is sort of the future, and I think it's actually quite nice.

Now let's say you want to create a bind mount. Let's create a program called bind-mount, or actually bind-mount-recursive in this case. I'm being lazy and copying the header and the bottom of the previous example. First of all, the variables we need. Because I'm going to operate on my current rootfs, I'm going to create a new mount namespace, and this mount namespace will make sure that I'm not going to alter my current mount table and basically crash my computer. This requires us to include sched.h, by the way.

Now we do ret = mount, using the old mount API for this part, with MS_REC | MS_PRIVATE to turn my rootfs into a non-shared, private mount, so that nothing propagates into or out of my mount namespace. If that fails, we die right here with "mount". And now we call open_tree, which I introduced before. We pass -EBADF because we're not operating based on a file descriptor here, and we open our root directory. We specify OPEN_TREE_CLOEXEC to make the file descriptor close-on-exec, OPEN_TREE_CLONE because we want
to create a new detached mount, and AT_RECURSIVE because we want to copy every mount of our mount tree. And also AT_EMPTY_PATH... no, that's nonsense, we don't need the empty path here. Okay. If the mount fd we get back is smaller than zero, we die with errno and "open_tree" in this case. And I should learn how to type.

Now, finally, we again move this mount into place. We pass the mount fd, again -EBADF for the target because it is not based on a file descriptor in this case, and MOVE_MOUNT_F_EMPTY_PATH to specify that we are operating directly on the mount file descriptor for the source. If that fails, we die with errno and "move_mount". What we would expect to happen is that we recursively mount our complete rootfs at /mnt.

Actually, there is one step missing. What is the compiler complaining about? Ah, okay, that makes sense. Because we've created a new mount namespace, and mount namespaces only exist as long as they are either bind-mounted or referenced by a process, we're going to run a new shell at the end, and then you can look around. We want to see our whole rootfs appear under /mnt. So let's run sudo ./bind-mount. If this goes wrong because I misprogrammed something: great talk, thank you for being here. Okay, and now let's see. Oh yes, the whole rootfs appears under /mnt again. So we just created a recursive bind mount. I can log out, and the bind mount is automatically destroyed.

As you can see from these two examples of how the new mount API works, it's fairly straightforward once you have gotten used to not tying this all into one single syscall but rather stretching it over multiple syscalls. It really is way nicer. You create mounts, and the creation of the mount is independent of its appearance within the
file system, and the move_mount system call is where you actually attach a mount to the file system, which is just such a great tool. In any case, I hope you learned something and saw not just how the new mount API is structured but also how it can be used in programs. I'm probably going to make the demos available somewhere, probably on my home page, after I've given this talk. And that's basically it. We're almost out of time, so there should be instructions on how you can ask me questions, and I'm excited for that. So see you, bye.
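For readers who want to follow along without the recording, the two demos can be sketched roughly as follows. This is an editorial reconstruction under stated assumptions, not the speaker's actual demo code: the function names mount_fs and bind_mount_recursive and the helper close_keep_errno are hypothetical, the fallback syscall numbers and flag values come from the kernel UAPI headers for common architectures, and error handling returns -1 instead of calling a die helper. The second function also leaves out the new mount namespace, the MS_REC | MS_PRIVATE remount and the final shell from the talk.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers (assumption: common architectures). */
#ifndef SYS_open_tree
#define SYS_open_tree 428
#define SYS_move_mount 429
#define SYS_fsopen 430
#define SYS_fsconfig 431
#define SYS_fsmount 432
#endif
#ifndef FSOPEN_CLOEXEC
#define FSOPEN_CLOEXEC 0x00000001
#endif
#ifndef FSMOUNT_CLOEXEC
#define FSMOUNT_CLOEXEC 0x00000001
#endif
#ifndef FSCONFIG_SET_STRING
#define FSCONFIG_SET_STRING 1
#endif
#ifndef FSCONFIG_CMD_CREATE
#define FSCONFIG_CMD_CREATE 6
#endif
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef AT_RECURSIVE
#define AT_RECURSIVE 0x8000
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif

/* Close fd without losing the errno of an earlier failed call. */
static void close_keep_errno(int fd)
{
	int saved = errno;
	close(fd);
	errno = saved;
}

/* Demo 1: fsopen -> fsconfig -> fsmount -> move_mount. Returns 0 or -1. */
int mount_fs(const char *fstype, const char *source, const char *target)
{
	int fs_fd, mnt_fd, ret = -1;

	assert(fstype != NULL && source != NULL && target != NULL);

	/* Create the fs context, i.e. a handle on in-kernel state. */
	fs_fd = (int)syscall(SYS_fsopen, fstype, FSOPEN_CLOEXEC);
	if (fs_fd < 0)
		return -1;

	/* Set the source, then have the kernel create the superblock. */
	if (syscall(SYS_fsconfig, fs_fd, FSCONFIG_SET_STRING, "source", source, 0) < 0 ||
	    syscall(SYS_fsconfig, fs_fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0)
		goto out;

	/* Turn the context into a detached mount with no mount attributes. */
	mnt_fd = (int)syscall(SYS_fsmount, fs_fd, FSMOUNT_CLOEXEC, 0);
	if (mnt_fd < 0)
		goto out;

	/* Attach it at target; -EBADF marks the unused target dirfd. */
	ret = (int)syscall(SYS_move_mount, mnt_fd, "", -EBADF, target,
			   MOVE_MOUNT_F_EMPTY_PATH);
	close_keep_errno(mnt_fd);
out:
	close_keep_errno(fs_fd);
	return ret;
}

/* Demo 2: recursive bind mount via open_tree + move_mount. Returns 0 or -1. */
int bind_mount_recursive(const char *source, const char *target)
{
	int mnt_fd, ret;

	assert(source != NULL && target != NULL);

	/* Clone the whole mount tree at source into a detached mount. */
	mnt_fd = (int)syscall(SYS_open_tree, -EBADF, source,
			      OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC | AT_RECURSIVE);
	if (mnt_fd < 0)
		return -1;

	ret = (int)syscall(SYS_move_mount, mnt_fd, "", -EBADF, target,
			   MOVE_MOUNT_F_EMPTY_PATH);
	close_keep_errno(mnt_fd);
	return ret;
}
```

With the values from the talk, the first demo would correspond to something like mount_fs("ext4", "/dev/loop10", "/mnt") and the second to bind_mount_recursive("/", "/mnt"), both run as root. Because both source and target paths are absolute here, the -EBADF directory file descriptors are never consulted; they only make it explicit that no dirfd is in play.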