So, this is probably more focused on user space than the previous talks. I'm going to talk about mounting file system images in user namespaces, or really in any unprivileged user space. The underlying problem we are trying to address is that everybody wants to be able to mount disk images containing arbitrary file systems in user space, within mount namespaces, without privileges, and that's a complex topic. The major problem, of course, is that kernel file system people generally do not want to guarantee that a rogue file system image couldn't exploit the kernel. So we have to establish some trust in the file system image before we can actually do this, and then somehow pass control back so that clients can attach it to their own file system tree inside their namespaces.

What I'm going to talk about are some components we want to add to systemd, with the ultimate goal that this is just available everywhere, at least everywhere systemd runs, so that unprivileged user code, for example container managers that do not run as root, or any other kind of user tool, can just make use of this and mount file systems into their containers without requiring setuid, without requiring any persistent UID assignments or anything like that. Specifically, it's about code running in a user namespace being able to ask the host OS to mount a block file system from a regular file. This will require that basically every container gets access, in some limited form, to some IPC interface provided by the host. That is different from how containers previously worked: generally they only talked to the kernel API, and maybe in some limited fashion to the immediate container manager. In this proposal they would additionally get limited access to some IPC of the host OS.
Briefly, the use cases are container managers, build tools that want to build images for containers, desktop application runtimes that want to be able to run apps off disk images, and basically any other tool that deals with disk images, for example just to enumerate their contents. The complexity is primarily in establishing trust in the integrity of the disk images, so that they can't exploit the kernel. The focus is both on immutable images, where things like dm-verity make life easy, and on writable images in some form. The focus is also on minimizing reaching over into the calling namespaces: we would rather avoid having to enter the namespace from the host, do some work there, and so on, and thankfully we can. And we want to allow a certain level of recursion, so that you can have nested containers and it still works in a reasonable way, without any special cases.

So, the general concept we have in mind (this was supposed to show up step by step, but it's a PDF now, so you get the whole wall of text at once): you have an unprivileged client process P, which could be a container manager or something like that, which allocates a user namespace U without any mappings. It talks to an IPC service that runs on the host, provided by systemd, and passes the file descriptor for the user namespace to that service, which assigns a transient UID range. That's one of the key ideas here: we no longer do persistent UID range assignments, but transient ones. As long as that user namespace exists, it has that UID range assigned; once the user namespace goes away, the UID range can be recycled for something else. This is systematically different from how user namespaces are currently used, where allocation is always static.
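The transient assignment just described can be illustrated with a toy allocator. Everything here (pool base, slot count, class name) is invented for illustration and is not the actual systemd interface; the point is only the allocate-on-create, recycle-on-destroy lifecycle:

```python
# Toy model of transient UID-range assignment: a range is handed out for
# the lifetime of a user namespace and recycled once the namespace goes
# away, instead of being statically reserved in /etc/subuid forever.
# All names and constants here are invented for illustration.

RANGE_SIZE = 65536        # one full 16-bit UID range per namespace
POOL_BASE = 1 << 31       # hypothetical start of the transient pool
POOL_SLOTS = 16           # concurrent namespaces this toy pool supports

class TransientUidPool:
    def __init__(self):
        self.free = list(range(POOL_SLOTS))   # free slot indices
        self.owner = {}                       # slot -> namespace id

    def allocate(self, userns_id):
        """Assign a UID range to a user namespace; returns (base, size)."""
        if not self.free:
            raise RuntimeError("no transient UID ranges left")
        slot = self.free.pop()
        self.owner[slot] = userns_id
        return (POOL_BASE + slot * RANGE_SIZE, RANGE_SIZE)

    def release(self, userns_id):
        """Called when the namespace goes away: recycle its range."""
        for slot, owner in list(self.owner.items()):
            if owner == userns_id:
                del self.owner[slot]
                self.free.append(slot)
                return
        raise KeyError(userns_id)
```

In the real design the release step is tied to the lifetime of the user namespace itself, for example by watching the namespace file descriptor, rather than an explicit call.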
Users get some range in /etc/subuid or so assigned, and it stays theirs forever, which doesn't really scale and is generally a bad idea, I think. Once they have the user namespace with their transient UIDs, that namespace also comes with an implicit security policy applied, so that anything running inside it can only create files, chmod, and do a couple of other things on an allowlisted set of mounts. Initially that list is empty, so they basically cannot create anything anywhere. Then the client talks to another service, which is actually capable of mounting images: it passes in a file descriptor to the image file it wants mounted, plus a file descriptor to the user namespace, and that service mounts the image and returns a file descriptor to an fsmount that already has the UID mapping established, mapping everything back to the user namespace's mapping. The net effect is that this mount is automatically allowlisted in the security policy I mentioned. So a client can say 'I have my image here, I want to mount it', gets the new-style mount file descriptor back, and eventually just attaches it to some location in its file system tree, and everything is in order: it always matches the transient UID range they got assigned, and everybody's happy.

That's really all there is to the concept. It looks like a lot of steps, but for the client application it's really easy: one IPC call to get the user namespace set up, and another IPC call to get a certain mount allowed. From the client's view it just allocates the user namespace, asks one service for the mapping, asks the other service for the mount, and attaches it wherever it wants.
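The implicit policy just described, deny ownership-changing operations everywhere except on an allowlisted set of mounts, can be modeled in a few lines. This is purely a conceptual sketch in Python (the class and names are invented); the real enforcement is a BPF LSM program in the kernel:

```python
# Toy model of the per-user-namespace security policy: operations that
# would persist the transient UIDs (create, chown, chmod, ACL changes)
# are denied unless they target an allowlisted mount. The list starts
# empty, so initially nothing can be created anywhere.
# Illustration of the decision logic only, not actual BPF.

GATED_OPS = {"create", "chown", "chmod", "set_acl"}

class UsernsMountPolicy:
    def __init__(self):
        self.allowed_mount_ids = set()     # empty at first: deny everywhere

    def allow(self, mount_id: int):
        """Called when the mount service hands out a UID-mapped mount."""
        self.allowed_mount_ids.add(mount_id)

    def check(self, op: str, mount_id: int) -> bool:
        if op not in GATED_OPS:
            return True                    # policy only gates ownership ops
        return mount_id in self.allowed_mount_ids
```

In the real system the allowlist entry is added automatically when the mount service returns a mapped mount for the namespace; note also that the ACL case is the one operation the current hooks cannot yet restrict, as discussed later in the talk.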
By the way, I put this in a strict order where you join the user namespace at the end, but that's not a requirement; you can join in the middle of the two calls. To be more specific: I just called these two services X and Y, but they are actually implemented in systemd as systemd-userdbd and systemd-mountfsd. The reason I used X and Y is that the concepts are entirely generic; if people do not subscribe to the systemd view of the world, they can always implement this on their own very easily. What's particularly interesting is that the security policy I mentioned is implemented with a BPF LSM, because we need some way to say that everything running inside that user namespace has to go through the allowlist of mounts, and the BPF LSM makes that possible, with one limitation I'll talk about later.

The disk images that systemd-mountfsd mounts follow the DDI spec, because we need the dm-verity data and the file system together. The way this works is that we say: give us a GPT disk image, a partition table with two partitions in it, the verity partition and the actual file system, plus a third one containing the signature for the verity root hash. We then go to the kernel, hand it the root hash with its signature, the kernel verifies that against the kernel keyring, we get the dm-verity device set up, we mount it, apply the UID map, and give it back to the client. By following the DDI spec this stays easy, because people don't have to hand us a lot of different resources; there's just one regular file which includes everything.

The big problem, of course, is establishing trust; as I mentioned, dm-verity is a big part of our solution there. If you have a DDI image that contains the file system, the dm-verity data, and the signature, then we can just pass all of that together to the kernel; the kernel makes its decision and sets things up for us. This is probably the primary way to establish trust, but we can also add other mechanisms if kernel policy allows, like trust by location: you have a trusted directory, and if an image is placed there it's implicitly trusted. That's for when some downloader can establish trust, verify everything, and then just put the image there; it's a weaker model, of course.

The third one is writable images. dm-verity is not going to work for that, of course, so the idea (the bottom part doesn't exist yet; the rest is implemented already and will probably show up in the next systemd release) is that for writable images we would use dm-integrity with an HMAC hash function, where the key for the HMAC is maintained by the system and inaccessible to anyone else. This allows our little service to do an mkfs with a sidecar file of integrity data that nobody else can fake, because they don't have the secret key we feed into the HMAC. User space can then keep its file system image plus the sidecar file with the integrity data, and any time they want they can come back to us, hand over both, and we mount it again for them. But if they lose the sidecar, we can't mount the image for them anymore. The purpose of this writability is that, for example, a container manager that wants to build images can just come to us and say 'I want a disk image, please set up an ext4 (or whatever) file system of this size', and as long as they keep the sidecar around they can mount, unmount, and mount it again as long as they want.

There was a question; let me repeat it, because the mic disappeared. The question was regarding
whether signed fs-verity files could be used as well. Right now only the policies mentioned are implemented, but of course it's user space; we can add whatever makes sense, and whether it's dm-verity or fs-verity doesn't really matter, as long as there's some mechanism. Ultimately my goal is always to let the kernel decide, matching against the keyring, instead of us doing trust enforcement in user space; that's always worse. But yeah, send us a patch; it sounds pretty uncontroversial from the outside.

One thing I think you could do: if you have an image file, systemd-mountfsd could copy that file into a block device, so that the unprivileged namespace doesn't have access to the block device, and then run fsck on it, so you know the file system is consistent. A lot of the security concerns were about a maliciously modified file system triggering a kernel bug, and if it passes fsck and you then mount it nosuid, nodev and so on, you might be able to give the container a writable image without having to use dm-integrity. Just a thought.

Interesting. We actually do call fsck on the image in the writable case already, so that part is given, but it was news to me that file system engineers subscribe to the philosophy that an fsck can establish trust. I see a head shaking, so it's not universal.

It's going to depend on the file system.

And for ext4 you would say fsck is enough to establish trust?

Yes. I am reasonably confident that it is not possible to maliciously corrupt an ext4 file system such that you could cause a buffer overrun or some other kind of privilege escalation attack in the kernel.

What about algorithmic problems, where they modify the data structures enough that they are still valid, but runtime performance is just terrible?

You could heavily fragment the file system such that access to it would be slow, but that's not a privilege escalation attack, so this depends on your threat model. Obviously this is a policy configuration question, but if you have a sufficiently paranoid fsck, then yes, you can make that work. You could imagine someone so paranoid they don't even want to allow that, but I'd actually be pretty confident about it.

Very good to hear; that sounds like something we should actually implement, though with an allowlist of file systems, I guess.

Do I trust that our fsck is going to give you a valid file system? Yes, of course I do. But I also understand that I am not perfect, and that there's likely to be something messed up in the kernel that fsck doesn't necessarily capture, or vice versa. So I think it's a good solution, but in a high-security environment I wouldn't use it as the sole source of truth.

I'm sure it's not a high priority for you, but have you thought about network file systems here, about allowing unprivileged users to mount a network file system? NFS in particular, which we just added TLS support to.

That sounds much less risky, doesn't it? One thing we certainly want to enable is that you can pass the service a file descriptor to an arbitrary directory, and we will then generate a bind mount, apply a UID mapping to it, and give it to you. That case is relatively riskless, because the thing is already mounted. Isn't the network file system case something like that? Do any of the network file systems implement UID mapping right now? I'm not sure
how well. NFS has UID mapping, yes, but that's a totally separate concept; the one you're interested in, no. But I had posted patches for CephFS, and it worked fine there; it was just that the server implements restrictions based on the caller's fsuid, so a few more modifications would be needed, more of the operations would need to be made aware of these ID mappings. The patch set works, though, and someone constantly keeps pinging me about it, because they keep running it in production.

So a malicious server is just as bad as any malicious disk image; we probably do want some restrictions there.

In particular, we just added TLS support to the NFS server (the client doesn't have it yet), but that might be one way you could do this: you could say you're only going to allow mounting things that have a certificate signed by a trusted party. And then there are automounts, cross-domain links, and all sorts of things you can find on network file systems. You could do the same kind of thing with SMB, which in the Windows case supports QUIC: if you have an encrypted connection, maybe you allow it. But I think the UID mapping part is super important, because there are two parts to this. I don't know of any sane way, maybe you do, of doing UID mapping centrally; I mean, RFC 2307 doesn't have a concept of containers.

That's a different problem.

Yeah, that's a separate question. But anyway, back to the question of ID mapping: the Hyper-V people years ago added ID mapping for containers to the SMB protocol, so if it would help you, there is container ID mapping inside the protocol; we just never cared, because nobody ever asked for it.

Basically this daemon is entirely generic: it has one method, you pass in the regular file, and then we do our thing. But I have no illusions; it's going to grow and add a couple of features sooner or later. So if mounting network file systems is something people absolutely want, we can add it, as long as there's some kind of sensible security story in place.

Anyway, let me finish with my slides; I don't know how much time I actually have, I've probably run over already. Okay, so unsolved problem number one: the security policy is implemented with the BPF LSM right now, and that doesn't allow restricting ACL manipulation. What we really don't want is that, having been allocated a user namespace with a transient mapping, I can use it to write my UIDs to some other file system. We can block all of that, except via ACLs. Maybe it's not a big problem, but it is a problem.

The way this currently works (stop me as soon as I give stuff away) is that it allowlists based on mount IDs, and there are path-based hooks already in most of the VFS, for better or worse; the only one that doesn't take a path (even though that's not necessarily a problem) is just the vfs_set_acl method that I added, and once you have a hook there, that shouldn't be a problem either.

From a security perspective, I think this is the only vulnerability, but it's not a major one, I'd say, because it's not about owning resources; there's simply a little to lose if people can do this, so I'm not too concerned. But this stuff works, and I can demo it, unless there's anything else we want to discuss first.

The demo basically consists of me running systemd-userdbd, the thing that allocates the user namespace ranges, in one window; that's not particularly interesting, you just see that it registers all the BPF LSM stuff. In the other window we start systemd-mountfsd, which is
responsible for actually mounting something. What's interesting to note is that systemd-userdbd is going to run inside of every container, because it needs to be able to hand out UID ranges that are available in that specific container, while systemd-mountfsd runs on the host, because it's the one that actually does the work; that's the reason there are two daemons instead of one.

So we have these running now, and then we use a third window. systemd-dissect is a tool, just part of the systemd tools, and all it does is look inside a disk image and give you a manifest of what's in there. It runs without privileges, and if I run it, it just works, and it shows you all the right UIDs: this is actually UID 0, not some nobody-65534-style UID. In the background you can see that two IPC requests are made, one to get the assignment (you can see it here, well, it just scrolled away) and another one on the other side, and that's really it; it's extremely simple.

It's a bit of a showcase of how the recent kernel additions can be extremely powerful: it uses the new mount API, and uses it across namespaces. We allocate the mounts in the host file system namespace and set everything up there; only in the last step do we attach them to the destination namespace. It also uses the BPF LSM, which is the new hot shit, and does a couple of other things. Ultimately I'm kind of happy, because it's tiny, and it's socket-activated, so it doesn't even run all the time. The way delegation into containers works is that, for the mountfsd part, it's just one socket, and you're supposed to bind-mount it into the container at one specific location under /run. That's basically all that's necessary to allow this stuff; there's no further configuration or anything, it automatically determines everything from the user namespace you pass in. So it's really simple, because it's just the one IPC call you do anyway.

An advantage, for example, and I think I mentioned this before: the superblock isn't owned by the user namespace you're mounting the file system in, which means all the destructive ioctls you have on a btrfs or XFS file system aren't available. You can give out an XFS mount or a btrfs mount, but they can't just wreck the file system by using some wild ioctl, defrag or whatever. They do own the mount, which means they can unmount it, because ownership of the mount is separate from ownership of the superblock. I think that's a nice side effect, because the other consequence of making a file system mountable inside a user namespace is basically saying 'you own the superblock and get to do everything with the superblock that you want', which is usually an additional attack vector.

This is something I thought of after you'd moved on from the slides, so I guess I have a question first: in any of these scenarios, when you're mounting on behalf of an unprivileged context, does the unprivileged user have the ability to modify the underlying raw file system image?

It gets the ability to change whatever it wants, because the files are ultimately owned by that unprivileged user. But since the security policy is supposed to be that we enforce the verity data, that will result in EIO, and hopefully the kernel file systems can deal with that part.

But if the user can modify the image after the file system is mounted, and you rely on the verification, then you have another attack vector for the unprivileged user.

Sure, but that is exactly what the verity data is supposed to deal with.

Okay, I'm not super familiar with it.

Or fs-verity, or dm-integrity with an HMAC. And in the case of what Ted said about the fsck approach (he mentioned this already), we would have to make a copy first, because otherwise the TOCTOU attack is of course there.

Okay, sorry. Can you lock the file? If you've got a file from an unprivileged user that they want you to mount, can you just lock it against changes?

Mandatory locking? I thought that was considered kind of uncool these days. But anyway: yes, fs-verity, dm-verity, whatever, solve the issue properly.

To go back to network file systems: one aspect is the network namespace that's used. Would it be trivial to pass a network namespace fd as part of the create-mount call for a network file system?

Well, I guess that's up to the network file systems, whether they can even make use of this; I don't know NFS. You would probably have to have an extra mount option where you tell NFS 'by the way, use this network namespace for the mount', but that's probably something you should discuss with the NFS people, because I have no clue about it.

If I recall correctly, we just inherit whatever network namespace you're in when you do the mount. So I guess as long as systemd switches network namespaces, you could. But with the new mount API you could just add a thing to say 'here's an fd for the network namespace you want it to be in'.

If I remember correctly, it captures it when you create the superblock.

Yes, we could add an option to allow that to be overridden.

Anyway, the goal, as I already mentioned, is that this is just going to be in systemd, so that people don't have to argue about it, and it's going to be enabled by default; that's also the goal. Then, if people don't use it, it's their own fault. Anyway, that's all I have.
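The writable-image scheme discussed earlier (mkfs with a system-held HMAC key, integrity data kept in a sidecar the user cannot forge) can be sketched as follows. This is the trust argument in miniature, not real dm-integrity, which maintains its tags on a device-mapper target; the function names and block size are illustrative:

```python
# Sketch of the dm-integrity/HMAC idea: the service holds a secret key,
# computes a keyed HMAC tag over each block of the image at mkfs time and
# stores the tags in a sidecar; at re-mount time every tag is re-checked.
# Without the key, user space cannot produce a valid sidecar for a
# modified image, so tampering is detected before the kernel ever
# parses the file system.

import hashlib
import hmac

BLOCK = 4096  # illustrative block size

def make_sidecar(image: bytes, key: bytes) -> list[bytes]:
    """mkfs time: one HMAC-SHA256 tag per data block."""
    return [hmac.new(key, image[i:i + BLOCK], hashlib.sha256).digest()
            for i in range(0, len(image), BLOCK)]

def verify_sidecar(image: bytes, sidecar: list[bytes], key: bytes) -> bool:
    """mount time: refuse the image unless every block tag matches."""
    expected = make_sidecar(image, key)
    return len(expected) == len(sidecar) and all(
        hmac.compare_digest(a, b) for a, b in zip(expected, sidecar))
```

As long as the client keeps the sidecar around, the service can re-verify and re-mount the image at any time; lose the sidecar (or modify the image without the key) and verification fails, which mirrors the behaviour described in the talk.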