Hello everyone. I'm Stéphane Graber, I work at Canonical, and I run the LXC, LXD and LXCFS projects. I've been doing container stuff for well over a decade at this point, I think. You may notice there's a slight issue with the people here: I'm supposed to have Christian with me, who is, well, not here. He's been around all week, but because of the Aer Lingus rebooking fun that happened over the weekend, he ended up being rebooked on a flight that's leaving right around now. So instead of presenting with me, he's actually at the airport. I'm going to be covering the content for both of us, and hopefully I can answer any questions you might have on the stuff he would normally have been covering.

All right, so I'm just going to start with the basics. I'm not sure how familiar people are, honestly, with the user namespace. We've done similar talks at LSS, whether North America, Europe or online, for a few years now. But in case you're not too familiar with the user namespace, I'm just going to do a very quick intro. The user namespace was first developed by Eric Biederman. It's been effectively fully merged in the Linux kernel in either 3.8 or 3.13, kind of depending on what you consider to be fully usable. I tend to be in the 3.13 camp. So late 2013, effectively, as far as timeline, which means we're getting very close to a decade at this point.

The user namespace allows a process and its descendants to have a UID and GID map in place. It effectively allows a container to have something that looks like root, and something that looks like a number of UIDs and GIDs, that are actually mapped to something completely different on the host system. That gets you to the point where, in theory, root inside of that namespace would be about as privileged as a nobody user on the system, should they find a way to escape. That's the general idea there. Anyone can create them.
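To make the mapping idea concrete, here is a minimal sketch in Python of how an ID map translates between a namespace and its parent. It assumes the three-column `inside outside count` layout used by `/proc/<pid>/uid_map`; the helper names and the sample map (container root mapped to host UID 1000000) are made up for illustration, and the kernel's own handling has more rules than this.

```python
from typing import List, NamedTuple, Optional

OVERFLOW_ID = 65534  # default overflow UID/GID reported for IDs with no mapping


class MapEntry(NamedTuple):
    inside: int   # first ID as seen inside the namespace
    outside: int  # first ID as seen from the parent namespace
    count: int    # number of consecutive IDs covered


def parse_id_map(text: str) -> List[MapEntry]:
    """Parse uid_map/gid_map style text: 'inside outside count' per line."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        entries.append(MapEntry(int(fields[0]), int(fields[1]), int(fields[2])))
    return entries


def to_host(entries: List[MapEntry], inside_id: int) -> Optional[int]:
    """Translate a namespace ID to the parent's ID, or None if unmapped."""
    for e in entries:
        if e.inside <= inside_id < e.inside + e.count:
            return e.outside + (inside_id - e.inside)
    return None


def to_namespace(entries: List[MapEntry], host_id: int) -> int:
    """Translate a parent ID to how it appears inside the namespace."""
    for e in entries:
        if e.outside <= host_id < e.outside + e.count:
            return e.inside + (host_id - e.outside)
    return OVERFLOW_ID  # unmapped IDs show up as the overflow ID


if __name__ == "__main__":
    # Hypothetical map: 65536 IDs, container root is host UID 1000000.
    sample = parse_id_map("0 1000000 65536\n")
    print(to_host(sample, 0))          # host ID behind "root" in the container
    print(to_namespace(sample, 1000))  # a host user with no mapping
```

The second lookup is the case that matters for the escape scenario: an ID that is not in the map has no usable identity on the other side.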
Anyone can just create a new user namespace, and anyone can map themselves to any UID or GID they want inside of it. So pretty commonly you will just create a new user namespace and then promptly map yourself as being UID 0 and GID 0 inside of that namespace. If you want more complex mappings, because a normal user doesn't actually own more than just their own UID and GID, then you need a privileged process to help you with that. Most Linux distros, through the shadow project, ship a couple of tools called newuidmap and newgidmap. These are effectively setuid binaries that read a file, /etc/subuid and /etc/subgid respectively, which allows delegating a range of UIDs and GIDs to an unprivileged user so that they can directly use that with user namespaces.

Also, if you need to configure networking, it's kind of the same deal. If you create a new network namespace from inside of a user namespace, well, you get no networking whatsoever. You can create some network devices there, you can create a virtual ethernet pair, you can create a dummy device, but that won't really get you any kind of functional networking. If you want something that's connected to the outside world, you're going to need a privileged process to help you with that.
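As a rough illustration of the delegation files mentioned above, this Python sketch parses the `name:start:count` format used by /etc/subuid and /etc/subgid and checks whether a requested range falls inside what a user has been delegated. It is a simplification, not what newuidmap itself does (the real tool applies more rules), and the user names and ranges in the sample are hypothetical.

```python
from typing import Dict, List, Tuple

Ranges = Dict[str, List[Tuple[int, int]]]  # owner -> list of (start, count)


def parse_subid(text: str) -> Ranges:
    """Parse /etc/subuid or /etc/subgid style text: 'name:start:count'."""
    ranges: Ranges = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        parts = line.split(":")
        if len(parts) != 3:
            continue  # ignore malformed lines
        owner, start, count = parts
        ranges.setdefault(owner, []).append((int(start), int(count)))
    return ranges


def is_delegated(ranges: Ranges, owner: str, start: int, count: int) -> bool:
    """True if [start, start + count) lies fully inside one delegated range."""
    return any(
        s <= start and start + count <= s + c
        for s, c in ranges.get(owner, [])
    )


if __name__ == "__main__":
    # Hypothetical file content, not taken from a real system.
    sample = "alice:100000:65536\nbob:165536:65536\n"
    table = parse_subid(sample)
    print(is_delegated(table, "alice", 100000, 65536))  # whole delegated range
    print(is_delegated(table, "alice", 165536, 10))     # bob's range, not alice's
```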
So, just for the basics, I'm going to show how you get around to just creating a user namespace. I'm a completely normal user on a normal system, nothing fancy whatsoever. You use the unshare command, which these days pretty much comes by default on most systems. -U is "I want a new user namespace", and often you're going to want a few more namespaces just to make things more usable, so -p gets you a PID namespace and -m gets you a mount namespace. Then you've got -r, which is the flag that remaps your UID and GID to root, so UID 0 and GID 0 inside of the namespace. And lastly, because you're using a PID namespace, you should have the command fork for you (-f) so that you end up being properly inside of that namespace. You run that and, hey, you're root. As far as pretty much everything is concerned you are root, but you are root inside of a namespace with a map that just maps your own UID and GID over to that root user. Anyone can do that.

From that point on, in this case I've got my own mount table, so I can start mounting things if I want to, as an unprivileged user. Oops, I can't type today. Okay, yeah, so I can mount a tmpfs now, I can mount some virtual filesystems, I can do bind mounts. All of that works and is effectively available to anyone.

Now, going back to slides. What's the most common setup as far as actually consuming this stuff? Other than what I just showed, which is the single-user case where just a single UID is mapped through, most of the time you're going to want 65,536 UIDs and GIDs being mapped. That's simply because POSIX is kind of nice to have; a lot of things are going to get mad at you if they don't get their nobody/nogroup mapping. So most of the time you're going to be mapping that entire range through, just to make things more usable. You will also pretty often want additional namespaces. Most of the time, if you're going to try to run a normal Linux
application, you're going to want the PID namespace, the mount namespace, UTS (which is primarily for the hostname), the network namespace, potentially the cgroup namespace, and most of the time you also want the IPC namespace. So most container managers effectively just take all of them and bind them all together, but you can still mix and match depending on what you want to do. We've definitely seen use cases where you may want to share your PIDs with the host, or you may want to share your network with the host, in which case you just don't create a new instance of those.

You can run with the full host map if you want to. As I said, in this case I just had one UID and GID configured, but with a privileged process you can write whatever you want, including creating a user namespace that effectively has no real map, by writing the host map, so the entire range of possible UIDs and GIDs, into the namespace. You can do that. You shouldn't do it. You should never, ever map the real UID 0 to anything, because then you've got the real UID 0, which is a very bad idea. But you can.

The other thing that's kind of interesting is that you do get UID 0 in there, and it does look like you have every single capability, but you need to keep in mind that that's not actually quite true. You effectively have those capabilities against resources that belong to the namespace. That's why in the kernel you've got both capable() and ns_capable(), the ns_capable() one being what actually checks that stuff. If something is doing a normal capable() check, you're always going to fail it from inside of a user namespace.

And if you want extra security, what you should probably do is actually have non-overlapping maps for every one of your containers, every one of your environments. That's mostly to avoid issues where there might be some kind of resource limits or resource counters that are tied directly to a UID, which could then be used from within two different containers, and you could
effectively DoS the other one.

All right, so just back to demo stuff. I'm just going to be creating a few containers here real quick. I'm just checking this works... yeah, 22.04, u1, okay. So this is going to be just a basic container as we create them in LXD. It's going to be using a user namespace, but it's going to be mapping a lot of UIDs and GIDs. We can see it maps, what is it, like a billion UIDs starting at one million, and the same thing for the GIDs. That's how it works in our default, but you can also set security.idmap.isolated=true, which then has LXD look for a chunk of 65,536 UIDs and GIDs by default. So if I go in u2 and I look at /proc/self/uid_map, and if I then do the same thing in u3, that shows you the safer case, where we're effectively using different maps for each container.

All right, now let's talk about the filesystem, because that's where things get interesting with containers. In the early days, and that's still the case for most people dealing with the user namespace and containers, the user namespace maps UIDs and GIDs, and the filesystem also needs to store UIDs and GIDs, but what the filesystem will store is the host value. So effectively the map is applied, and then whatever you end up with on the host is what ends up on the filesystem. There's an exception to that, which is for virtual filesystems. If you're inside of a user namespace, and inside of a mount namespace that's tied to it, and you mount something like tmpfs or FUSE or whatnot, then what's sent through to the filesystem is the UID and GID as you see it within that namespace instead of the translated version.

That's mostly okay. That was a convenient design of the user namespace that meant we didn't need to do anything like extended attributes to keep track of UIDs and GIDs all over the place. It just worked on pretty much every filesystem, everywhere. That was convenient, but there are some issues with that. The
most common issue you're going to get is sharing files with the host. That can be kind of useful: you're running a container and you want to share, I don't know, some directory in your home directory, or you want to share some system files or whatever. You technically can. The main problem is that because those files are going to be owned by UIDs and GIDs that are not resolvable, they're going to show up as the overflow UID and GID, so it's all going to show as 65534. That's a bit annoying, also because that UID can actually exist inside of the container and not be the same thing. Effectively the hard-coded overflow UID is a valid UID, and it gets a bit wonky. Plus it doesn't let you write in most cases; even if you've got full write permissions, it doesn't really work.

The other case that's very important is sharing root filesystems with other containers that are either using a different map or not using a map at all. That's because it's what, unfortunately, most people do: Docker, Kubernetes, all of those application containers that are based on top of layers. Because those layers are effectively used as a shared root filesystem for a variety of containers, they can't themselves have different UID and GID maps. If the layers were themselves already shifted, then it would be fine for one container and it would be showing as the overflow UID for another, so that doesn't actually work. That's been a bit of a problem.

The other problem would be if you're running isolated containers, which I normally recommend people do in production, where you don't really want to have any kind of overlap between your maps. If you do that and you then want to share, I don't know, the tree of your web server or something like that between multiple of those containers, then it's not going to work. Whoever writes the stuff will be fine, and all of the others will see it as the overflow ID. So that's a bit of a problem, and I can try to show the way that
stuff currently works here. If I look at this guy here, I'm on purpose forcing it onto a setup that doesn't have any kind of idmap shifting or that kind of stuff, which is a bit hard because the kernel actually has some fixes for the issues I'm showing. It should get us a container soon enough; it's actually taking a bit longer to unpack than expected. Okay, there we go. So we could actually see it briefly show "remapping container". That's because it's an unprivileged container done the old-school way: we took an image that has the wrong UIDs and GIDs and then promptly had to rewrite all of that stuff.

As for what that actually looks like: if I go inside of the container, I should see that things look normal. If we look at how it's mounted, we should just see, in this case, that it's using ZFS. Fine, it's mounted, nothing weird. It's actually hidden in a mount namespace, so I just need to go and enter that. Okay, and then if I go look at the actual filesystem tree of the container, we can see that everything that's actually stored on disk is all shifted. That works in this case for LXD because we are running system containers; they each have their completely own standalone filesystem, so we can actually go and rewrite all the UIDs and GIDs on startup, it's fine. If you try to do the same thing with a Docker container, because it's using overlayfs, it's going to effectively fork the entire filesystem, which is not exactly ideal. So that's not really an option in that scenario.

So that's kind of the state of things. Now, what can we do about it? Well, our first attempt to do something about this was with something called shiftfs. Occasionally we forget the f, the first one. It's been a bit of an issue. It was designed by James Bottomley originally, then picked up by my team at Canonical. We did a bunch of work to make it function. It is still part of the
Ubuntu kernel because, unfortunately, the nicer way of doing this, which I'll present next, does not yet cover all the filesystems we care about, so we still need to keep it around for a little while longer.

The way it works is that it's effectively an overlay filesystem. It's very similar to something like overlayfs, and pretty much all that it does is let you do UID and GID translation across user namespaces. It's got a lot of issues. It gets particularly funny with a lot of the kernel caches, the VFS caches and that kind of stuff. We tend to have a lot of things not getting properly invalidated and some weird behavior where we see the wrong content of a file, or the wrong permissions, or the wrong everything. We've been running into a lot of weirdness around that. The other thing that's particularly fun with filesystems is ioctls, when you are running an overlay that reports itself as being a different filesystem but then still needs to support the ioctls of another filesystem. Say we were supporting btrfs subvolumes, for example. Yeah, that's been fun. We did manage to mostly make it work in the cases we cared about, but it's a massive headache, and that definitely proved that this was not the right way to do it. It was still a good exercise.

Now, if I switch back to my laptop here, I do have that u1 container I created earlier. This one should actually be running on shiftfs. Yeah, so here we can see that the mount is actually listed as shiftfs, and there's a passthrough=3, which is the mode we need to get ioctls and that kind of stuff. Now, to go look at it, I first need to get into the right mount namespace. So if I'm in that mount namespace now and I go look at the u1 container's rootfs under the LXD containers directory, shiftfs does do the job, in that unlike the previous tree I showed you, this one is effectively shifted by shiftfs, and so it's written with the normal, actual IDs. What you see inside the container, it matches
the outside of it as far as the filesystem is concerned. Not as far as any of the other security bits: all the processes and everything are still correctly shifted, but the filesystem is not. That means that creating those kinds of instances is effectively instantaneous, because we don't need to go and rewrite the entire tree. We can also easily share mounts with the host, no problem. We can share stuff between two containers. So all of those issues are effectively solved by this, other than it being a disaster because, as a filesystem, it doesn't deal well at all with all of the edge cases.

All right, so what's the proper way to fix this? Well, we actually have it. It's VFS idmap shifting, or idmapped mounts, in the kernel. That was done by Christian Brauner, who was on my team at the time and who was supposed to be presenting alongside me now. That's the real solution to our problem. It's done through an extra system call and flags, all in the kernel, within the VFS. There's currently support, and I might have missed a few because they keep adding a bunch, for ext4, XFS, vfat, btrfs, f2fs and overlayfs; those are the ones that I'm aware of as far as supporting this feature. We are working upstream with ZFS to get that one sorted. I also need CephFS to be sorted. For CephFS, Christian has a big patch that does it, but there are some security concerns, so it's not merged yet.

This doesn't use overlays at all. It's all done at the VFS layer, so it does the right thing around ioctls and all that stuff. There's no weirdness at all, and it actually pushes some of that logic into the filesystems, which can then decide what is safe and what is not. So it's really nice and convenient. That was introduced with 5.12. At the time, I believe it was ext4 and vfat, maybe, with XFS happening very shortly after. Btrfs, I think, was 5.15. Overlayfs is more recent than that. So that's been rolling out slowly. There's still a
lot more filesystems to go. There's currently no support for that on any kind of network filesystem either, so there's a bunch more things that are needed.

I can show you this guy in action, so I think we need to switch back to this system instead. For that one, if I launch 22.04 as u2, it's unpacking again, because this time it's not on ZFS. This time it's doing it on ext4, again because ZFS doesn't actually support this, so I need to run it on ext4 instead. The unpack took a little while. There we go. If I go in there and I look at /proc/self/mountinfo and just look at the main mount, this time we can see that it is reported as ext4, and if you look around the middle, in the mount flags, you can see idmapped. Now, similarly to the other one, the ownership is perfectly fine inside. And now if I go look, I think I must have it in the history, yep, and I go look at the containers/u2/rootfs directory under the LXD snap: exact same behavior as shiftfs, but with no shiftfs involved, and no weird bugs and caching issues and all that kind of weirdness. That's all gone, which is very nice and convenient. So that's the new way of doing things. It's great.

Now, moving on to a different topic: new namespaces. There's been some discussion over the years of "oh, we need to add so and so". I think the most recent namespace we've had is the time namespace, but that's been a few years already. There are a few more coming up, and it was quite interesting to go to Linux Plumbers and Kernel Summit this week and talk to people about a few things that are going on around there. The new pattern for a lot of the new namespaces is not to have them be completely standalone namespaces that would normally use their own clone flags and be completely separate. Instead, the new trend is to have them dangle off the user namespace. So you use the user namespace, and then you can enable some specific additional namespaces. That makes things easier.
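The mountinfo check from the idmapped mounts demo can be scripted. This is a small Python sketch that scans text in the `/proc/self/mountinfo` format and reports mounts whose per-mount options include `idmapped`. The function name and the sample lines are invented for illustration; on a real system you would feed it the content of `/proc/self/mountinfo` from inside the container.

```python
from typing import List, Tuple


def idmapped_mounts(mountinfo_text: str) -> List[Tuple[str, str]]:
    """Return (mount point, filesystem type) for mounts flagged idmapped."""
    found = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # Layout: id parent major:minor root mountpoint options [optional...]
        # then a lone "-", then fstype, source and superblock options.
        if "-" not in fields:
            continue
        sep = fields.index("-")
        if sep < 6 or len(fields) < sep + 2:
            continue  # not a well-formed mountinfo line
        mount_point = fields[4]
        options = fields[5].split(",")
        fstype = fields[sep + 1]
        if "idmapped" in options:
            found.append((mount_point, fstype))
    return found


if __name__ == "__main__":
    # Hypothetical sample, shaped like mountinfo but not from a real system.
    sample = (
        "1523 1490 8:2 /containers/u2/rootfs / rw,relatime,idmapped "
        "shared:812 - ext4 /dev/sda2 rw\n"
        "1524 1523 0:21 / /proc rw,nosuid,nodev,noexec - proc proc rw\n"
    )
    for mount_point, fstype in idmapped_mounts(sample):
        print(mount_point, fstype)
```

Splitting on whitespace is enough here because mountinfo escapes spaces inside paths, so each field stays a single token.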
And because technically there's nothing preventing you from having a user namespace that's effectively useless, by having the user namespace use the entire host map, you can still use that even if you do want to run privileged processes.

The first one of those would be for integrity measurement, so the IMA namespace. That's work that's been ongoing for quite a while now, mostly driven by Stefan Berger at IBM. There's actually been a v14 of this patch set submitted this morning, so it's definitely actively worked on. There was a full talk on it at LSS NA, so the one in Austin; you can probably find the recording for that. The namespace is tied to the user namespace, and what it does is reasonably straightforward too: it allows for the use of IMA inside a container, so that you can have everything you access effectively be measured and be able to run an IMA policy against it.

The other one we've been discussing this week is the tracing namespace. That was brought up by Mathieu Desnoyers, and again it kind of makes sense based on the name: it allows for running tracing tooling inside of containers. That would then allow tracing any task within the container or within one of the children of that container. That can be pretty useful. It's going to be interesting. I think it's one of those namespaces that's going to have to grow over time, where a very, very small subset of things would be allowed at the beginning and then that would grow, because if you think of what you can currently access from something like bpftrace, it's a bit scary. So it's going to be difficult to make it perfectly safe, but that will just put the infrastructure in place, and then bit by bit more and more can be made available. That should be a very useful feature for debugging and similar things.

There's also been a separate discussion, kind of related to that, around restartable sequences and how that all fits with containers
and cpusets specifically. The current way they're trying to get some of the vCPU ID feature of restartable sequences is to effectively pre-allocate a bunch of memory, one chunk per potential CPU ID. The issue is that it's literally per possible one, because CPU hotplug is a thing that exists, so you need to look at the total number of CPUs you might ever have on the system. I think on most x86 systems that's going to be 64 or something; it depends on the distro, some of them clamp it down, and firmware can also set it lower, I believe. But that could create some issues by having to allocate a lot more memory than you want. So there was a discussion around "we'd like this to be kind of tied to the cpuset", and also around where those maps then go.

The current plan, but that was towards the closing of Plumbers and things will probably change, would be to tie some of the maps to the IPC namespace, because that feature is primarily used for IPC, and then also look at the cpusets to figure out the total number of CPUs that can be accessed, and use that instead of allocating all of the possible ones. There seem to still be some difficulties there, because it's possible to reconfigure the cpuset cgroup to have more CPUs. And if you put checkpoint/restore in the mix, which is always very fun, then it's possible to take those tasks and move them to another system that's got a completely different maximum number of CPUs. So there was also some discussion of maybe capturing that inside of CRIU, so that if you're live migrating tasks from one system that's got, let's say, a limit of 32 possible CPUs to a system that's got a limit of 64 possible, you would effectively have CRIU automatically set up the cpuset on the receiver to match the maximum of the source one, kind of avoiding that issue. So there's still a lot going on around that. It's a pretty interesting topic. It's going to be interesting to see what the final solution ends
up being.

And again, a different kind of topic: how to restrict the user namespace. I said it's been around for about a decade, but there's still a potential attack surface. It still exposes a lot of stuff in the kernel that normally would not have been exposed to unprivileged users. Pretty much all the bugs we can find from that are not bugs of the user namespace; they're pre-existing bugs elsewhere in the kernel that just were not really exposed before. But still, people often want to restrict that, and it used to be done with a pretty darn big hammer of "let's just not build the thing into the kernel at all". That's less and less viable, because people like to run containers, and even outside of containers, we've seen web browsers using user namespaces, we've seen a lot of things using user namespaces as a very convenient way to actually reduce privileges on specific processes. So just turning it off is not really an option.

Now, what can you do today? Well, today you can limit them a tiny bit. Effectively there's a bunch of ucounts that let you restrict how many namespaces of various types a user can use. That's okay. Sure, you can set that to zero, but then you don't really have anything anymore. You don't really get to control it per user, you don't get to control it per process or anything.

Your other option there would be to use seccomp to force a process tree to not be able to use the user namespace, by effectively doing flag filtering on things like clone and unshare. The problem is that clone3 is a thing that exists and we cannot filter it, because it takes a struct passed as a pointer, and seccomp can't filter on that. You can block it off entirely, at which point you just rely on the older clone, and that one is filterable. The problem is that eventually there are going to be a bunch more useful flags in clone3 that people will actually want, and we won't be able to just block it. The other option there is
that a bunch of distros have custom patches to add sysctls that let you either turn off the entire user namespace on the fly or turn off its use by unprivileged users. So that's kind of the state of things. I can just show you what that looks like as far as the flags. If you look at /proc/sys/user today, you've got those max_*_namespaces files. If you set one of those to zero, then nobody can unshare that namespace anymore. But that's about the extent of the flexibility around that.

So there's been some discussion to add a new LSM hook that would actually take care of that issue by allowing LSMs to make the decision as to whether a new user namespace can be created or not. The patch set came from Cloudflare, I believe from Frederick Lawler or something like that. It seems like an approach that the security community and a lot of people think makes sense. Unfortunately the maintainer, Eric Biederman, does not think it makes sense and doesn't really want to see any way to restrict user namespaces. So there are still more discussions to happen there to see how that can be done. Maybe there's a way to convince Eric to actually want this and make it go through. Or potentially, I don't know, getting something more generic, which is not just about the user namespace but more around the clone flags: having an LSM hook there looking at the clone flags and then being able to allow or reject the clone operation, or the entire operation, based on that. So that's still ongoing. I'm hoping to see that unblocked, because it would make it a lot easier for people. I don't like the big hammer of "let's just turn it off entirely". I'd much prefer having a distro or system policy that can allow just trusted processes, or whatever, to then be able to use the namespace.

All right, again a different topic, still related to the namespaces: system call interception. So, I mean, our
goal has been, for years now, to completely kill off the concept of a privileged container. A privileged container, for anyone not aware, is a container where there's no ID map in place: real root in the container is real root outside of it. Yes, you can try to drop capabilities and put LSMs in place and all that kind of stuff to try to paper over all of the issues, but at the end of the day you still have a bit of an issue. If anything runs as real root in the container, chances are they'll find a way to get out, and if they get out, they're full real root, which is never a good day. So we want to get rid of that.

The problem with unprivileged containers using the user namespace is that you have about as much privilege as a nobody user. That's a good thing for security, but depending on your workloads, that can break a lot of stuff. System call interception lets us get around that, because we can intercept specific system calls and send them to a privileged process outside of the container. That privileged process can do whatever checks it feels like and then perform the action on the caller's behalf, getting just the permissions right: taking the extra permissions needed to do the action but nothing else, and then effectively replicating as many of the permissions and as much of the setup of the caller as you possibly can. That lets us do things like loading and attaching BPF programs after validation, mounting filesystems, setting advanced scheduler flags, loading kernel modules. A lot of things that people tend to think we're kind of crazy for doing, but if we do the right kind of checks around them, we can actually make it reasonable.

So there are two different things I usually consider around system call interception. One is dealing with trusted resources. In this case, it's a system call that normally would not succeed, but we know that what it's running against is
actually perfectly safe. So even if we don't really trust the caller, we trust that the thing it's trying to access is safe, and therefore we allow it. This includes things like mounting some network or virtual filesystems. It also includes actually mounting full-on block devices, which is usually terrifying, because you don't want an untrusted chunk of random block device to be thrown at the superblock parser to see what happens. But if we know that the block device has never been exposed to an untrusted user or process, so they never had the opportunity to write to it, and the host itself is the one who allocated that block device and formatted it, then it's actually fine. So we may allow mounting in that case.

Same thing for BPF programs. It's also a very bad idea in many cases, but if we know exactly what program it is, we can validate that it looks like what it should be, and it's attaching to a specific point that we know is fine, we can allow it. The current thing we're doing for that: the devices cgroup under cgroup v2 is now BPF based, and we wanted that to work, so we've done the work to have a flag inside of LXD that specifically allows BPF programs for the devices cgroup to be attached.

There's also something that we still need to do: I also want to allow loading kernel modules. That one is extremely scary for obvious reasons, but the plan effectively is that we'd like someone to be able to load some specific, say, netfilter extensions. Not any of them, but specific ones that we know are fine. The idea there is that they would do the normal modprobe, which then calls init_module. We would look at what they're passing and figure out what the module is. We would absolutely not load it from there, because that would be an extremely bad idea. But then, after we've checked it against the allow list, we would go and load the host version of that particular module. So the caller gets the result they want; they now have the proc files
or netlink API or whatever that they wanted, but we never trusted the module they actually provided. We just checked that it's a module that we generally allow for dynamic loading. That saves us from having to preload a lot of modules that may or may not be used on the system. So that's something we're going to be looking at doing.

The other aspect of it would be trusted workloads. That's things like: yeah, we want to use user namespaces because we don't want to use anything else, but then it's a trusted workload running in there that needs quite a bit of extra permissions. Say you're building a full OS image using debootstrap, then creating an actual loop device that you then mount and everything. That's the other case, where we know this particular workload and we trust it. We don't necessarily want to make it fully privileged, because there's no real good reason for that; it only needs a tiny bit more privilege. So we can do it that way. The permission we would be giving could potentially be used to attack the system, so that definitely needs trust in the workload. But that's the other thing we can do, and it still saves us from running a fully privileged container.

All right, so in conclusion. The introduction of the VFS idmap, so the idmapped mounts feature, I really see as being a game changer, because for containers and user namespace adoption, the big, big elephant in the room for years now has been Docker and Kubernetes. That represents the vast majority of containers out there, and they are just not safe: they effectively run privileged, and they rely heavily on seccomp and AppArmor, or they rely on just switching user and not running something as root. That part is actually fine. The problem is that in a lot of cases they don't do that. It's kind of terrifying. I'm really glad that with this we've got
a solution that can be used for the kind of layered file systems that they are using, to still ship all of their layers unshifted and then be able, on a per-container basis, to decide: okay, this thing is going to be unprivileged, therefore I'm going to be using VFS idmap on top of that overlay file system, and now you get to run with the user namespace in place. And all of the work we've been doing over the years now, both enabling the seccomp notifier and then the user space side of implementing software to actually process notifications and increase permissions, means that even for the few images or cases they may have that would not work very well inside of a user namespace, a lot of those are effectively solved problems that we can safely check and safely allow.
So I'm really hoping that now... I mean, it's going to take a few years for that feature to be available widely. As I said, the overlayfs bit I think was probably merged in the last kernel release or something, so it's going to take a few years for that to hit LTS in different distros and for people to actively use it. But we've got a good path forward there, and I'm hoping that, I don't know, probably given another decade, nobody will use privileged containers anymore, maybe. That would definitely be great, and we'd definitely like to see that adoption, especially for local Kubernetes.
It's also very exciting to see new namespaces being added. The work on IMA is very interesting. It's going to be interesting being able to actually test something like IMA without having to test it in a full virtual machine or on your entire system; being able to do it on a per-container basis, that's really, really interesting. Kind of the same thing goes for tracing: that just makes it possible to have containers that feel even more like a virtual machine, even more like a standalone Linux system. So I'm also quite looking forward
to that work. I think it's a good pattern that we have now, that effectively all of the new namespaces are dangling off of the user namespace. That really simplifies the review, simplifies the infrastructure inside of the Linux kernel, and should allow for more namespacing of different parts of the Linux kernel as we see them.
Seccomp interception is also a great workaround for a lot of cases where normally we were just like: well, you're running as an unprivileged user, and unprivileged users just don't get to do that. The only way we had before to deal with that was patching user space to not do it, which is not great, or saying: okay, this thing absolutely needs a virtual machine or a privileged container. Now we've got another solution. It works pretty well. It is tricky; it's not very pleasant having to do all of that stuff, and the privileged process that runs on the host and needs to actually perform the action is extremely security sensitive. So there are still issues there, but there's at least a workaround we can use, and that's been working pretty well for us.
And yeah, my kind of conclusion: I'm hoping that given a few years, maybe a decade, we'll finally have privileged containers be a thing of the past, that we'll have containers that are safer in general, and, well, more namespaces, more features, containers feeling a lot more like individual systems.
And that's it for me. I believe we've got about four minutes for questions. If nobody has questions, we can do an early lunch. Any questions? Looks like people are hungry. Thank you. Did we get any online questions, maybe? Nope. Okay, well, I think there's going to be an early wrap then. Thank you very much.