So, hey, I'm Christian. I work as a kernel engineer on the LXC and LXD team at Canonical, and I'm going to give the subsystem talk for namespaces and capabilities, or as Kees said, if containers are the piece that is painting a target on the kernel, we're providing the paint, essentially. So, a short introduction; most people know what namespaces and capabilities are, so I'm not going to do any sort of deep dive. Capabilities are essentially a way to split the root privilege into distinct units of privilege, if you want to put it like this, so that it's not just that root can do anything, but you can safely delegate certain types of privilege to unprivileged users. You have CAP_SYS_ADMIN, CAP_MAC_ADMIN, CAP_NET_ADMIN, and so on, and they all regulate certain types of things you can do. So, an announcement I can make is that a new libcap has now been released, on the 10th of September, 2018. Serge, Andrew Morgan and I added full ambient capability support and support for namespaced file system capabilities. The library has also moved, and Andrew Morgan is back maintaining it, so that's pretty cool. He had been AWOL for a couple of years, I guess. So, namespaces, what are namespaces? I always like to say it's basically a very lightweight virtualization method for various aspects of a system, and in contrast to virtual machines, which virtualize, well, basically the whole system at once, it's always just about some specific feature. And what I mean by this is we have now one, two, three, I always need to count, seven namespaces at the moment: mount, PID, UTS, IPC, cgroup, network, and user namespaces. I guess the easiest way to understand what a namespace is doing is to look at the oldest one, which is the mount namespace. So, ignoring mount propagation for now, if you create a new mount namespace, it duplicates your mount table.
So, all of the mounts you had in the initial mount namespace, you now also have in the new mount namespace. And if you do a mount or umount operation in there, ignoring mount propagation, it will not be reflected in your parent mount namespace. But that's misleading insofar as mount namespaces are actually not hierarchical. There are some hierarchical namespaces; mount is not one of them, but that's sort of the idea. Network namespaces isolate network interfaces. Cgroup namespaces give you the impression that you are at the root of a new cgroup tree, even though you're further down in the cgroup hierarchy. With IPC namespaces, POSIX message queues and so on are isolated per namespace. PID namespaces are one of the few namespaces that are properly hierarchical, meaning each ancestor PID namespace can see all of the processes of its child namespaces. So they're properly nested, you could put it like this. And there are a couple of other namespaces that are in the works right now. There is a patch set up for time namespaces; this is something I'm going to touch upon in a little bit. Device namespaces were supposed to be a thing. They aren't really, but we sort of have a device namespace right now; I'm going to talk about this in a little bit. And there is talk about an IMA namespace, although this is likely not a real namespace but will be tied to a namespace; Mimi is probably way better equipped to talk about this than I am. So, the one final namespace I want to mention is the user namespace. This is the one that usually not a lot of people know too much about. How many people are more acquainted with user namespaces? Oh, actually quite a few, but not the whole crowd. So, for most of the other namespaces, there is no real privilege separation going on. That means if I create a new mount namespace, network namespace or whatever, I don't get any strong security guarantees whatsoever.
I can still shut down the host, I can still chown files and so on; the system is still fully under my control. And one of the ideas was, okay, if you want to run untrusted workloads in the age of containers, you had better have some way of isolating them sufficiently from the rest of the system so that they cannot easily put it in danger. So why not introduce a namespace that isolates all of the privilege concepts that a Linux system or a standard Unix system comes with? This is sort of the idea, I guess, for user namespaces, or at least how I like to see them: introduce a new namespace that just deals with privilege separation. There are a couple of requirements for the way user namespaces work. First of all, obviously, what carries privilege on a Unix system is the UIDs and GIDs, so you separate the host UIDs and GIDs from the namespace UIDs and GIDs, but in such a way that your userns root ID is privileged over the userns. What this means is you can be UID zero within your namespace, but outside, in the parent user namespace or the host user namespace, it will run as a fully unprivileged UID, like UID 100,000. And there is a mapping established between those two. So within the user namespace you apparently have all the privileges that root has, and it's semantically and syntactically somewhat similar to what root means on the host, but it's isolated. So it cannot affect any resources that are global on the system, for example. At least that's the idea. You also want nesting to be possible, which is why you want it to be a namespace: as an unprivileged user, you should be able to create a new user namespace, within that user namespace you should be able to create another user namespace, and so on, so you can have layers of isolation and layers of nesting. And so this is something I mentioned before.
The userns root ID should not be privileged over any resources it does not own. So for example, if you have a global limit on the maximum number of files that your system can open, userns root in a user namespace should definitely not be able to set this. Only the initial root user should be able to do this. And, a point I mentioned, unprivileged users should be able to safely create a user namespace. And the last point is capabilities, which I mentioned at the beginning: they should all be checked against the user namespace. So if I ask the question, do I have this capability, then what I'm really asking, or should be asking, is do I have this capability in the current user namespace? That model breaks for some capabilities. For a long time, for example, if you asked the question, can I create device nodes, the kernel would check whether you have CAP_MKNOD in the initial user namespace, not in your current user namespace, although that has changed in recent kernel releases. And you introduce the concept of an owning user namespace. That means all of the other namespaces have an owning user namespace, such that if you create a new user namespace and then unshare your network namespace, that network namespace will be owned by the user namespace that you created. So basically, if you ask questions like, do I have CAP_NET_ADMIN to operate on this network namespace, then the kernel will look at what the owning user namespace of that network namespace is, and whether you have the capability in that user namespace. So that's the whole idea. And obviously, all kinds of other resources, like specific files, sysctl files, and so on, can be made per user namespace or they stay global. Yeah, so that's it for capabilities and namespaces. So what has happened? Actually, I don't know when the last talk about namespaces and capabilities was, so I'm covering 4.10 to 4.18.
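To make the "owning user namespace" idea a bit more concrete, here is a small Python model of how such a capability check could be reasoned about. It is only a sketch: `UserNS`, `Task`, and `capable_in` are hypothetical names invented for this illustration, and the walk is a simplified rendering of the rule described in the talk (a capability counts in the task's own user namespace, the creator of a child user namespace is privileged over it, and nobody is privileged over an ancestor), not the kernel's actual code.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass(eq=False)
class UserNS:
    """Hypothetical stand-in for a user namespace."""
    name: str
    parent: Optional["UserNS"] = None
    owner_uid: int = 0  # UID of the creator, as seen in the parent namespace

    @property
    def level(self) -> int:
        # The initial user namespace is level 0; each nesting adds one.
        return 0 if self.parent is None else self.parent.level + 1


@dataclass
class Task:
    """Hypothetical stand-in for a process's credentials."""
    euid: int
    user_ns: UserNS
    effective: Set[str] = field(default_factory=set)


def capable_in(task: Task, target_ns: UserNS, cap: str) -> bool:
    """Does `task` hold `cap` with respect to the user namespace `target_ns`?"""
    ns = target_ns
    while True:
        if ns is task.user_ns:
            # Same namespace: the task's own effective set decides.
            return cap in task.effective
        if ns.level <= task.user_ns.level:
            # Target is not a descendant of the task's namespace: no privilege.
            return False
        if ns.parent is task.user_ns and ns.owner_uid == task.euid:
            # The creator of a child namespace holds all capabilities in it.
            return True
        ns = ns.parent  # otherwise, keep walking up towards the task's namespace


init_ns = UserNS("init")
container_ns = UserNS("container", parent=init_ns, owner_uid=1000)
# Assumed scenario: a network namespace unshared inside container_ns is owned by it.
netns_owner = container_ns

host_root = Task(euid=0, user_ns=init_ns, effective={"CAP_NET_ADMIN", "CAP_MKNOD"})
creator = Task(euid=1000, user_ns=init_ns)
container_root = Task(euid=0, user_ns=container_ns, effective={"CAP_NET_ADMIN"})

if __name__ == "__main__":
    print(capable_in(container_root, netns_owner, "CAP_NET_ADMIN"))
    print(capable_in(container_root, init_ns, "CAP_NET_ADMIN"))
```

In this model, root inside the container may configure the network namespace its user namespace owns, but holds nothing with respect to the initial user namespace, which is the separation the talk describes.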
We can skip stuff if we don't have sufficient time, but actually, I think a lot has happened. So in 4.10, and I think it even started before that, there was a lot of work done by Eric, and I guess by Seth and a bunch of other people, to get the infrastructure in place to enable mounts from non-initial user namespaces, something which hadn't been possible for any interesting file system before. I mean, you can mount a tmpfs or whatever inside a user namespace, but that's not really interesting. A lot of people would be interested in mounting, I don't know, ext4 inside of a user namespace, which is still not possible, but technically the infrastructure is there. You can do it with FUSE, though. So there was a lot of work done in 4.10 around this. Also, Eric added a user namespace owner to the mm_struct so that you can now have sensible ptrace permission checks across exec. This was a security issue that had been around for a long time. I think there were a couple of fixes going in after this as well, but that's actually quite important. Andrei Vagin (I have a good memory for names, so it should hopefully be possible to remember most of the people that did the work, but don't hold me to it) added an ioctl to get a socket's network namespace. Sometimes you need to ask the question: this socket file descriptor I have right now, which network namespace does it belong to? This is infrastructure to actually give you an answer to that question. For 4.11, ah, yeah, we finally got a limit on inotify instances per user namespace, because before, you could technically exhaust the global inotify limit from within a user namespace. This hadn't been accounted for. This landed in 4.11 as a security fix. And Michael actually added new namespace ioctls, infrastructure to query the hierarchy and properties of namespaces.
So for example, you can ask the question, what is the UID of the creator of a specific user namespace, which is pretty helpful, because you can answer the question for yourself: do I have privilege over this specific user namespace? Also, you can get the parent of any hierarchical namespace, which only affects PID namespaces and user namespaces: you give it an FD to a PID namespace, and then, if you have the right permissions, it will give you back an FD to the ancestor PID namespace or user namespace. And there is also NS_GET_USERNS, which is related to what I said before: all namespaces have an owning user namespace, so if you give it an FD to a non-user namespace, it will give you back the owning user namespace. Pretty helpful. And further work, and this is an ongoing topic, I tell you that right away: the infrastructure to enable unprivileged mounts, so mounts from user namespaces. There was more work done there. Yeah, we will see this popping up. 4.12, so an interesting feature. It exposed pid_for_children under /proc/<pid>/ns, which had always been in place inside the kernel. So, oh, how can I make this relatable? Think about the question, what PID namespace are children that I'm going to fork off going to end up in? The trivial answer to this question seems to be, well, in the same PID namespace that I am in. Well, that's not how setns works. If you setns to another PID namespace, you will not yourself change PID namespace, but if you fork off new processes, then those will become members of the new PID namespace that you actually setns'd to. And so pid_for_children and pid, the values under these proc files, will actually be different in that case. And tools like CRIU, so checkpoint/restore in userspace, sometimes need to know the answer to this question when restoring a task.
Yeah, there was also support added for FUSE and PID namespaces. So there is now proper translation if FUSE is run in a PID namespace: FUSE will take care that the PID is actually translated to a valid PID within the PID namespace, and so on. That wasn't the case before. And, I guess it's not super important, but there was also some work done to enable namespace information in perf output. I've never used it before. Actually, when I looked at it, I didn't even know that it had landed, but good to know. 4.13 saw Eric doing some work around, well, bad umount performance. It's actually a bug fix, so probably I shouldn't have put it in, but it actually improved umount performance dramatically. If you had overlapping mount propagation trees, the old umount code could take up to 60 seconds. I think Andrei Vagin discovered this. And Eric refactored the whole umount logic such that it takes down to 0.06 seconds. So that's actually pretty good. And the code is pretty interesting too. And Tejun in 4.13 added an nsdelegate option to allow cgroup delegation, safe cgroup delegation, for the root user. It was always kind of safe for unprivileged users, at least if the system administrator set it up this way, but it wasn't safe, for various reasons, for the root user. Right now, if you mount the cgroup tree with the nsdelegate option, then cgroup namespaces are considered delegation boundaries, so you cannot escape limits. It's pretty useful for, I guess, privileged containers, which you shouldn't run. You should always use user namespaces. 4.14 introduced namespaced file capabilities, actually done by Serge Hallyn, a good dude. That was quite a bit of work, and something that we had wanted for a long time: unprivileged file system capabilities. In general, file system capabilities weren't safe before. So let me come up with an attack. Imagine you are allowed to create a user namespace as an unprivileged user.
So you create a new user namespace. You set a file system capability, whatever, CAP_SYS_ADMIN, on an arbitrary binary that you just wrote. You, in another terminal, execute that binary on the host and, well, you're screwed. So it wasn't safe for a long time. Now, since 4.14, it is per user namespace. Basically the kernel records a root UID, which is the UID that root inside of a user namespace needs to be mapped to. And if it detects a mapping for that specific root UID, then it will grant you the right to execute that file with elevated privileges. So for example, if you write a file system capability with root ID 100,000 and you try to execute the file on the host, the kernel will look at this and see, it's not UID zero, so I'm not granting you the right to execute that file with elevated privileges. If I go into a new user namespace and establish a mapping such that root ID 100,000 is mapped to UID zero inside of the user namespace, and I execute that file, the kernel will see, oh yeah, there is a mapping for this, it's fine, you can execute it with elevated privileges. That's sort of the gist of how this works. 4.15, this was actually worked on by me in this case. We have bumped the limit of allowed user namespace mappings from five to 340. The 340 limit is not arbitrary. It's actually enforced by the kernel and by the structure that is used, which needs to fit into a cache line. 340 is basically the layout of the structure such that it doesn't exceed the cache line. If you go any higher, it won't work anymore. This is useful mainly for the case when you run a container that has ID mappings specified. You sometimes want to be able to write files to your home directory with your own UID and GID, but all other UIDs and GIDs should be isolated, so the host UIDs and GIDs should be isolated from the container UIDs and GIDs.
But you punch a hole into the map that you established by saying, for example, user ID 1,000 on the host is mapped to user ID 1,000 inside of the user namespace. You could only do this for like three or four UIDs, and then you were running out of mappings. That was just the limit for a long time. Now you can do it with 340 mappings. It's actually interesting: the overhead is negligible. For five mappings, you're looking at 145 nanoseconds mean stat time for a file; for 340 mappings, you're up to 164 or something. So the performance impact is also quite okay, I guess. 4.16 saw some new infrastructure implemented to query network namespaces, or peer network namespaces, by passing along a network namespace identifying property. So RTM_NEWLINK, RTM_DELLINK and RTM_SETLINK basically allow you to pass along a property for a network namespace, and you can operate on that network namespace without having to setns into it, which is pretty performance relevant, actually. And in 4.17, we finally got to make unprivileged FUSE mounts work with IMA. This was work that was done by Eric, also in conjunction with Mimi, right? There were some questions about how to, I guess, validate unprivileged FUSE mounts with IMA, right? And it fails by default right now, but you're probably way better equipped to talk about this than I am. It was one of the final blockers to actually make unprivileged FUSE mounts from non-initial user namespaces work. We also fixed the longstanding bug whereby bind mounts of /dev/pts/ptmx to /dev/ptmx did not work. You could have symlinks, you could have a separate device node, but if you tried to do a bind mount, the kernel would just not recognize that it's basically the same mount. So we added logic to make this possible now. This is relevant for the case where you have an LSM like AppArmor that tries to, for example, restrict access to certain files via symlinks.
So a bind mount is a way out of this. And also, this is the device namespace thing I talked about: the uevent injection work. We made it possible to inject uevents into another network namespace. So for example, let's say you plug a USB device into your computer and you say, okay, this is going to be safe to delegate to a container. And then you inject it into a container, which you can do using mount propagation and so on. But for the container, it actually doesn't appear as a proper device, because it never gets a uevent, because uevents are restricted (technically not, I'll get to this in a second) to the initial user namespace. So what we made possible is this: you get the uevent on the host, you parse the uevent, you strip off the sequence number, and you inject it into the kernel. The kernel will append a new sequence number, and then, if you have the right permissions, which is CAP_NET_ADMIN in the owning user namespace of the network namespace, it will relay it into the other network namespace, at which point udev running inside of a container, for example, will get notified, oh, there's a new device that just showed up. So it's, as we like to call it, device namespaces from user space. And 4.18 finally saw the finalizing of the infrastructure to do unprivileged mounts, or as I like to call it, getting away with regressing user space, because this is where we changed what I said before: CAP_MKNOD was always checked against the initial user namespace. So if you tried to do a mknod, the kernel would look, do you have CAP_MKNOD in the initial user namespace? You don't, so, no, it's not possible.
But right now, if you mount a file system, so if you do a mount -t tmpfs tmpfs /mnt inside of a user namespace, the kernel will record which user namespace has been the mounter of this file system. Say it has been an unprivileged user namespace; tmpfs mounts are fine. If you then do a mknod, the kernel will now check, do you have CAP_MKNOD within that user namespace? The answer will be yes, and you create a device node. But the way unprivileged mounts work is that, at the same time, when you mount the file system as an unprivileged user inside of a user namespace, the kernel also sets the SB_I_NODEV flag on the superblock, which means that for any device nodes that you had prior to mounting this file system, or that you create after mounting this file system, you will get an EPERM when you do an open. Given how container runtimes work, they always assume that if a mknod is not possible, then you should do a bind mount, but if the mknod succeeds, then it's a usable device node, which is not true anymore. So actually, systemd services in user namespaces and also a couple of container runtimes have been regressed by this, but I talked to Eric about this and he said it's fine, it's probably not a lot of users. It seems okay, nobody has complained a lot so far, and it's actually something that needs to be done at some point anyway. The fun part is just that I talked to Lennart, and Lennart refused to fix it in systemd because he said it's a kernel regression, so you can choose where the problem lies, I don't know. Also, it enabled unprivileged FUSE mounts, finally. This has been a lot of work done by Eric, and a lot of work also done by Seth and a bunch of other people that tried to upstream it, because Seth didn't have the time.
Right now, you can mount FUSE from non-initial user namespaces without any setuid trickery or something, so that's one of the few file systems. I'm not sure that it's going to be followed by a lot of other file systems, because VFS security is different from actual file system security, meaning the VFS can do all of the permission checks it wants, but unless the file system maintainer (and please yell at me if I'm wrong) gives you a guarantee that we are safe from attacks in the face of a malicious file system image, this is probably not going to happen, and I'm pretty sure that most file system maintainers would not feel confident enabling unprivileged mounts. And we did some work around uevent namespacing, because we figured out that this was broken, like massively broken. Let's say you plugged in a device. Basically what happened is that the uevent got yelled into each network namespace on the whole system, but if the network namespace was owned by a different user namespace than the initial user namespace, the UIDs and GIDs that this event came with were not fixed up, meaning if you were running udev inside of a user namespace, it would just discard the events, so it was a slew of totally useless events. It's also kind of funny insofar as the list of uevent sockets was global for a long time before we did this work, meaning the kernel took a lock, walked the list of all network namespaces with uevent sockets, and held this lock for as long as a uevent was being sent into each network namespace. Then, for each network namespace, it walked the list of multicast sockets that were listening in that network namespace, and so on. So there was actually quite a lot of pointless work that the kernel did for a long time, and that's gone now.
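Looking back at two of the changes covered above, the ID mappings with a hole punched for UID 1,000 (4.15) and the root ID recorded in namespaced file capabilities (4.14), a small Python model can make the arithmetic explicit. This is only a sketch under stated assumptions: the helper names are hypothetical, the xattr layout follows the revision 3 `security.capability` format (a magic word, two permitted/inheritable pairs, then a 32-bit root ID, all little-endian), and the check is deliberately simplified to "does the recorded root ID map to UID 0 in this namespace", whereas the kernel's real logic also considers ancestor namespaces.

```python
import struct
from typing import List, Optional, Tuple

# One line of a uid_map: (first ID inside the namespace, first ID outside, count)
Extent = Tuple[int, int, int]

VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_REVISION_3 = 0x03000000  # the xattr revision that carries a root ID


def to_host(ns_id: int, id_map: List[Extent]) -> Optional[int]:
    """Translate an ID as seen inside the namespace to the ID outside, if mapped."""
    for inside, outside, count in id_map:
        if inside <= ns_id < inside + count:
            return outside + (ns_id - inside)
    return None


def from_host(host_id: int, id_map: List[Extent]) -> Optional[int]:
    """Translate an outside ID to the ID seen inside the namespace, if mapped."""
    for inside, outside, count in id_map:
        if outside <= host_id < outside + count:
            return inside + (host_id - outside)
    return None


def rootid_of(xattr: bytes) -> int:
    """Extract the root ID from a security.capability value (0 for older revisions)."""
    (magic_etc,) = struct.unpack_from("<I", xattr, 0)
    if (magic_etc & VFS_CAP_REVISION_MASK) != VFS_CAP_REVISION_3:
        return 0  # revisions without a root ID belong to the initial namespace's root
    return struct.unpack("<6I", xattr)[5]


def caps_honoured(xattr: bytes, id_map: List[Extent]) -> bool:
    """Simplified model: honour the file caps only if the root ID maps to UID 0 here."""
    return from_host(rootid_of(xattr), id_map) == 0


# The initial user namespace maps every ID onto itself.
HOST_MAP: List[Extent] = [(0, 0, 2**32 - 1)]
# Container map with a hole punched so that UID 1000 is the same inside and outside.
CONTAINER_MAP: List[Extent] = [(0, 100000, 1000), (1000, 1000, 1), (1001, 101001, 64535)]

# Example xattr: revision 3, effective bit set, CAP_SYS_ADMIN (bit 21), root ID 100000.
DEMO_XATTR = struct.pack("<6I", VFS_CAP_REVISION_3 | 0x1, 1 << 21, 0, 0, 0, 100000)

if __name__ == "__main__":
    print("honoured on host:", caps_honoured(DEMO_XATTR, HOST_MAP))
    print("honoured in container:", caps_honoured(DEMO_XATTR, CONTAINER_MAP))
```

Note how punching a single hole already costs three map entries, which is why the old limit of five mappings ran out after a handful of passthrough UIDs.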
The current patch sets that we see floating around right now are a bunch of interesting ones. There is the idea to introduce a time namespace, which is obviously a big thing for CRIU for various reasons. Andrei Vagin has done a lot of this work together with someone whose name I unfortunately don't remember right now. I think maybe it lands, but it's obviously going to be a long discussion, I think, given that time is something that shows up as relevant very early on in the boot process, so changing how time works is a big thing. Thomas Gleixner has now also commented on this patch set, so we'll see where this leads. Aleksa worked on a patch set, revived a patch set by, I guess, David Drysdale and Al Viro, to restrict path resolution, so it's a little bit similar, at least, to the unveil idea of the BSDs, I guess. You have a bunch of AT_ flags, AT_BENEATH, AT_NO_PROCLINKS, AT_NO_SYMLINKS, AT_THIS_ROOT, AT_XDEV, that basically let you specify how you want to resolve a path, which is a big security feature, for example for container runtimes, but it has uses beyond that. We'll see where this lands. I think that Aleksa right now has changed the idea to, why not make it a separate resolve syscall, and I don't think that's going to fly, but some version of this patch set will likely land, because a lot of people actually want this. The links are always where the discussion is taking place, so feel free to comment. And this is going to land in 4.19, no, in 4.20, actually: querying peer network namespaces, again, by sending along a network namespace identifying property.
If you put this together, RTM_GETADDR and RTM_GETLINK, and retrieve information for network devices and their addresses for, let's say, a thousand network namespaces, it actually cuts in half the time that you would need if you did it with setns, that is, attach, retrieve the addresses, and then setns back to the host namespace. So that's pretty good. And there's obviously, David's around here, right? Ah, yeah. David's done an incredible amount of work around the new mount API. That's pretty cool, and hopefully it's going to land soon. There are a couple of discussions still to be had. It's obviously a big change, and a lot of people have a lot of opinions on this, but it allows you to do really nice things. The basic concept is: get me an FD for a new mount point, configure the mount point, and then apply it, which is really nice. You can also send around file descriptors for mount points and so on, which also means you can probably make unprivileged mounting safe, if you wanted to. That's really promising, actually. And future patch sets: this is mostly stuff that I've been thinking about or working on, just because I don't know what a lot of other people are working on right now. I think Eric is buried in refactoring signal code at this point, so there's not a lot of namespace work coming from him at the moment. One thing that I've been lacking for a long time, and I've been talking to David about this, is recursive read-only bind mounts for the old and the new mount API, because you cannot do this right now. So right now, let's say, in user space, you want to bind mount your whole sysfs mount tree to a different location recursively and make it read-only. So you do mount -o rbind,ro /sys /mnt. That won't do what you think it does. It won't make the mount tree read-only, actually.
It's the same if you do a remount with rbind and ro on a whole mount tree: it also will only remount the topmost mount read-only. It would be nice, especially for system managers or init processes like systemd, if you could say, rbind this whole mount tree read-only to a different location, atomically, in one shot, so that we're on the safe side, and also the same for remount. But it's obviously tricky to figure out how to do this correctly. I have a patch set for this, which I have been sitting on because I want to make sure that it works, because I fear that if I send it out and Al doesn't like it, then this was my one shot. So yeah, and the new mount API hopefully will apply, I guess, all mount properties recursively right away, but David is better equipped to talk about this. I've been looking into making the umount syscall reversible, together with Ram Pai (he's probably not here right now, he's at the KVM Forum), by reusing a concept that Eric once introduced, which is basically tucked mounts. So right now, if I'm not mistaken, and yell at me if I'm wrong, given mount propagation you can get into a state where you do a mount and then a umount and you think you get back to the original state of the mount tree before you did the additional mount, but that's actually not the case. You can do a umount, but your mount tree looks totally different afterwards. With tucked mounts, at least that's Ram's idea, you can make it such that each umount gets you back to the prior state of the mount tree. So we'll see if that actually works out; there's still some discussion to be had around this. I have a patch set to expose mount propagation in the statfs syscall, such that you can do a statfs on a mount point and then check for MS_PRIVATE, MS_SHARED, or MS_SLAVE in the flags field, because you cannot do this right now. You need to parse /proc/[pid]/mountinfo, which is kind of annoying.
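For reference, this is roughly what the mountinfo parsing mentioned above looks like today. The Python sketch below relies on the documented mountinfo layout, where the optional fields between the mount options and the `-` separator carry tags such as `shared:N`, `master:N` (a slave mount), and `unbindable`; a mount with none of these is private. The function names `parse_propagation` and `read_mountinfo` are made up for this example.

```python
from typing import Dict, List


def parse_propagation(line: str) -> Dict[str, object]:
    """Derive the propagation type of one mount from a single mountinfo line."""
    fields = line.split()
    # Fields 0-5 are fixed; optional tags follow until the "-" separator.
    # Searching from index 6 avoids mistaking a root or mount point named "-".
    sep = fields.index("-", 6)
    tags = fields[6:sep]
    shared = any(tag.startswith("shared:") for tag in tags)
    slave = any(tag.startswith("master:") for tag in tags)
    unbindable = "unbindable" in tags
    return {
        "mount_point": fields[4],
        "fstype": fields[sep + 1],
        "shared": shared,
        "slave": slave,
        "unbindable": unbindable,
        "private": not (shared or slave or unbindable),
    }


def read_mountinfo(path: str = "/proc/self/mountinfo") -> List[Dict[str, object]]:
    """Parse every mount visible in the caller's mount namespace."""
    with open(path) as handle:
        return [parse_propagation(line) for line in handle if line.strip()]


# Sample lines in mountinfo format, used here only for illustration.
SAMPLE_SLAVE = "36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue"
SAMPLE_SHARED = "22 27 0:21 / /sys rw,nosuid,nodev,noexec,relatime shared:7 - sysfs sysfs rw"
SAMPLE_PRIVATE = "99 27 0:45 / /mnt rw,relatime - tmpfs tmpfs rw"

if __name__ == "__main__":
    try:
        for mount in read_mountinfo():
            print(mount["mount_point"], "shared" if mount["shared"] else "not shared")
    except OSError as exc:
        print("mountinfo not available here:", exc)
```

Having to open and tokenize a proc file for every query, and needing /proc mounted at all, is the annoyance the proposed statfs change is meant to remove.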
That's the only way to get this type of information right now, as far as I know. And this is something I have put to the side, and I'm not sure if we're still going to need it, but basically the idea is to introduce two new ioctls to the namespace ioctls that we have right now. One is NSInit, which allows you to answer the question, is this the initial namespace, which might only make sense for PID and user namespaces, because I don't think there is a nice way to do this reliably right now, especially when you don't have /proc mounted. And also NSAccess, which would basically be: given this file descriptor to this user namespace, and given a file descriptor to this file or this device, do I have privilege over this file or device inside of this namespace? Yeah, I guess I have run over time, I hope not too much, but that's basically it for the namespace and capability subsystems, so happy to take questions. Yes. Michael. Hi, I'm Michael Kerrisk. Just curious, what is the use case for NSInit? The NSInit? Yeah, NSInit. Yeah, for example, it's mostly useful for the user namespace, when I want to determine whether I am in the initial user namespace, because then I can basically infer what operations I can or can't perform. If I'm in a non-initial user namespace, I'll be restricted in a lot of things. If I know I'm in the initial user namespace, I know that I can basically do everything I want. Oh, now I'm scared. David Howells, two comments. First, you're doing recursive read-only bind mounts. If the open_tree system call comes in and we add mount_setattr, which will have a recursive flag, you can just do open_tree, then mount_setattr recursively to change just the read-only flag, and then mount it, which gets you the read-only bind mounts. Yeah, so you're saying this will be in the new mount API? Once we've added the mount_setattr syscall, which is currently lacking. Yeah. It's something we need to add, but the two bits on either side exist. Yeah.
You can just clone that. Currently, you can do bind mounts by doing open_tree with OPEN_TREE_CLONE, which clones the tree, and then you move the new mount somewhere else. But you'll be able to do a step in the middle which changes just the read-only flag on all of the mounts. Because with the bind mount thing at the moment, you have to set all the flags on everything, whereas this will have a mask. Yeah. You won't actually need an MS_REC_RDONLY mount flag, because there'll be another way to do it. Oh, you say I won't need that. You won't need that. Yeah, in the new mount API. In the new mount API. Exactly, yeah, hopefully we don't. This is basically, sorry, I should have been clear about this, this is a crutch; I didn't talk about this. So MS_REC_RDONLY, why the hell do we go with MS_REC_RDONLY? Why not just MS_REC | MS_RDONLY? Well, not regressing user space. My argument in the initial discussion was that anyone in user space who specifies MS_REC | MS_RDONLY probably wants it to apply recursively and wants this to be read-only, and they're not getting what they want right now. So failing right now would probably make the world a whole lot safer, but it would also break a lot of workloads. So the idea was, okay, we cannot do this. We can maybe print a warning, maybe, but going forward, the only way to do this is by introducing a new flag, MS_REC_RDONLY, which has its own problems, which is why I've been sitting on this for so long. We have, how many mount flags, we have 31 bits, but all of them have been used, even though, I guess, some five flags or so are actually mount-internal, the superblock flags. Internal flags, but they're actually listed in the UAPI.
Yeah, they're not in the new mount API, I saw that, but in the old mount API you have, for example, MS_NOUSER and, I don't know, MS_SUBMOUNT or something exposed to user space. MS_NOUSER is the flag I'm reusing, because you get EINVAL anyway if you pass it right now. Yeah, but with the new API, when we eventually add mount_setattr, you'll give it two parameters, one of which is the set of flags you want to set or clear, and the other a mask to say which of those flags to apply. Exactly. And there'll be a thing to say, do this recursively. Yeah. The other comment I wanted to make is that you had a thing to get me the namespace of that socket. Yeah. We also need a "get me the namespace of that file", so you can ask what the namespace of a particular file that's on a network file system is, so you can do some operation in that network namespace. Yeah. Something we will need to add at some point. Yeah, it's basically a generic ioctl, in essence, I guess. Something like that. Well, it might not be an ioctl, because you have to be able to do it on, like, a symlink. Yeah. Any more questions? I guess we're running late. Okay, so in this case, let's thank the speaker. It was a really interesting talk.