Hi, everyone. So I'm Stéphane Graber, and this is Christian Brauner. We both work for Canonical on the LXD team, so we do container development in both user space and kernel space. And today we're going to be looking at containers: how you can build them, what all the components are that you need, and how everything fits together. So first thing, what are containers? That's always a bit of a funny topic, because it's kind of weird on Linux, frankly. The general concept is that containers are effectively isolated systems: they behave like virtual machines, but they share the kernel of the host. That's the general concept of containers. Where it gets weird is that there is no such thing as a container in the Linux kernel. You can go and look as much as you want. Maybe someone mentioned "container" somewhere in a comment in there, but there is no container struct, there's no container anything, really, in the kernel. We instead have a lot of different components that user space will pick and put together and hope that they end up with something that's a container. You really can't get a single handle on something that is a container, at least not right now. Yeah, it's not a container, it's going to be some kind of ad-hoc identifier thing. Yeah, that's been a bit of a recurring topic. There have been some people who want to be able to say, "I've got this process, what container does it belong to?" There's really no such thing, so people have been coming up with their own solutions. The audit people have been pushing pretty hard for a while to try and get such an identifier, and it's going to end up being an audit-specific identifier. Looking at some of the common components for containers, we've got them in three different chunks. The first is mostly for isolation. The most obvious thing is you want to have your own file system, be it another distro or some container image or something. You do that with either chroot or pivot_root.
We'll go into more detail on all of those afterwards. Then you use namespaces to get your own hostname, your own mount tree, your own view of processes, your own network stack, IPC, your own view of the cgroup hierarchy, and even your own users. The user namespace is a bit weird because it is technically part of isolation, but it is also part of security. It is effectively the main security feature that we're using for containers these days, for those not doing privileged containers. Anyone who attended our talk back in San Diego probably remembers what we think of privileged containers. Don't use those. Now, at the security layer, there are a few things we can do. One of the things that we often use, even if only as a safety net for unprivileged containers, are LSMs: so that's AppArmor, SELinux, SMACK, or your own. We use seccomp to block a number of pretty dangerous syscalls, especially those that have been linked to security issues. And you can use capabilities to drop a set of capabilities for the container, or even retain only a very specific set of capabilities to further isolate your container. And then you've got resource control, which, again, is not separate from all of those: that's what cgroups do. It's mostly about avoiding denial-of-service type attacks, but it's still pretty useful. And basically, all of these components are optional, because a container is a user-space fiction, so you can choose what you want to use to define a container. But since we have a particular focus, when we talk about containers, we usually treat them like you would any standard Linux system. When we say containers, we mean you boot an init system and a normal Linux distribution comes up. So for us, a container is when you combine all of these features, essentially. All right. Yeah, and because of LXD, what we use is literally all of those.
So in this tutorial, we're going to go one by one through those, show you some examples of how they work, and also some of the issues they might have, to give you an idea of how you can effectively assemble your own containers. But first, this. So yeah, there are a few things we need to say. In general, you don't want to rely on privileged containers. We've talked at length about that back in San Diego. You also should not assume that you can look at the code of an existing container manager and just optimize it because it looks like they're doing something slow, like setting up namespaces in two chunks instead of doing one big chunk. There are reasons why sometimes you need to do such weird things, and if you don't, you might end up with a few CVEs on your project that you don't really expect. If it's complicated, there is usually a CVE for it. Yeah. There are a lot of interactions among the components we're going to be showing that can also be tricky. That includes interactions between namespacing order, seccomp policy, and some ways of bypassing them, especially when you combine things with ptrace, which always makes things so much more fun. Procfs, sysfs, all the permissions on those file systems and how those get set up, again depending on what order you've put everything together in. Most of the existing container runtimes have learned that the hard way. Some of this stuff we know because we implemented the kernel features, but some of it we know because we had to deal with a critical CVE we had to fix in a rush because, whoops. So you probably don't want to be like that, and it's much easier to use something that already exists. That being said. Right. So I think this was one of the first bits we talked about: file system isolation.
Usually, again, if you think about a container as being a fully separate system, then you usually also want it to have a separate root file system, right? The same way you have it for a VM: you have a VM image, and when you want to start a container, you want it to have its own view of the file system hierarchy. And there are different ways; well, the grandfather of all this, or the ancestor, is chroot. It basically lets you switch your view of the file system hierarchy, such that you have the illusion that, for example, if you chroot into a certain directory, this directory now becomes your root file system, your slash, essentially. But that's horribly unsafe, and I think Stéphane is going to demo it in a little bit, because you can easily break out of it. The advanced variant is pivot_root, which most container runtimes use. It's also used outside of containers, for example to switch onto the real root file system when the system is booted. And it prevents a bunch of the escapes that are trivially doable with chroot, though it requires that you are in a new mount namespace to do it correctly, otherwise you're going to mess with your system, whereas chroot doesn't require you to have a new mount namespace. Yeah, and you get your own private file system hierarchy, your own view of slash; that's basically what it is. And you should go on and demo it. Yeah, there's one restriction for those who want to use pivot_root: if you are trying to use it on top of a RAM disk, like in an initramfs, that's not going to work. Your target, your new root, cannot be on the RAM disk. You also cannot do it if your root is MS_SHARED, because the implications of things then vanishing underneath you are problematic.
So to show this one, what I'll be doing, as an unprivileged user (I'll go into more detail about that in a sec), is create a new user namespace with a new mount namespace and a new pid namespace, remap my user to root, and fork. So now I'm root, even though I'm really not, just inside the user namespace. We've got a directory here that's an Alpine Linux root file system, so I can do the normal chroot and spawn /bin/sh. That worked fine. The permissions are a bit off, but that's because of the user namespace. That seems fine. Now let's mount proc. OK, that worked. OK, I don't like my chroot anymore. Let's get out. And we're back on the host. So that's chroot for you. Let's try that again, but this time with the exact same unshare, using pivot_root. This time I need to create a mount point for my new root, so you just bind-mount it on top of itself. All that does is add a new mount entry; it's a kernel-enforced restriction for pivot_root. Right. Then you cd into it, then you pivot_root it onto itself, then we exec /bin/sh. OK. And say I want to get out like before. Oh, look at that, that doesn't work anymore. Also because I did a typo, hold on. So that will work, but instead of getting back on the host, I just got back into the container itself. The root in this case is actually pointing at the container's root and doesn't let you escape to the outside. Oh, and if you have any questions, for example if you don't know what any of these unshare options do or mean, you should ask. Well, I will go through those in the next slide, but you should ask questions, is what I mean. So that's file system isolation. Let's switch to the next one. Go. Right, namespaces. I mean, who hasn't heard of namespaces? Finally! I was excited. That's not true, I knew that. You just wish you didn't. Yeah. But I think this is really the first time I've asked that and there was no hand up. So, great.
So, namespaces. I like to quote Eric Biederman, who is one of the main authors of a bunch of the namespaces, who said that namespaces are a way to get around design mistakes or inflexibilities of Linux, and that they shouldn't exist. Which he did say. But yeah, they are one of the main concepts that we use to build containers, and most of them are in some way concerned with restricting certain views, access, or information from you. So for example the UTS namespace, which I think was the first namespace that was ever done, lets you change the hostname, which is obviously kind of useful, especially if you think about booting a system container: you probably want to set up your own hostname and so on. So you can have a different hostname in a UTS namespace than you have on your host. Sorry? Yeah, I thought that was a question. And we have a bunch more. We have seven so far, and, one, two, three, four, five, six, seven, I can still count, there is an eighth one coming up, again from the CRIU corner of the world: the time namespace. The UTS namespace, obviously, as I said, isolates the hostname. The mount namespace isolates, or restricts, your view of the file systems that are mounted on your system, and also, with some exceptions, gives you a private mount table. Which means that if you mount a file system, let's say a tmpfs, because that's possible, if you mount a tmpfs in a new mount namespace, this mount will show up in your mount namespace here, but you will not see it outside of that mount namespace, provided you haven't set up a shared mount point. Because mount namespaces are horribly complex, due to the fact that there is something like a shared mount point. Who knows what a shared mount point is? OK, so for all of those who don't know. Of those, only one of you actually enjoys knowing about them. Oh, you think it's useful, right? They can be useful. They're also a headache.
So when you think about namespaces as isolating, or giving you two distinct views, you usually don't expect that there is an easy way to punch a hole through both of them. But that's exactly what shared mount points do. If you have marked a mount point as shared and you unshare a new mount namespace, and then you mount something on top of that shared mount point, or underneath it, then it will suddenly be visible in both namespaces. So obviously, what you need to make sure of when you set up a container, if you don't want to leak information from outside into the container or from inside the container to the host, is that all of these shared mount points have been remounted such that they are marked as private, so that the two namespaces cannot affect each other. So yeah, they're horribly complex. The user namespace is one of the most essential namespaces, because it's the only one that is really concerned with isolating the privilege concept on Linux. So for example, if I unshare a UTS namespace or a mount namespace, nothing changes in what I can actually do to the system. I might not have access to certain aspects of the system, such as certain mount points or file systems, but in general it doesn't stop me from calling sudo reboot and then rebooting the whole system, for example. Because if I'm root, I'm still really root: if I have a capability, I have that capability for the whole system. The user namespace isolates both UIDs and GIDs, so the most basic privilege concept on Linux, and capabilities. So suddenly, if you unshare a new user namespace and you write an ID mapping, then your ID inside of that user namespace and outside of it have different meanings. Let's say I'm UID 1000 on the host. Now I create a user namespace and I set up a mapping that maps my UID 1000 to something completely unprivileged on the host, let's say 1,200,000.
So that's my new UID, and my UID 1000 doesn't get mapped to anything privileged, which means I don't have any real privileges at all anymore. But from inside the container, I can still make it look like I'm root. So if I run `id -u` inside the container, I will see 0, whereas on the host my process is actually running with a completely unprivileged UID. So when I break out of that container and I'm now on the host, nothing will happen; I can do nothing at all, essentially. In the same way, with user namespaces, capabilities are now charged against my user namespace and not against the host anymore. Meaning if I, for example, have CAP_SYS_ADMIN, then I have CAP_SYS_ADMIN in the user namespace, and not CAP_SYS_ADMIN on the host. So it's vitally important to use user namespaces, because they give you a really massive amount of security. And there are more: network namespaces isolate your view of network devices, IPC namespaces give you private inter-process communication, and cgroup namespaces isolate your view of the cgroup file system. But I think the best way to illustrate all of this is if Stéphane does it. Yeah, and the other thing we can mention is that you can set up those namespaces at process creation time through clone, you can set them up afterwards by unsharing namespaces, which is mostly what I'll be doing from the shell, and you can join them using setns. So you've got ways of really messing with them, and they are visible through proc, so for any process you can see what namespaces it's in. So if we look at that. The first thing we can look at is the unshare command, which covers most of what the kernel unshare syscall lets you do. It lets you unshare all the namespaces we just went through. It lets you fork, which is required for the pid namespace, because when you unshare a pid namespace, you don't want your process's current pid to change; that would cause a lot of problems.
So only its children are going to be in that pid namespace; that's why you need to fork. The -r option I used earlier is literally so that my current user, which is some random number on that machine, gets mapped to zero inside the container, so that you're root. And the rest is to control mount propagation and setgroups restrictions. So we can look at my current process's namespaces. You can list /proc/self/ns and you'll see the inode of each individual namespace in there. For now we'll just remember that 837 is the end of the inode of the user namespace. And we can unshare: so unshare the user namespace, map root, and fork. And you see that it's now 123, so the user namespace has changed, the rest hasn't. Now we can unshare a mount namespace, which by itself doesn't really get you that much, other than: now, even though `id` tells me I'm root, I'm not real root at all, but I should be able to mount random stuff. Only trusted file systems, not everything, obviously, because that'd be terrible for security. Next is unsharing a network namespace, at which point we can see only a loopback device in there. And lastly, let's do a pid namespace, and fork, because it's the pid namespace, and mount proc, and look at our process list. And my shell is now PID 1 in that namespace. Say I want to change my hostname. Well, that's not gonna work. User space is a bit confused sometimes, because it says you must be root to change the hostname. Well, I am, just not root enough. So let's unshare a UTS namespace, then try it again, and then spawn the process. So that's really most of the namespaces at work. unshare is a pretty nice command line tool to play with those, especially because you don't even need privileges to do any of that. Yeah, could you do me one favor and start an LXD container, and then show one of the processes that's root inside running as your user from the host? Well, I can do that with just unshare. Yeah.
So if I unshare a user namespace, remap, and fork: one thing I can do first is just touch a random file. So I'm touching blah in /tmp, but now if I get out and I look at who owns it, it's my own user, because it's just my own user that's mapped to root. Same thing: what's my process ID? OK, so it's that one. Let's switch to another terminal and just grep for that process. And we see it's running as my user. And if you do `id -u` inside of that unshare? Oh, from inside, sure. So from inside, /tmp/blah is owned by root, and as far as `id` is concerned inside the container, you are root, but both user and group are mapped to my normal, unprivileged user. So this whole mapping concept, when you talk about it theoretically, always sounds very complicated, but that's basically it: it pretends that you're root inside of that user namespace, but from the view of the system as a whole, you're just an unprivileged user. Right, one of the classic security features of Linux that predates containers, probably by a long shot, is seccomp, and most of you are probably familiar with this as well, I guess. Seccomp allows you to restrict what syscalls a given process is allowed to make. Usually for unprivileged containers, so containers using user namespaces, the security given to you, or guaranteed by the user namespace itself, is for the most part sufficient. So you would not necessarily need seccomp, but for good measure, we still do it for a couple of syscalls, right? open_by_handle_at and a bunch of other crazy ones. Yeah, I was gonna mention our favorite, open_by_handle_at. That one is always great, because if you've got a privileged container, it gets you the great property of letting you, I think you need to pass an FD of the path you want, and then you can open a path relative to that. The problem is it lets you cross the pivot_root boundary.
So say you open slash in your container, and then you ask for a bunch of dot dot slash, dot dot slash, down to etc shadow. It's going to get you a handle onto the shadow file of the host, so long as the file system backing the container is the same file system that's backing the host. So that was pretty bad. There was a CVE against Docker for that particular one. So I think pretty much everyone blocks it in their seccomp policy at this point. At least those that are using privileged containers, for sure. It does not apply to unprivileged containers: if you use a user namespace, that attack is just not a thing. Yeah, and you can usually decide between a whitelist and a blacklist, and, I mean, you can use very basic seccomp, which doesn't allow you to do a lot of fancy stuff, but seccomp also has a filter mode, and there's a nice user space library that lets you interact with seccomp, and then you can do fine-grained syscall filtering. So, for example, the easy thing to say is "I don't want my users to be allowed to do mknod calls". But in general, maybe you want them to be able to, for example, create sockets or pipes, which you can do with the mknod syscall, just not certain character devices or certain block devices. So you basically want to tell the kernel: don't block all mknod syscalls, only block mknod syscalls that have specific arguments. And seccomp filters allow you to do that, as long as the argument is passed in a register and is not a pointer. If it's a pointer, then you cannot filter on it, which means you cannot, for example, say "restrict the mount syscall, but only if the path starts with this prefix". That's not possible with seccomp, and for good reason: we have LSMs for that, for the most part.
So you can have fine-grained filters: for mknod, you can tell seccomp to block mknod syscalls only for block devices and character devices, but allow pipes and sockets. That's also something that we do. For unprivileged containers, we have even expanded seccomp quite a bit. A good friend of mine has written a patch set that allows you to intercept syscalls and delegate the decision of whether or not that syscall is supposed to succeed to user space. So you register a filter, say "intercept mknod syscalls for all character devices", and then the kernel traps that syscall. That message can be forwarded to a more privileged user space process, in this case usually the container manager. The container manager can then inspect the arguments of that syscall, make a decision, and tell the kernel either that this syscall is supposed to succeed or that it is supposed to fail. Now, the crucial point, especially for unprivileged containers, is that any mknod syscall will normally fail, right? Because imagine you could create block devices or character devices inside of a container: then you could, I don't know, create some random character device and get access to all of the host memory. I mean, it's easy to crash the system with this. So mknod doesn't work; the kernel will not allow you to create device nodes. But often, especially when we boot system containers, or I guess any container in general, we need a set of devices: /dev/null, /dev/full, /dev/zero, /dev/random, /dev/urandom, /dev/console, because user space expects these to be available in order to work. What we usually do is bind-mount them in from the host, since we already consider them safe; there is no known attack vector through these devices. But we have to bind-mount them in, because we cannot create devices inside unprivileged containers.
So the seccomp notifier lets you get around this: you can intercept a syscall, the process making the syscall is blocked, and you can emulate the syscall in user space as the container manager, which is usually a more privileged process. So for example, you can go into the file system of the container and create the device node there. If that is successful, you then tell the kernel: OK, I've succeeded in emulating the syscall, please report back to the process that the syscall actually succeeded. It's a very powerful mechanism that even lets you expand what you can do with unprivileged containers, in a safe way, I would say. Okay, so let's play with seccomp a bit. All right, let's first quickly set up a namespace: we want a user namespace and a mount namespace, map root, and fork, okay. So I can mount stuff right now, because I've got a mount namespace, I'm root in the user namespace that owns the mount namespace, and so I can mount stuff. Now I've got this piece of code here, which is using seccomp. It sets up a seccomp filter which catches the mount syscall and has it blocked, returning ENOANO as the error. So that ends up being a seccomp binary, and the binary execs bash at the end, so I can just run it. Now I'm in a sub-shell that's got that profile applied. Let's try to mount, and there we go: it's now being blocked and it's returning the weird error code that Christian came up with. Why, isn't that very common? No. So that's seccomp, the most common use: if you're running a privileged container, you would obviously block any of those syscalls we mentioned earlier that are not particularly nice, yeah. Yeah. [Audience:] Then you haven't blocked them all.
I know, for this demo I obviously did the easy thing and didn't even go through libseccomp, I just wrote a bit of BPF. But we have users, for example Chromebooks are a classic example: they have an ARM 64-bit kernel running an ARM 32-bit user space and other crazy stuff. Yeah, which then runs an ARM 64-bit VM. You can stack user spaces, so something like an ARM 64-bit host kernel with an ARM 32-bit user space, that runs an ARM 64-bit VM, that runs an ARM 32-bit user space, that runs an ARM 64-bit container. The interesting case is, we do have the logic in LXD to generate the policies for both personalities to avoid those kinds of issues. We have to do this because we sometimes have a 64-bit kernel, then a 32-bit user space, which runs another container runtime, which then loads a 64-bit user space, and then you can have another container runtime running a 32-bit user space. So you need to load the seccomp filters for all of the compatible architectures, so that you really block all the syscalls, because otherwise you might end up with a bypass. There is other weird stuff, really complicated, well, complicated corner cases to think about. So for example, I talked about the seccomp notifier, which is my new favorite toy, and a good example to illustrate how crazy things can get. Jann actually pointed this out, I think on the mailing list, when we implemented it.
So the seccomp notifier also lets you continue syscalls: you can load a seccomp filter, intercept syscalls, and then wait for the container manager to make a decision. But by the way seccomp is designed, the filter that was loaded latest is always the one that takes precedence. So somebody could load a seccomp notifier filter that gets asked first and continues a syscall before your own notifier filter triggers, and so a syscall gets continued even though you may have wanted to block it, and so on. So there are really weird, tricky corner cases. So: don't write your own container manager. Oh, right. Capabilities, ha, yeah. Technically, and this is something I didn't know for a long time, you can consider capabilities an LSM, but one that's always enabled, right? Hmm? Yeah, and capabilities on Linux are a kind of weird, complicated beast; I'm not going to explain how they are calculated. They were one way of splitting up the root privilege, I guess that's the idea, and I have all the LSM people here, so you can yell at me right away, yes? Okay, dammit. So, to split up the root privilege: I mean, root is technically able to do anything it wants to, well, we recently had the lockdown patch sets and so on, so even that is not necessarily true anymore. But often you might want to delegate specific privileges to unprivileged users, and the big hammer is obviously to give them root temporarily, like sudo does, okay?
Interesting, so, okay, to some extent both. So Casey's point was that the original motivation was more that privileged programs could drop all of the unnecessary privileges and only retain the privilege that they needed to perform a certain operation, so that, for example, when you had a binary that needed to do something critical and it got corrupted, or there was a bug, it couldn't crash the whole system. But the other way around, you can also use them to delegate privileges to unprivileged users, if needed. Earlier, this was usually done with setuid binaries, and they're insecure, and so on. So we have a bunch of capabilities. The biggest one is CAP_SYS_ADMIN. Michael, should I give you the mic? So, 45% of all capability checks in the kernel are for CAP_SYS_ADMIN, yeah. So CAP_SYS_ADMIN, well, the joke is that it's the new root, so splitting root up has partially worked, not completely. But you have privileges such as CAP_MKNOD, which, if you drop it, means you cannot create device nodes, CAP_SETUID, and a bunch of others. CAP_SYSLOG, you know that one? Yeah, I'm playing with that one right now, yeah. That one lets you read the kernel message buffer, and a bunch of other stuff. They come in one, two, three, four, five different sets: effective, inheritable, permitted, ambient, and bounding. Let it be known that you mostly care about effective right now, because that's what the kernel uses to check whether or not you can perform an operation, and permitted regulates which capabilities you can gain, but it also interacts with the bounding capability set, and inheritable is how you, I guess, inherit capabilities across execve, but that doesn't work the way people expect, so you have ambient capabilities, which are a way to get around that. And there is, I'm not joking, it's almost cursed: if you do `man capabilities` and you look at how capabilities are calculated, you will see set theory.
Like, you know, intersections, unions, it's fun. So it's rather complicated. And then you also have file capabilities which, as I said, are an alternative to setting the setuid bit. So for example, you can set the CAP_MKNOD file capability on a given binary, and if you execute that binary, the process will gain the CAP_MKNOD capability and can then create device nodes. The interesting part about capabilities is that, as with most privileges, they were originally only charged against the initial user namespace. So asking, for example, "do I have the capability to create device nodes, or to mount something?" was a question about the system in general: do I have the capability to do this? But with the introduction of user namespaces, capabilities are now, for the most part, charged against user namespaces. So capabilities have an owning user namespace, and instead of asking "do I have this capability?", you're now asking "do I have this capability in the user namespace that I'm currently in?" So for example, if I ask the question in the initial user namespace, can I create device nodes, do I have this capability, the answer is no. If I unshare a new user namespace, because it starts out with a full set of capabilities, the answer will be yes, you can. The problem is, and here it gets a little bit nasty, a lot of capability checks still ask the old question. For a lot of capabilities, CAP_MKNOD included, you still need to have the capability in the initial user namespace, not in the current user namespace, because otherwise you could attack the host again. All right, yeah, and I'm gonna try and show you a tiny bit of that, just this one, and, whoops, and this one, there we go. So you can use capsh to figure out what your current capabilities are. I should actually get out of that first, sorry.
So as an unprivileged user on my laptop, I should have nothing. Good, I've got nothing. But as Christian said, capabilities are tied to the user namespace, so when you unshare a new user namespace, you're gonna see quite a different result, which is literally every single capability. They're all tied to the user namespace; you don't have them against the initial user namespace, otherwise you would have a very, very big security problem. But they're there. So one thing I can show is, right now, did I pass -n? I did, yeah. So I can create new network devices. It's great, it works. Now I can do capsh --drop=cap_net_admin. That spawns a sub-shell that's got that capability dropped, and you're not gonna be allowed to do it anymore, and obviously you can't create the device. I didn't actually try that part, but I'm pretty sure it should work. So now if I, oops, I forgot that, so if I create a nested user namespace now, you've got that interesting property where you get all the capabilities again, which includes CAP_NET_ADMIN, even though I just ran it from the shell that dropped it. There are ways around this; you need to drop it from the bounding capability set, I believe, I can't quite remember. But to make things slightly more confusing, you'll notice my command is still failing. Well, that's because the network namespace is owned by the user namespace in which I dropped that capability. So even though it looks like I've got the capability, I don't actually have it against the right namespace. Sorry, it's still blocked. Right, and in the same way, I should have mentioned this before, sorry about that, the same way capabilities have an owning user namespace, all of the other namespaces also have an owning user namespace. So when you create different types of namespaces, the order in which you create them matters.
So if you, for example, create a new mount namespace but only then unshare a user namespace, the mount namespace was created before the user namespace and so still belongs to the initial user namespace. So if I try to mount something in that mount namespace, even though I'm technically the owner of the current user namespace, I still can't mount anything, because the capability is checked against the initial user namespace, which owns that mount namespace. So I need to unshare the user namespace first and then unshare the mount namespace, in which case the new mount namespace is owned by the correct user namespace, and now I can mount. Actually, that's trivial to show. Can you mess with doing them in a different order? Yeah, unshare the mount namespace first, then unshare a user namespace. Well, I need to do a user namespace first anyway, because as an unprivileged user I can't create a mount namespace otherwise. So: I unshare a user namespace, then a mount namespace, then another user namespace. Yeah, so now my first user namespace owns this mount namespace, but my new user namespace does not. And it should still work, right? But they're still the same user. Oh, actually not, it fails. Yeah, it's correct. Yay. So as we were saying, ordering matters with these things. For example, I have a diagram later on, I hope we still get to it, where you see how a container is actually started, and you will see that a bunch of namespaces are created right when we create the new process.
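The ownership rule from that demo, unshare the user namespace before the mount namespace or the mount namespace stays owned by the initial user namespace, can be encoded as a tiny ordering helper. This is purely illustrative; the function and the short namespace names are mine, not any runtime's API:

```python
def unshare_order(requested):
    """Order namespace unshares so each new namespace ends up owned by
    the new user namespace: 'user' must come first, everything else
    after. Illustrates the ownership rule, nothing more."""
    order = ["user", "mnt", "pid", "net", "cgroup"]
    return [ns for ns in order if ns in requested]

print(unshare_order({"mnt", "user"}))  # ['user', 'mnt']
```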
So we fork a new process within a set of new namespaces, but some of those namespaces we can't create right at creation time, because we need to do preliminary setup steps before we unshare the namespace and lose the privileges to do certain operations. And then the ownership between the namespaces also matters, as we said. Next, Linux security modules. Well, that's one you'll probably take; you're way more experienced with SELinux and AppArmor. And also I don't want to get yelled at. Mostly AppArmor, really. Yeah, so especially if you're running a privileged container, you do need to use the LSMs to try and make things vaguely sane. For unprivileged containers, they're not strictly needed, but they're a good safety net, and they're nice if you can use them inside the container to then properly isolate the applications inside the container, which is something we'll show with AppArmor. As for LSM stacking, we don't need to go into a lot of detail because John covered that part. We'd very much like all of that to be mainline and working; that'd be amazing for us, but it's been slow going and there's a lot of complexity. It's something we obviously watch quite closely. Right now, the main thing we use on our side is AppArmor, which does support a namespacing mechanism inside AppArmor itself, which lets us load a profile outside, apply it to the container, make it look from inside the container like it's got a clean slate, and then let the container itself load additional profiles, just as if it were running on the host. That'd be great to have for all the LSMs, but that's still a way off. And for unprivileged containers, as Stéphane mentioned, it's more or less optional; it's an additional safety net, because unprivileged containers, especially with the user namespace, are considered to be safe by default. So if you can break out of an unprivileged container that is not LSM-confined, then it would be a CVE, it would be a kernel bug.
Yeah, usually if you can escape an unprivileged container, it also means you could get root from a normal unprivileged user on your system, so it's usually pretty bad. But for a privileged container, where UID 0 inside and outside of the container mean the same thing, you will have to use LSMs if you want any kind of safety guarantees. Well, I guess the one exception would be if you never have anything running as root in the container. If you're running entirely as an unprivileged user, then it's not quite as bad. But if you're gonna have anything running as root inside a privileged container, then LSMs are pretty much a must. Now for the demo part of this one. Switch, yeah. Okay. So, my laptop running Ubuntu, we've got AppArmor on it. If I look at my current process, it is not confined, so it's running unconfined. We can use an existing loaded profile and switch to it, even as an unprivileged user. We obviously can't load additional profiles that way, but we see that that particular terminal is now under that profile and it's being enforced. To show something slightly more interesting as far as how that stuff works, and that's a bit painful to do from a straight shell, I'm just gonna spawn an LXD container. Okay. Now, if we look at that container, we get the PID of its init process, and let's look at /proc/PID/attr/current. And we see this super confusing string. That's because, first of all, we like using complex names for our profiles in LXD in general, because we want it to be possible to run multiple LXD instances on the same system, especially for our testing environment, and not have the profiles ever clash. So the path that's used for LXD storage is encoded in the name of all the profiles and namespaces, which makes them quite a bit longer. So lxd-c1_&lt;/var/lib/lxd&gt; is a namespace; we also have a lxd-c1_&lt;/var/lib/lxd&gt; profile, which makes that string just that much more confusing.
And we see at the end that the profile that's gonna be seen inside the container is gonna be unconfined. So we can go look at that inside c1. I get a root shell in the container, and if I look at my own view, I see unconfined. But you'll also notice that that particular container has a bunch of profiles of its own currently loaded. It's not really easy to show, I guess, but now that the tcpdump process is started, I guess we can check it from there. So we get another shell in the container, we look at that tcpdump process, and it's confined with a profile that was loaded from inside the container. So that's the state of things with AppArmor. We're hoping to get that and more with all of the other LSMs, so we can start mixing and matching and have a few containers running on SELinux and a few containers running on AppArmor. That'd be great for us, because we obviously build and provide container images for things like CentOS, Fedora, et cetera. And right now, when you run them on an Ubuntu server, they're gonna get an AppArmor namespace and not do much with it. If we could set up SELinux in the container, then they would be able to have enforcement, which would be nice. Yeah, and last but not least, I think this is to some extent the cherry on top: cgroups, which are mostly concerned with resource isolation, I guess, or resource restriction. But there are some oddballs in there, actually. Oh, cgroups. How many people are using cgroups? That would be interesting. Oh, okay, quite a few. I mean, the rest are, they just don't know. Yes, but actively. Anyone using systemd is using cgroups; Chrome runs in a dedicated cgroup. But are you using v1 or v2? Who knows about v1 versus v2? Okay, that's also quite a bunch of people. So, resource limitation.
So we're talking about stuff like CPU, block I/O, and so on, and we use it for every container. Because even if you're an unprivileged container and you create a fork bomb inside of the container, you could still exhaust... well, PID namespaces give you some control, but you could still create a lot of processes and possibly starve the host of PIDs, especially if you run in the same PID namespace as the host, which some container runtimes do, which is a bad idea, but that's definitely a problem. But you have the pids cgroup, and the pids cgroup lets you limit how many processes you're allowed to create. So you can move your container into a separate pids cgroup and then set a limit on how many processes it is allowed to create. And if it has exhausted that limit, it will get an error back, EAGAIN I think, and cannot create any new processes anymore. The same thing with how many CPUs it will be able to use: the cpuset controller, at least in the legacy v1 hierarchy, lets you restrict which CPUs you can use. Let's say you had a 10-CPU system and you said, I only want this container to execute on one or two CPUs; then you could move it into a separate cgroup and restrict the CPUs it is allowed to execute on. So they allow you to restrict resources; block I/O, same thing. Then there are a bunch of oddballs in the cgroup hierarchy. The devices controller, for example, is kind of odd in the sense that it's not really resource restriction; it's actually more in the direction of restricting permissions. The devices cgroup controller lets you specify which devices a container can or can't access, sort of like a blacklist or a whitelist. And you have the freezer cgroup, which lets you freeze a set of processes.
So, for example, what I tended to do for a long time, even though it didn't work correctly, was with Chrome: I didn't always want to shut down the browser and then have to reload all my tabs and so on. So instead, I had it in a separate cgroup, and then I'd write to the freezer cgroup and Chrome just got frozen, all its processes, and later on, when I needed it, I'd continue it. That's also kind of an odd controller. So you have a bunch of controllers that don't really fit into the whole resource-restriction picture that cgroups give you. But we make heavy use of them. One of the problems is that cgroups come in two different, incompatible versions: cgroup v1 and cgroup v2. For example, the delegation model: if you want to delegate cgroups to a process, the models for how to do this in v1 and v2 are vastly, vastly different, and I'm sparing you the details right now. Another glaring example is that cgroup v1 is a complete pseudo-filesystem. That means you mount the cgroup filesystems, you mount the specific controller, such as the cpuset controller, which restricts which CPUs you can execute on, and the way you set up the cpuset cgroup that you just created is by writing into a bunch of files. So you always go through the filesystem. That has race conditions, okay, that's a problem, but it was at least nice and easy to do. With cgroup v2, we now have a model where part of the cgroups are filesystem-based, such as cpuset and, I guess, io and memory, but some of the cgroups are now BPF-based. The devices controller is a BPF controller, which means you cannot configure it by going through the filesystem anymore. So you see, this poses a lot of problems if you think about scenarios where you have a new system that uses the cgroup v2 hierarchy by default, so you boot up your system.
There's only the new cgroup v2 hierarchy, and now you want to run containers that run a full system, and the init system in there, in this case usually systemd, doesn't know about the cgroup v2 hierarchy, or not about new features of the cgroup v2 hierarchy. Then you have a problem; this won't work anymore, because the init system considers that cgroup layout crucial. So the problem you have is that running distros that only understand the cgroup v1 hierarchy on a host that only uses the cgroup v2 hierarchy is a big problem, especially if you consider that a lot of people use containers precisely to run legacy apps that they don't want to run on the host anymore. So you have an incompatibility right there. It's otherwise a great tool, and we're working on ways to get around this problem. But yeah, for resource restriction, cgroups are the tool to use. Yeah, the other thing worth mentioning is that cgroups are not always particularly nicely integrated in the kernel. One of the things you'll notice is, if you set a limit on your cpuset, so you pin to a specific number of cores, and then you set some memory limits, you would potentially expect to see that in /proc/cpuinfo or /proc/meminfo or in tools like free, but you won't. Those files will always show you the global system resources; they will not show you your actual restrictions. Which then wreaks havoc on applications that go and look at those files to figure out what they can actually use. On the memory side, Java is known for blowing up quite badly whenever it's got a memory limit applied. It's a problem we've had for a long time, and we've worked around it on our side with a FUSE filesystem called LXCFS, which inspects your process's cgroups and builds virtual versions of those files that can be mounted over the real thing. I'll be showing that too.
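The core idea behind what LXCFS does for a file like /proc/meminfo boils down to: take the host value, take the cgroup limit, and report the smaller of the two. A toy version of that calculation; the function name and the numbers are mine, not LXCFS code:

```python
def effective_memtotal(host_memtotal_kb, limit_in_bytes):
    """Roughly what a virtualized MemTotal line should say: report the
    cgroup memory limit when it is lower than the host total. A v1
    memory cgroup with no limit set reports a huge sentinel value, so
    taking the minimum handles the 'unlimited' case too."""
    limit_kb = limit_in_bytes // 1024
    return min(host_memtotal_kb, limit_kb)

# 16 GiB host, 1 GiB cgroup limit -> the container should see 1 GiB.
host_kb = 16 * 1024 * 1024
limit = 1 * 1024 * 1024 * 1024
print(effective_memtotal(host_kb, limit))  # 1048576
```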
So on a Linux system, you can look at the list of cgroup controllers that are supported on your machine, as well as the number of cgroups that have been created on the system, by looking at /proc/cgroups. We should mention this is a v1 hierarchy. Yeah, and if you look at /proc/self/cgroup, you get your path in every one of those cgroup controllers. In this case, we actually see both hierarchies: all of the individual controllers are v1, but the last entry is a v2 hierarchy, because most distributions have a hybrid model where the process tracking that systemd does is done in an empty v2 hierarchy. There are no controllers attached to it, so there's no resource control going on there, but the process tracking itself happens in that hierarchy, whereas v1 is still used for resource control. So any CPU or memory limits you might have in place are all under the v1 controllers. And those are traditionally mounted under /sys/fs/cgroup, so you can see basically one directory per controller. And if you go look inside one of those, you see all the individual files that you can write to to apply limits, and then directories that have been created for your users. Usually they're all separate mount points. So what most Linux distributions do is mount every controller separately. Yeah, you see it right here: net_cls, hugetlb, devices, cpuset, memory, pids, cpu, and so on are all concerned with different kinds of resources, or at least let you restrict resources in different ways. And what most distributions do for the v1 hierarchy is have separate mount points for all of them. With two exceptions: there's co-mounting going on for the CPU stuff and for the network stuff. Yeah, so this right here is the type of controller that is currently mounted, but there is no restriction on how you can actually mount them.
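You can tell the two hierarchies apart just by looking at /proc/self/cgroup: v1 entries carry a controller name in the middle field, while the single v2 entry has an empty one. A small parser, fed a sample here rather than the live file so the output is predictable:

```python
def parse_proc_cgroup(text):
    """Split /proc/<pid>/cgroup content into (v1 controllers -> path,
    v2 path). Lines look like 'ID:controller[,controller...]:/path';
    the unified v2 entry ('0::/path') has an empty controller field."""
    v1, v2_path = {}, None
    for line in text.strip().splitlines():
        _, controllers, path = line.split(":", 2)
        if controllers == "":
            v2_path = path  # unified (v2) hierarchy
        else:
            for c in controllers.split(","):  # co-mounted controllers
                v1[c] = path
    return v1, v2_path

# Sample content in the hybrid layout described above:
sample = """\
12:pids:/user.slice/user-1000.slice
11:cpu,cpuacct:/
0::/user.slice/user-1000.slice/session-2.scope
"""
v1, v2 = parse_proc_cgroup(sample)
print(v1["pids"], v2)
```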
So you could, for example, and some Linux distributions a long, long time ago used to do this, set up a single mount point and then co-mount all of the controllers into a single hierarchy, at which point all of the files that are usually located just under freezer will show up alongside the devices, cpuset, memory, pids, cpu, and so on files, all under one directory. That model is completely gone with the cgroup v2 hierarchy. With cgroup v2, you only have a single mount point, and controllers are enabled or disabled by writing to a file specific to the cgroup v2 mount. So this is really completely different just in the way they are set up. As I was saying, cpu and cpuacct, and net_cls and net_prio, are co-mounted. So here, if we look at /sys/fs/cgroup/cpu, we can see that we've got both cpu.* files and cpuacct.* files; that's what happens when two controllers are co-mounted. If we were to do the worst-case scenario and mount them all together, then you'd have everything in one spot. So first, the usual: I want a PID namespace, I want root in there. Okay, I'm interested in my PID. So I'm switching to another terminal, where I'm gonna get actual root. And we'll create a new cgroup under the pids controller, so let's call it demo. Then we echo the PID of the process I've got running in that namespace into its cgroup.procs. You wanna show your current cgroup first? For that process? Yeah, sure. Okay, so currently the pids controller is under some systemd-generated path for that process. So now I'm moving it into that cgroup, and we see it's been moved to /demo. In that one, I can now check the number of processes. We see I've got one process running in there, and my process limit is currently max. Now let's say I'm gonna be annoying and only allow a single process in that cgroup.
What do you think's gonna happen if I try to run any command in there now? This happens. So yeah, that's not gonna go so well; even the shell's just gonna give up, I think. But we can be nice, so let's bump that to five. Yay. So that's one of the controllers and how they work; you could do the same thing with any of the other resource controllers. You could actually unshare the cgroup namespace and remount; the cgroup namespace lets you restrict what you see. Right now you still see the full path, right? You can guess that on the host I'm located in /sys/fs/cgroup/pids/demo. What's the unshare flag for the cgroup namespace? Oh, capital C. Capital C. Yeah, so if I look at all those cgroups now... sorry, probably five is not enough, fine, there you get 10, happy? There we go. So I'm still in the exact same cgroups; it's just that now that I've unshared a cgroup namespace, whatever cgroups I was in now become the root as far as the container sees. Everything has been reset to slash. Note, Stéphane hasn't done this yet, but you still need to remount the actual cgroup filesystem, because the currently mounted one still lets you access the whole hierarchy. Yeah, that's still gonna be pretty confused. So now, if you do a remount of the pids cgroup controller... well, a new mount on top, does that work? Oh, I don't think I unshared the mount namespace. Already mounted, okay. But now that I've got the namespace... no, I can't remount it. You think I can remount it? I'm not sure I can. Really? Come on, remount, remount. Nope. I'm not the owner of the original mount. Ah, okay.
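The demo above is exactly the pids controller's file interface: make a directory under the controller mount, write the limit to pids.max, append a PID to cgroup.procs. A hedged sketch of those same file operations, run here against a scratch directory instead of the real /sys/fs/cgroup/pids mount (writing there requires root):

```python
import os
import tempfile

def setup_pids_cgroup(base, name, pid, max_pids):
    """Create cgroup `name` under `base`, cap it at `max_pids`
    processes, and move `pid` into it -- the same file operations you
    would do against the real pids controller mount."""
    path = os.path.join(base, name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "pids.max"), "w") as f:
        f.write(str(max_pids))
    with open(os.path.join(path, "cgroup.procs"), "a") as f:
        f.write(f"{pid}\n")
    return path

# Demonstrated against a temporary directory; on a real system `base`
# would be something like /sys/fs/cgroup/pids.
base = tempfile.mkdtemp()
path = setup_pids_cgroup(base, "demo", os.getpid(), 5)
print(open(os.path.join(path, "pids.max")).read())  # 5
```

On the real mount, the kernel creates pids.max and cgroup.procs for you the moment you mkdir the cgroup; here the writes create them in the scratch directory.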
Anyway, normally we don't really have that problem with containers, because we just make sure they don't see any of the host's cgroup mounts in the first place and they just mount clean copies. Yeah. Which is obviously not the case here. So, more cgroup stuff we can show you, the slightly more interesting cases. You remember that c1 container I created earlier. Right now, let's just see what's going on in there. So if I go there and look at my memory, I see the entirety of the 16 gigs of my laptop, and we've got four CPUs. Now let's mess with the limits a bit: limits.cpu=2, limits.memory=1GB. And go back in there, and now we've got the limits applied. So as I mentioned, in those files you would normally see the entirety of the system; the only reason we're seeing the right values is because we've got LXCFS mounted, and that over-mounts those files with the real thing. Now, if they weren't there, so if I unmount cpuinfo and meminfo, then even though I've got a limit in place, I see all of the host's resources. So the software running in that container would get very confused; it would go, oh, I can use 16 gigs of RAM, try to allocate 12, and explode. Some programs will really go, oh, you've got 32 gigs of RAM, I'm now going to pre-allocate eight gigabytes of RAM; and then, obviously, if you've restricted it to, say, only four gigabytes, it gets hilarious. Yeah, another thing we can see in those files is uptime, which is also covered by LXCFS. That's because otherwise, if you were to run uptime, you would see your host's uptime, not the container's uptime. So we can see we've got 18 minutes; it just looks at when the init process of the container was first spawned. If that hadn't been mounted, then we would be seeing six days. People kind of like knowing when the container started; it's useful for monitoring systems and a bunch of other things too.
So that's another thing we've had to paper over with LXCFS. Yeah, and this is obviously an area where containers differ from virtual machines. Right, and this is something I mentioned before. It's just a small, hackish diagram; I suck at drawing and I suck at diagrams, so I'm very sorry, this is the best I can do. The idea is just that you have a container manager and you have the container over there, and what usually happens is that the container manager is responsible for creating a new process. So it spawns the container, but with the clone system call in the Linux kernel, you can also specify the set of namespaces that you want that process to be created in. And what we usually do for unprivileged containers is spawn the user namespace, the PID namespace, the IPC and the UTS namespace together when the container is set up, but we don't spawn a network namespace and we don't spawn a cgroup namespace. And the reason is that if you were to unshare the cgroup namespace right away and then move the container later on to the correct cgroup, the view would be totally off. So we need to defer creating the cgroup namespace until the point where we have created new cgroups for the container and moved the container process into those cgroups, and only then unshare the cgroup namespace and remount, so the view is correct. The same way for network devices: there is an ordering issue between network namespaces and user namespaces. If you unshare them together and then write an ID mapping, the ownership of the network devices will be off, at least for some kernels; I don't know if it's been fixed in the meantime. So what you need to do is create the user namespace first, write the mapping, and then unshare the network namespace, so the ownership of the network devices comes out correct. So you have a bunch of different steps.
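That split, some namespaces passed straight to clone(), others unshared later once their prerequisites exist, can be summarized in a short sketch. The flag values are the real CLONE_* constants from linux/sched.h; the grouping mirrors what was just described, not any particular runtime's code:

```python
# CLONE_* flag values from linux/sched.h.
CLONE_NEWNS     = 0x00020000  # mount namespace
CLONE_NEWCGROUP = 0x02000000
CLONE_NEWUTS    = 0x04000000
CLONE_NEWIPC    = 0x08000000
CLONE_NEWUSER   = 0x10000000
CLONE_NEWPID    = 0x20000000
CLONE_NEWNET    = 0x40000000

# Namespaces created together at clone() time for an unprivileged
# container, per the description above...
AT_CLONE = CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWUTS
# ...and the ones deferred: the cgroup namespace waits until the
# container sits in its final cgroups, the network namespace until the
# ID mapping has been written so device ownership comes out right.
DEFERRED = CLONE_NEWCGROUP | CLONE_NEWNET

print(hex(AT_CLONE), hex(DEFERRED))
```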
The container manager needs to do some of this, for example create the cgroups and write the limits, because it's usually a process that is more privileged than the container being created; the container can't move itself into cgroups if it's not privileged enough, given that it unshared a user namespace. It also sometimes has to do some networking setup, depending on whether that requires privileges, and it might preserve the namespaces so that it can easily attach to the container later, and so on. That all needs to be done, and it all needs to be synchronized with what the container's setup process is doing, which, for example, writes the ID mappings right after it creates the user namespace, meaning it basically becomes root in the new user namespace it created; then it unshares the cgroup namespace after the cgroups have been created; it creates the network namespace after the network devices have been set up; and it also applies seccomp, because the process can only do that itself, and sets its own LSM profile. The way we usually do this is by providing synchronization barriers, which is a fancy way of saying, for example, that you can do this via sockets and send messages that indicate when the container manager is done and tell the container process to go on. And then finally, when the container process has finished setting itself up, it execs the init binary. In our case, we usually just boot systemd, and then you have a running container, and the container manager continues to supervise it for its whole life cycle. Usually you, for example, poll on the container, and then you reap it and get the exit status and so on.
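That barrier mechanism, the manager and the container process lock-stepping over a socket, can be sketched with an ordinary socketpair and a fork. This is a toy with made-up message names, not LXC's actual protocol:

```python
import os
import socket

def sync_demo():
    mgr_end, child_end = socket.socketpair()
    pid = os.fork()
    if pid == 0:  # "container" side
        mgr_end.close()
        # Barrier: wait until the manager has created our cgroups.
        assert child_end.recv(64) == b"cgroups-ready"
        # ...here the real setup would unshare the cgroup namespace,
        # apply seccomp and the LSM profile, then exec the init binary.
        child_end.sendall(b"setup-done")
        os._exit(0)
    child_end.close()
    # "manager" side: create cgroups, write limits, release the barrier.
    mgr_end.sendall(b"cgroups-ready")
    reply = mgr_end.recv(64)
    os.waitpid(pid, 0)
    return reply

print(sync_demo())  # b'setup-done'
```

The important property is that neither side proceeds past a step until the other has confirmed its prerequisite is done, which is exactly what the ordering constraints above require.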
One of the crucial design decisions when you design a container runtime, which you would think is a given but isn't necessarily, is what happens when the container manager goes away. There are two scenarios, right? Sometimes you want the containers to die with the container manager, but that's usually not the common case. So for example, let's say we have a long-running daemon that supervises a bunch of containers at the same time, LXD, and then it has a bug and the daemon crashes. The last thing you want is for all of your containers to go down together with the daemon. So usually what happens is the containers keep on going, and then you can just restart the daemon. I think that's one of the main design principles. Yeah, for that we've actually got an intermediate, per-container monitoring process that lives for the entire lifetime of the container. If that one process dies, it takes the container along with it. But that process is reasonably simple. It has, effectively, a Unix socket API that we can talk to to figure out what's going on and get some of the FDs and things we need to be able to attach and mess with the container. But we need to make sure that that particular API is backward compatible, because you might be upgrading the tools and you don't want to restart the containers; you need to be pretty careful with those things. But yeah, if you've got a full container manager that tracks a bunch of containers, you don't want that to be the actual parent of everything. This is, I would say, the standard way of setting up containers. I mean, in a nutshell, systemd-nspawn does the same thing, it also runs system containers; LXC does it this way; runc does it a little bit differently.
I think they've written it as some sort of complex state machine, but this is the easy way you would usually do it: you synchronize between two processes, you exec, you supervise the container, and so on. Yeah, the other thing to mention, on top of the ordering of the namespaces, is the ordering of the mounts. Because if you mount /proc and /sys too early, you're going to have the right process hierarchy, but the permissions on, say, /proc/sys/net are going to be off, because you mounted it before you entered the namespace, so it's still tied to the host instead of the container. So that's some of the stuff you want to check: that all the filesystems you mount have the right permissions for the resources you expect. If not, you probably misordered something. Right, I think that's mostly what we wanted to cover, so we can do a quick recap. You want to do a couple of points? Well, yeah, sure. Okay, so right here in this setup step, the container manager spawns the process that later on, once it has exec'd, becomes the actual container. You see that among the namespaces it lists the user namespace. The easy way to put it is that a privileged container skips this step; but in a more detailed way, you can say a privileged container is any container where UID 0 on the host, on the actual system, and UID 0 inside of the container have the exact same meaning. So if you ask yourself the question, what happens if a process with UID 0 inside of the container breaks out, and the answer is, oh my God, the world is going to end slash my computer is shutting down, that's a privileged container. If you think, ah, if UID 0 inside of my container breaks out, then it can't do much...
I mean, I still care, I still don't want it to happen, but it's not really a big problem, because UID 0 inside of the container and UID 0 outside of the container mean totally different things. That's the crux of it. So for example, to make this a little more complicated, but I think you're all up to the task: even when you create a new user namespace, you can specify, for whatever reason, an identity mapping. You can say, I want a new user namespace, but I want to map UID 0 to UID 0, UID 1 to UID 1, and so on upwards. And then you still have the problem that UID 0 on the host is the same as UID 0 inside of the container when a breakout happens. So when that process actually escapes to the host, it has all the privileges that UID 0 has on the host. That's why we usually say: as soon as you have a container where UID 0 inside and UID 0 outside mean the same thing, that's a privileged container. If they don't mean the same thing, then you're fine. And the way they don't mean the same thing is what we showed before. If you look at any of those containers, for example the c1 container, and run id -u inside of it, you see zero right here. If you look at the whole process tree for that container from the outside, in the ps output, you see a completely different ID. So look at any of those containers: c1, and you see the whole process tree right here; lxc exec, /bin/bash, and you see it, you can see its init and everything underneath. You see it's gonna be one million as the base UID. So you see the lxc monitor process, and then you see the child process, which is the actual container, /sbin/init, and then the whole process tree. And if you look at the leftmost column, you'll see that most of these processes run with UID, what was it, one million? Yeah, one million.
1000000, 1000100, 1000101; so inside of the container, those processes run as UID 0, 100, or 101. UID 0 inside of the container has a totally different meaning than it has on the host. That's what an unprivileged container is. That's the same process tree as we saw, but from inside the container. And the other thing we can see from inside the container is its uid_map. Those maps are always a bit confusing to read, but what this one means is that UID 0 in the container is UID 1000000 outside the container, and that there's up to a billion UIDs and GIDs mapped after that. So if I were to use, say, UID one-billion-and-one, the kernel would just tell me that it doesn't exist. You can do a lot of things with those maps. You can do hole punching, so you could actually say, I'm gonna map UID 0 to a million outside, and I'm gonna map UID 10,000 to some other number outside, and nothing else. And if you try to use anything else inside the container, you're gonna have a bad time, because those UIDs and GIDs will just not exist. But that also allows you to, for example, punch a hole through for your own UID. If you want your UID 1000 outside the container to be UID 1000 in the container, you can do that; you'll end up with three maps: one for the first 1000 UIDs, then the hole, and then everything after that. While we're talking details, whoever has paid close attention will now realize: hmm, what about a scenario where we have a bunch of unprivileged containers running? Great, the host is protected, right? But I worry about the scenario where I have two different processes that I really don't want to interact with each other, and I'm afraid that one of the processes might, in whatever complicated way, escape into another container.
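Stepping back to the uid_map just shown: each line is "container start, host start, count", and resolving a container UID to a host UID is simple arithmetic. A small helper, with a hole-punched sample map of the kind just described (the concrete numbers are illustrative):

```python
def host_uid(uid_map, container_uid):
    """Translate a UID inside the container to the host UID, given
    /proc/<pid>/uid_map content. Each line reads
    '<container-start> <host-start> <count>'. Returns None for UIDs
    that fall into a hole, which the kernel treats as nonexistent."""
    for line in uid_map.strip().splitlines():
        c_start, h_start, count = map(int, line.split())
        if c_start <= container_uid < c_start + count:
            return h_start + (container_uid - c_start)
    return None

# A map with a hole punched through for UID 1000, as described above:
uid_map = """\
0 1000000 1000
1000 1000 1
1001 1001001 999998999
"""
print(host_uid(uid_map, 0))     # 1000000
print(host_uid(uid_map, 1000))  # 1000
```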
And if they have the same mapping, you now have the problem that, even though the containers are unprivileged, they can attack each other, which you might care about if, for example, you have multiple tenants all running unprivileged containers. The way to solve this is by giving each container its own individual ID map. We call these isolated ID maps. There's an option for it. This is container-runtime specific, but it's not magic; you can implement it, well, we can implement it. You start a container, you start a second container, and the container manager takes care that all of these containers have individual ID mappings that don't overlap. So if a process from one container escapes into another container, it will be an unprivileged user there, the same way it would be when escaping to the host. There are problems with this, obviously. As a container manager, you only know about your own isolated ID mappings, so if some other container manager or some other process reuses ID mappings that overlap with one of your containers, you still have an attack vector, because there is no nice way of coordinating the reservation of ID mappings on a system. There may be a way in the future to do this in the kernel. Yeah, we might get a way to not have to care in user space, which would be nice. That will require a lot of thinking, but it would be a nice security feature. It's also possible for us because each container gets a separate root filesystem, and the UIDs and GIDs of all the files on that filesystem can be chowned to the mapping of that specific container. But now think about a runtime, which is what a lot of users care about, where you have a layered approach, where multiple containers share the same filesystem layer.
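The non-overlapping allocation a container manager does for isolated ID maps amounts to a trivial range allocator. A sketch under assumed defaults (the class name, base, and per-container range size are illustrative, not LXD's actual values):

```python
class IsolatedIdAllocator:
    """Hand out non-overlapping host UID/GID ranges, one per container.

    Only this manager's own allocations are known; nothing stops another
    process on the system from reusing the same ranges, which is exactly
    the coordination gap discussed in the talk.
    """

    def __init__(self, base=1000000, size=65536):
        self.base = base          # first host UID available to containers
        self.size = size          # UIDs reserved per container
        self.allocations = {}

    def allocate(self, container):
        if container not in self.allocations:
            # Next free range starts right after the last one handed out.
            start = self.base + len(self.allocations) * self.size
            self.allocations[container] = (start, self.size)
        return self.allocations[container]

alloc = IsolatedIdAllocator()
print(alloc.allocate("c3"))  # (1000000, 65536)
print(alloc.allocate("c4"))  # (1065536, 65536)
```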
Now you run into the problem that if you want all of those containers to have separate, non-overlapping mappings, they can't share layers, because they can't all write to the underlying filesystem. The way around this is to write a filesystem that fakes the ID mappings on the fly, which is something that we've done, but it's currently just an Ubuntu-specific patch set. We are working on upstreaming it, believe me. We just needed something, because we'd been waiting a long time for this, but the upstream solution will likely look very different from what we originally implemented as a proof of concept. If you have a filesystem that translates the IDs on the fly, then you don't have this problem. So there's a lot of things to think about. Yeah, so as he was talking, I was actually deploying those containers. I've now got five containers running. If you look at the bottom, the first two in the first column are c1 and c2, which were the ones we created earlier. They're both using a million as the base UID, so the root UID. Then I created two more using the isolated feature, and we see that they're using completely distinct UIDs as root, with no overlap in their maps. And the last one I created was privileged, so we see that one is actually running as root. Right. I know user namespaces are probably, especially if you haven't worked a lot with this, very confusing. They're a complicated but powerful tool. So, as a quick recap before we take a few questions: containers are effectively a user-space fiction. There's no such thing as a container in the kernel. It's piecemeal, in that you take whatever components you want, put them together, and you might call that a container, but then someone else might not agree with you that it is one. That is not a joke.
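The per-container chown mentioned above, shifting every file's IDs into the container's range before start, can be sketched as a dry-run pass over (path, uid, gid) entries. The function name and range size are my assumptions; a real runtime would walk the rootfs and call lchown, and a shifting filesystem performs this same translation on the fly instead:

```python
def plan_rootfs_shift(entries, base, size=65536):
    """Compute the chowns needed to move a rootfs into a container's ID range.

    entries: iterable of (path, uid, gid) with container-relative IDs.
    Returns (path, host_uid, host_gid) tuples; IDs outside the allotted
    range are rejected rather than silently mapped.
    """
    plan = []
    for path, uid, gid in entries:
        if not (0 <= uid < size and 0 <= gid < size):
            raise ValueError(f"{path}: ID outside the container's range")
        plan.append((path, base + uid, base + gid))
    return plan

rootfs = [("/bin/bash", 0, 0), ("/home/user", 1000, 1000)]
print(plan_rootfs_shift(rootfs, base=1000000))
# [('/bin/bash', 1000000, 1000000), ('/home/user', 1001000, 1001000)]
```

Because this rewrites the files themselves, two containers with different maps can't share the layer, which is the problem the on-the-fly translating filesystem solves.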
As we said, we once attended a conference with a bunch of people who all worked on different runtimes in the container space, and we argued for one and a half hours, literally argued, about what a container is. There is no agreement. And as the previous slide was showing, building containers can be pretty hard. You need to do things in the right order, there's a lot of complexity around that, and there are a lot of different kernel APIs with different concepts that don't always line up, so you need to be somewhat careful there. Also, one second, I actually had this listed on my slide: the architecture, or rather multiple architectures, can matter. You need to be careful when you generate seccomp policies, because if the kernel supports multiple personalities, you need to block a syscall for both at the same time, otherwise there's a pretty easy way to bypass your filter. Security definitely matters for containers; you're sharing the kernel. If you misconfigure a container, if you pass in a device you shouldn't be passing, it's game over pretty quickly. Privileged containers are not a good idea. I would not recommend that anyone who starts to play with any of this stuff skip user namespaces. We've done a lot of work over the past few years to make user namespaces work for the vast majority of use cases, and the syscall interception work allows for even more, because now the container runtime can mediate and fake syscalls as needed to grant extra privileges in some particular cases. Resource management is not something you should just ignore either. DoS attacks are still attacks, even if they can't escape, and taking your entire host down, and all the containers along with it, is a bit of a problem. So you also want to set resource limits to prevent fork bombs, running out of memory, or even things like using up all of your disk or network I/O.
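Setting resource limits like the ones just mentioned is straightforward from user space. Here's a minimal sketch applying per-process rlimits to a child before it execs; the specific limit values and helper name are illustrative, and a real runtime would lean on cgroups for the host-wide limits:

```python
import resource
import subprocess
import sys

def run_limited(cmd, max_file_bytes=1 << 20, max_procs=256):
    """Run cmd with rlimits applied in the child only, before exec."""
    def apply_limits():
        # Cap file sizes so a tenant can't fill the disk with one file ...
        resource.setrlimit(resource.RLIMIT_FSIZE,
                           (max_file_bytes, max_file_bytes))
        # ... and cap the process count to blunt fork bombs.
        resource.setrlimit(resource.RLIMIT_NPROC, (max_procs, max_procs))
    return subprocess.run(cmd, preexec_fn=apply_limits,
                          capture_output=True, text=True)

# The child reports the soft RLIMIT_FSIZE it inherited.
result = run_limited([sys.executable, "-c",
                      "import resource; "
                      "print(resource.getrlimit(resource.RLIMIT_FSIZE)[0])"])
print(result.stdout.strip())  # 1048576
```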
That's all pretty important to set up right, especially if you're going to be running some amount of untrusted code in there. Don't reinvent the wheel, like, really don't. There are a bunch of libraries and tools out there. There's a pretty good chance that a runtime already exists that does what you want and that has already gone through a lot of that mess, so you should be using that. If you can't, then at least use the proper libraries for the different components. liblxc itself, which we use for our system containers, offers a few pieces you can use to manage all of that in one shot if you want. But you can also effectively do it yourself. You'd have to do the namespaces yourself, as there's no proper library around that, but then you've got libraries for cgroups, for capabilities like libcap, for SELinux and AppArmor. All of those have their own libraries that abstract a bunch of the common use cases and help you avoid common mistakes by having the library do the right thing for you. And that's it. If you've got any more questions, I think we've got about five minutes before it's dinnertime. Yes? Well, you should probably run with the microphone when you have one of those. That's a good point. Okay, so let's assume that I have an unprivileged container. I have some seccomp rules, for example not loading kernel modules, even though that's kind of redundant in an unprivileged container. We also have some cgroups, you know, the defaults for LXC, basically. I know you block some of the system calls that are kind of dangerous; we also block mount and mknod. Yeah, basically. Let's further assume that I keep up with kernel updates, so there are no known bugs in syscalls. Now let's say an attacker comes in and gets a shell via an application that is running in a container. How can that attacker break out of the container and get onto the host?
And the second question, basically, is: if there isn't really any way for them to get out, is there anything else the attacker can do to mess with the system? So, for an unprivileged container, in theory there shouldn't be much they can do. I mean, they can wreak havoc in the container, so you need to make sure they can't run you out of resources, because the easiest thing they can do is DoS you: they will try a fork bomb, they will try to fill the entire hard disk, they will try to use all your network bandwidth, use all the CPU, that kind of thing. As for escaping, other than through a kernel bug that they're aware of and that has not been fixed yet, not much should happen there. The one thing to be careful about, on top of all of the kernel features, is that your container runtime might be exposing itself to the container in some way. If it's passing some kind of socket or some kind of file into the container, that would be a way of attacking the container runtime itself, at which point you'd be root on the host. So that's another common thing to keep an eye on. One thing that's also somewhat worth mentioning: it's not really an experiment, because the terms of service say not to abuse it, but we have been running a feature on our website for years now where anyone can just click a button and get root inside an LXD container that has container nesting enabled. The main idea is that people can play with LXD online before installing it locally. But it's also effectively a shared environment where we hand out root shells inside unprivileged containers to anyone on the internet. We've not seen problems with that yet. We have seen, I mean, the obvious. We pretty regularly have people trying fork bombs; we've got process limits that block that. Some people have been trying to fill the filesystem; same thing, we've got quotas, that's fine.
We limit CPU and memory quite strictly; that works fine too. Networking is obviously an issue too, so we only allow access to the few servers that we trust. We don't want people mining Bitcoin or whatever in there, or trying to attack our network. But other than that, in general, for a well-designed, well-configured, unprivileged container, there shouldn't be any way of escaping. And if there is, it's a critical security issue in the Linux kernel that might also let an unprivileged user outside of a container escalate to root, so those are usually treated as such. That obviously does mean that you need to be looking pretty closely at kernel updates. If your distribution supports live patching, that tends to be pretty useful. Otherwise, we usually have a policy that as soon as a new kernel lands, the host reboots, whether the containers like it or not, because you want to be patched, especially for untrusted workloads. To be fair, for a long time people have been saying, rightly so, I think, that you have virtual machines and you have containers, and if you really, really, really care about security, you use virtual machines. But thanks to Jan, we can now say: Spectre. The attack surfaces are still different to some extent, I think, but containers have become much safer over time. The problem is usually just that running privileged containers is still the standard, because it's easier, right? I mean, unprivileged containers do have the problem that, yes, they come with restrictions. And we expand what they can do more and more.
As we grow in confidence that certain kernel features and certain operations are safe, we expand the abilities of user namespaces, or find feasible workarounds such that not just the kernel is in charge of deciding when an operation is safe, for example in the mknod case with seccomp notify, where we can also delegate the decision to the container manager, which often has more context than the kernel itself. But there are definitely still limitations, so running an arbitrary workload in an unprivileged container is not necessarily easy without at least some configuration effort. But the effort is worth it, because if recent years have shown one thing, it's that privileged containers cause a lot of CVEs. And a lot of those CVEs, and I think we mentioned this in the San Diego talk before, would simply not be possible if you used unprivileged containers. There's obviously the aspect that the initial tooling, from when container runtimes were first written, was focused on privileged containers. So there is precedence: people are acquainted with a certain set of tools and they don't want to move away from those tools. You have workflows established in your company and so on, and migrating to a more secure solution comes with costs that you're not necessarily willing to pay. Or it's just easier this way. But yeah, it's problematic. And I think that's also part of why, for a long time, containers had the reputation of being less safe. Yes, they are less safe if you don't use them the right way, if you don't use the security features they give you, for sure. I mean, the question that I think people need to ask themselves is: for the use case I have, is it really necessary to use a privileged container, or is it just that it makes life easier right now? So it's the security-versus-easy-setup kind of question that you need to ask yourself. And it's not a performance thing, right?
I mean, user namespaces don't really come with a performance hit. Even if you nest them a lot, and I've done the work to be able to, or if you set up a lot of complicated mappings, and you can specify all kinds of crazy mappings inside user namespaces, even then, if you do the actual performance comparison, the difference is just not relevant. So performance is not even an issue. If you're worried that user namespaces might come with a performance cost: no, not really. We're out of time, because it's a bit after six already, but we'll be around, so if you've got more questions, just come to the front and we'll answer them. Thank you.