Good morning, everyone. I'm Stéphane Graber. That's Christian Brauner. We work at Canonical on the LXD team. We're also behind the LXC and LXCFS projects and have been doing a fair amount of kernel work in the container space for a while. Today we're going to be looking at making containers safer: some of the things we've done in the past, what our use cases are, and also how to move forward and what still needs to happen to make them that much better.

Okay, so first of all, I'll see if I can move that a bit. Sorry. Maybe if I point it up, that might work better. Testing. Yeah, seems a bit better. Cool.

All right, so why do we care about safe containers? Not everyone does; clearly we do. In our case, we've been working on containers for over a decade. The LXC project, the original Linux containers project using the mainline kernel, has been around for over a decade. LXD itself has been around about five years now. Those two really focus on system containers, so our goal is to run unmodified Linux distributions. We're not trying to run single applications or particularly tailored workloads; we really do care about running a perfectly standard Linux distro as you'd find it anywhere else. For LXD specifically, we're very much focused on giving you the same kind of primitives and usability as a virtual machine. Once a user gets a shell inside that container, they don't really know whether they're in a container or in an actual virtual machine; it's not really visible to them. The only really visible thing is that it's using a shared kernel.

In such an environment we care quite deeply about security, because some people might actually want to give SSH access to those containers to different users, who might run untrusted binaries and do a lot of weird things. As a result, we've been focusing on security for both LXC and LXD for a long time. We're using pretty much every trick available: LXD defaults to unprivileged containers using user namespaces, we use LSMs, we use cgroups, we use fancy custom seccomp policies. Pretty much anything you can think of, we're doing it. And we're working pretty hard, both in user space and kernel space, to make it much easier for people to run whatever they normally run inside such unprivileged containers. Our goal is that privileged containers should just not be a thing; we want people to be able to use our containers just like they were virtual machines and not have to care about a neighboring container affecting them.

So as I said, we run unmodified distros, and we run quite a few of them. We support running your standard Debian, Ubuntu, CentOS, Fedora type things. We also support running Android and OpenWrt and some of the other less common Linuxes. We actually build a lot of those images; we've got them available on our image server, updated daily. Right now we build 18 different distros, 77 different releases, around 300 different images every day, because they're also built for six different architectures. We've got hundreds of thousands of active users that are using those images, creating new containers or updating existing ones, every day.

We also run on Chromebooks. LXD is used behind the Linux application feature on Chrome OS. It's available on all new Chromebooks at this point, and on a number of older Chromebooks as well, so long as they support the right features. The Chromebook setup specifically is interesting.
The way it's set up, they're using both virtual machines and containers. They start a per-user virtual machine using crosvm, and inside that virtual machine they've got an effectively read-only Linux system which runs LXD, with some persistent storage attached to that VM. LXD then runs unprivileged containers. They actually contributed code to our project to make it impossible to create privileged containers in any way, shape, or form. And it's very well integrated with the rest of the Chrome OS platform: they support GPU pass-through, USB pass-through, access to the sound card, sharing files, triggering snapshots and backups directly from the Chrome OS interface. Applications installed inside those containers just seamlessly integrate with the Chrome OS system.

So that's why we care. We've got all of those users on various distros, trying to run whatever they feel like. Any normal workload they would normally run inside a virtual machine, they just run inside a container and get much, much increased density.

But some of the words I just used might have been a bit confusing, so maybe we should take a step back and go through some of the terminology. I mentioned privileged containers and unprivileged containers. You might also have heard of rootless containers before; that's also a thing. The user space definition of some of those is also kind of quirky. Docker, for example, has a --privileged flag, which doesn't mean the container is unprivileged normally, in our sense; it just means it's even more privileged when you pass it. And the definition of what a container even is is always kind of odd. How many namespaces do you need to use for it to be an actual container? Do you need to use cgroups? Do you need to use LSMs? Do you need to use capabilities? There's no such thing as a container in the Linux kernel; it's purely a user space construct. So we can't just say, oh, you're running in a container, you're safe; but you also can't say, oh, you're running in a container, therefore you can't be safe. It really depends on how exactly you've set things up.

So, just to differentiate the two main kinds. First of all, you've got your privileged container. Privileged, at least in our definition, and I think everyone can agree on this one, means that root in the container is real root. There's no translation going on or anything: a process running in that container runs with real root privileges. That, sadly, is what the vast majority of containers out there use. Most Docker containers, and most containers running on Kubernetes, are privileged containers. The security story for those relies pretty much entirely on properly configured LSMs, capabilities, seccomp, and extra privilege dropping wherever possible, to try to prevent any of those workloads from either ever getting root access or, if they need to run as root, having as limited a root user as possible so they can't escape. It's very easy to create those containers — anyone can do it with just a few shell commands — but it's extremely dangerous if you've done it wrong, and people have. Our stance for LXC and LXD is that privileged containers are not root safe and cannot be root safe.
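To see the distinction concretely, a minimal sketch (an illustration, not something shipped by either project): from inside a container, /proc/self/uid_map tells you which kind you're in. The initial user namespace shows the identity mapping; an unprivileged container shows the translated host range.

```c
/*
 * Minimal sketch: read /proc/self/uid_map to see whether root here is
 * real root. In the initial user namespace the file holds the identity
 * mapping "0 0 4294967295"; in an unprivileged container the second
 * field is the host UID that container UIDs are translated to.
 */
#include <stdio.h>

int main(void)
{
	unsigned long inside, outside, count;
	FILE *f = fopen("/proc/self/uid_map", "re");

	if (!f || fscanf(f, "%lu %lu %lu", &inside, &outside, &count) != 3)
		return 1;
	fclose(f);

	if (inside == 0 && outside == 0 && count == 4294967295UL)
		printf("initial user namespace: root is real root\n");
	else
		printf("mapped: UID %lu here is UID %lu on the host (%lu IDs)\n",
		       inside, outside, count);
	return 0;
}
```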
For a number of years now — actually pretty much forever — we've effectively declined any report of a security issue against LXC or LXD that's specific to privileged containers. We just don't consider them root safe; anyone who assumes root safety there is going to have issues. We do a best effort to shape policies to block any potential holes we're aware of, but we also know that new holes pop up in Linux with pretty much every kernel release, and that it's a losing battle. That's why we've been focusing so much on the other class of containers: unprivileged containers.

Obviously, those mean that root in the container is not root outside the container. But then again, you've got a bit of variety there as well. Sure, you've got a mapped root user, so if you escape the container you might be UID 100000 with pretty limited privileges. But what happens if you've got more than one of those containers? Do you want them all mapped to the same UID? And if you do, and they can escape, are you happy with one container being able to access the other? Maybe not. So you can also have unprivileged containers with distinct ID maps, either for the root user itself or for the entirety of the container. In LXD we call that isolated mode. If you turn that on, every single container gets its own 65536 UIDs and GIDs. That way, if you can escape for some reason, at least there's literally nothing else running that you can trample on other than yourself. It also means there's no trivial DoS, which is not the case if you skip that step, because you might be able to set an rlimit on a specific UID inside your container, and that will very happily apply to the same UID in another container.

None of this is new. People always talk about, oh yeah, we might look into the user namespace at some point — but it's been around since the 3.12 kernel. We've had full support for it in LXC since that time; the first LXD release we pushed out was on an Ubuntu release shipping the 3.13 kernel. It's been around a while, we understand it, it can be used, and none of the semantics have really changed since. Yet we're pretty much the only ones using it.

One of the big issues for some of the other container runtimes is the file system aspect of this. As soon as you run distinct ID maps for every container, if a container sees a file coming from another container, it shows up as the overflow UID — effectively as -1 — and you might not be able to read or write that data anymore. That's fine for LXD containers because it's a limitation people understand: sure, they can't attach a shared volume to two isolated containers, but that's a limitation they can reason about. The problem is when you look at something like Docker, which uses layers. Those layers are stacked on top of each other and used as the root file system of different containers. That doesn't work, because the layers themselves include UIDs and GIDs; there's a single copy of them on the file system, they're layered, and then that's passed into the container. If you use user namespaces, everything inside the container shows up as the overflow UID. We've actually done some work pretty recently to address those concerns — not specifically for Docker, because we're not working on Docker, but for data sharing between isolated LXD containers.
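To make the isolated maps concrete, here's a rough sketch of how a privileged manager could hand each container a disjoint 65536-ID range by writing its uid_map and gid_map. The base offset and the per-container arithmetic are illustrative assumptions, not LXD's exact allocation logic.

```c
/*
 * Sketch: how a privileged container manager might give each container a
 * disjoint ID range. Container N gets 65536 UIDs/GIDs starting at
 * BASE + N * 65536. BASE and the allocation scheme are illustrative, not
 * LXD's exact policy. Must run with CAP_SETUID/CAP_SETGID in the parent
 * user namespace, before the container's init execs.
 */
#include <stdio.h>
#include <sys/types.h>

#define BASE  1000000UL
#define RANGE 65536UL

static int write_map(pid_t pid, const char *file, unsigned long host_base)
{
	char path[64], map[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/%s", (int)pid, file);
	/* Format is "<id-inside-ns> <id-on-host> <count>". */
	snprintf(map, sizeof(map), "0 %lu %lu", host_base, RANGE);

	f = fopen(path, "we");
	if (!f)
		return -1;
	if (fputs(map, f) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f);
}

int set_isolated_map(pid_t container_init, unsigned long container_index)
{
	unsigned long host_base = BASE + container_index * RANGE;

	if (write_map(container_init, "uid_map", host_base) < 0)
		return -1;
	return write_map(container_init, "gid_map", host_base);
}
```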
The work in question is shiftfs, which is effectively an overlay file system that shifts UIDs and GIDs across user namespace boundaries. It's obviously dangerous if you set it up wrong — do it on / of the host and, hey, you can do whatever you want — but it's very useful if you set it up right, as a way of sharing data between containers. It could also be used on top of a stack of layers to let unprivileged application containers work properly. Right now that work is a distro patch in the Ubuntu kernel, mostly because the current approach isn't considered suitable for upstream. The plan is to instead add the feature to the new mount API in the upstream kernel, which seems like a much better fit — but we needed something quick, so we did something quick.

Maybe the last thing to mention is rootless containers, another term you may have heard before. That normally refers to what we were earlier calling fully unprivileged containers: containers that are not only unprivileged in the sense that they use a user namespace, but also in the sense that they were spawned by an unprivileged user. That's mostly possible: you can unshare a user namespace and then set up the other namespaces as an unprivileged user. That's fine; we've done that and supported it for years now in LXC. The problem you usually hit is that if you want more than just your own UID as UID 0 inside the container, you need some setuid binary to set up more maps for you. You can use the ones that come with shadow — newuidmap and newgidmap — which use the /etc/subuid and /etc/subgid files to control that behavior. The other thing you'll run into is networking: do you want your own network interfaces? Because if you do, you can create them, but there's no way to bridge them on the host side or configure them on the host side as an unprivileged user, so you need another setuid binary for that. As a result, rootless containers are of somewhat limited use right now: you end up piling up a whole bunch of setuid binaries and helpers everywhere to do the pieces you need, whereas most container managers simply run as root on the host and spawn child containers that are unprivileged. If that's done right, and there's no API access or anything from the container back to the host, you don't really have a concern doing it that way either.

All right, I'm going to hand things over to Christian now.

Cool. Oh, that works. Sorry, I need to remember to keep this near my mouth. So Stéphane talked about privileged and unprivileged containers, and one of the questions you probably have — a lot of people have — is: are they really that unsafe? Yes, they really are. This is a list of pretty bad CVEs that we've had over time, and that's just against a single runtime, because as we said, we don't accept CVEs for privileged containers. And 2019 actually started off with a pretty bad one, I don't know who remembers: CVE-2019-5736, which was a bunch of things at the same time — arbitrary code execution, container breakout, privilege escalation, whatever you want it to be. One of the variants was essentially that you tricked the runtime binary into executing itself: it cached a file descriptor to that binary, and then the binary got overwritten through it.
So the next time the runtime binary was executed, well, whatever the container had written into that binary would run, and that could do anything you wanted. It was pretty bad. And there's a pretty interesting pattern to this. All of the attacks up here — you should double check; I'm claiming this right now and I did look at all of them, but if I missed something, I'm sorry — all of these CVEs would not have been possible if you had used unprivileged containers, that is, if you had used user namespaces. CVE-2019-5736 is not possible with unprivileged containers, and neither were the other ones. So this really matters. And, as Stéphane pointed out, we can't guarantee there aren't more exploits out there against privileged containers. I'm pretty sure that if Jann looked at the ptrace code closely enough, it would be easy to find a bunch of others too. (There actually was a recent one for ptrace as well.) Yeah, so there are a bunch of holes in there. So unprivileged containers matter, and we're at a point where you can actually use them for a lot of different stuff.

So, this is about making containers safer, and one thing that should be fairly trivial, but actually isn't, is making use of all the solutions that already exist — all the security features we currently have. You should use them. When you look at security issues, you often see that the problem is not so much that no mechanism existed that would have blocked the attack; it's that a feature wasn't used. For multiple reasons, I don't know — sometimes it's too complicated to understand, or not documented well enough. If it's a kernel feature, that's on us kernel developers for not documenting it sufficiently. But yes, one should definitely use them.

Namespaces. I'm not going to explain to you what a container is; I'll just give some general guidelines before diving into the new features. We have a bunch of namespaces — seven, I think — and too few of them are often used. If you look at the application container world, and especially at HPC workloads, oftentimes people say: I just want a mount namespace — and then they still run untrusted workloads in there. That basically means you have problems; it's trivial to break out of that. All of the namespaces usually bring some sort of security benefit — the network namespace, the user namespace, mount namespaces, and so on — each isolates some part of the system. The most obvious one is the user namespace. We keep coming back to it because it's the only namespace concerned with isolating privileges on a standard Unix system: capabilities are per user namespace, UID and GID translations, and so on. And sure, it's a clunky API — there can be no debate about that. Multiple people have pointed out that we should probably make it nicer somehow; it's just a matter of who does it and how. For example, there are issues with how you create namespaces at container setup time. I think there's an ordering issue between user namespaces and network namespaces — if I remember correctly and it hasn't been fixed, you can't just clone your container process with the user namespace and the network namespace flags at the same time, because then your network devices are not owned by UID 0 in the new user namespace. So you create the new user namespace first, and then you unshare the network namespace from inside it, and so on.
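A minimal sketch of that two-step ordering, assuming the single-clone case indeed still behaves as hedged above:

```c
/*
 * Sketch of the two-step ordering described above: create the user
 * namespace first, then unshare the network namespace from inside it, so
 * the new netns is owned by the new userns (and UID 0 in it holds
 * CAP_NET_ADMIN over it).
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	/* Step 1: new user namespace. This is the only namespace an
	 * unprivileged user can create on their own; ID maps still need
	 * to be written, e.g. via the setuid newuidmap/newgidmap helpers. */
	if (unshare(CLONE_NEWUSER) < 0) {
		perror("unshare(CLONE_NEWUSER)");
		return 1;
	}

	/* Step 2: unshare the network namespace from inside the new user
	 * namespace, rather than passing both flags to a single clone(). */
	if (unshare(CLONE_NEWNET) < 0) {
		perror("unshare(CLONE_NEWNET)");
		return 1;
	}

	printf("user and network namespaces created, in that order\n");
	return 0;
}
```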
That's obviously the sort of thing that probably should have been handled by the kernel, but it's been outsourced to user space, so it's something people have run into. Also, I think a big issue currently is that you can't atomically setns() to all the namespaces of a process; you have to do it iteratively, one namespace at a time. I have some ideas — which I think I proposed on the mailing list a while back — on how to actually make that work, and now we have the infrastructure to do it. But yeah: use all of the namespaces.

Obviously the two big topics are seccomp and LSMs. Seccomp is essential for privileged containers, because you can trivially break out of a container if you allow the wrong class of syscalls. open_by_handle_at() is a good example: it lets you traverse back to the host root if you're on the same device. That was the CVE in 2014 — well, no, originally it was a CVE against OpenVZ, so it's pretty old. Yes, there was originally a CVE against OpenVZ for that particular attack, and it later also affected Docker — that was the "shocker" exploit at the time. If the container wasn't on a dedicated mount, you could traverse back to the host, and if the parent was /, go modify its /etc/passwd, its /etc/shadow, whatever you wanted, because you're still root. Pretty bad. Right, so I'll just gloss over this: for privileged containers, seccomp is pretty essential, and you probably want to maintain a whitelist, not a blacklist — there are a bunch of syscalls that are safe, but you want to block more than you allow. For unprivileged containers, you can usually get along with a blacklist. We still use seccomp in unprivileged containers because it's nice for blocking legacy syscalls, for logging which syscalls a container has performed, for the syscall interception stuff — a huge feature I'll talk about in a bit — or to deal with broken user space. It's pretty nifty, actually. I'll come back to seccomp in a bit; there's also a minimal sketch of the blacklist style right after this passage.

And obviously, LSMs. I don't think I need to say much about LSMs at the Linux Security Summit, and there's going to be an update on them as well. They're essential, again, for privileged containers: there are a lot of files in procfs and sysfs that you definitely don't want even a privileged container to be able to touch. In unprivileged containers those are all blocked by virtue of user namespaces, but otherwise you need LSMs to block access to a bunch of files. The most frequently used are of course SELinux and AppArmor, and there are a bunch more: LoadPin, Tomoyo, Yama, Smack, and the new SafeSetID, which I'll mention in a bit. One can use them in unprivileged containers as well — more security — but there's no real need, I think.
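Here's the minimal blacklist-style sketch mentioned above, using libseccomp. The blocked syscalls are illustrative picks; a real container policy is far larger.

```c
/*
 * Minimal libseccomp sketch of the blacklist style described above:
 * allow everything by default, return EPERM for a few known-dangerous
 * syscalls. Real policies are far larger, and privileged containers
 * should use a whitelist instead. Build with: cc filter.c -lseccomp
 */
#include <errno.h>
#include <seccomp.h>
#include <stdio.h>

int main(void)
{
	scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW); /* default allow */
	if (!ctx)
		return 1;

	/* open_by_handle_at() enabled the "shocker"-style breakout. */
	seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
			 SCMP_SYS(open_by_handle_at), 0);
	/* No kexec or raw module loading from inside a container. */
	seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(kexec_load), 0);
	seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(init_module), 0);

	if (seccomp_load(ctx) < 0) {
		seccomp_release(ctx);
		return 1;
	}

	seccomp_release(ctx);
	printf("filter active for this process and its children\n");
	return 0;
}
```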
Yeah — so let's talk about a couple of new features that landed recently, and some that are planned, that hopefully make containers a little safer. One of the bigger things is the seccomp notifier: outsourcing to user space the decision of whether a syscall succeeds or not. That's a pretty big deal. I don't know how many people have heard of this; it landed in 5.0. Okay, a bunch of people have heard of it. The nice thing is that it allows running unprivileged containers with even fewer privileges — and it's pretty helpful for privileged containers too, to be honest. You can grant very fine-grained privileges. What the feature essentially is: when a task loads a seccomp filter, it can get a file descriptor to that filter, and it can send that file descriptor to a more privileged user space process. That file descriptor is pollable, so you get an event whenever a relevant syscall — one you registered in your seccomp filter — is hit, and you can read the actual syscall arguments; well, the integer arguments. The privileged user space process can then inspect those arguments. It can also inspect the memory of that syscall in a race-free way — I'm not going to go into detail on how that's possible — so it can do the work of parsing out paths and so on. And if it decides, okay, this syscall is safe to make, it performs the syscall in lieu of the container, meaning it does all the work the kernel would usually do. Which sounds really nice, but it's also problematic, because you need to assume a sufficient amount of the credentials of the task you're performing the syscall for, while at the same time not assuming the ones that would block you from doing so. That's pretty annoying, to be honest, but we think there might be a better solution — more on that in a moment.

We use it, for example, to intercept mknod. /dev/console, /dev/null, /dev/zero, /dev/random, and /dev/urandom are devices you're usually fine delegating to unprivileged containers as well, and container managers today just bind-mount them from the host. There's no fundamental reason to do it that way — though I get why the kernel doesn't want to maintain a list of devices that are safe to create. So you register a seccomp filter that says: if it's a mknod syscall for one of these devices, identified by device number, send me a notification. Then a privileged user space process — the container manager, LXD in this case — reads the syscall arguments, looks at the device number, and figures out: okay, this is /dev/null, fine, I'll create the device node for you. It assumes the credentials of the process it's acting for, does the mknod in lieu of it, and you're done. This is obviously a pretty powerful mechanism — you could intercept mount, you could intercept whatever you want. You just need to be very, very careful that you do it right and that you don't create devices you didn't mean to create. But yeah, it's a pretty nifty feature.

One note on safety: for mknod it's not an issue. Say I register a filter that says, give me all mknod syscalls indiscriminately, so I intercept both the mknods for devices I'd want to allow and a bunch I wouldn't. Intercepting all of them doesn't really matter, because mknod of a device node in a user namespace isn't possible anyway.
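A rough supervisor-side sketch of that flow, against the 5.0 notifier UAPI. The hard part — assuming the caller's credentials and actually creating the node — is deliberately elided; handle_one and notify_fd are illustrative names, and SYS_mknod is the x86_64 spelling.

```c
/*
 * Supervisor-side sketch of mknod interception with the seccomp notifier
 * (Linux 5.0+). Assumes notify_fd was obtained from the container: the
 * filter was loaded with SECCOMP_FILTER_FLAG_NEW_LISTENER and the fd was
 * passed over a unix socket.
 */
#include <errno.h>
#include <linux/seccomp.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

static int handle_one(int notify_fd)
{
	struct seccomp_notif req;
	struct seccomp_notif_resp resp;

	memset(&req, 0, sizeof(req));
	if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0)
		return -1;

	memset(&resp, 0, sizeof(resp));
	resp.id = req.id;

	/* For mknod(), args[0] is the path, args[1] the mode and
	 * args[2] the device number. */
	dev_t dev = (dev_t)req.data.args[2];

	if (req.data.nr == SYS_mknod && major(dev) == 1 && minor(dev) == 3) {
		/* /dev/null: here the manager would assume enough of the
		 * caller's credentials, create the node in lieu of it,
		 * and then report success. */
		resp.error = 0;
		resp.val = 0;
	} else {
		resp.error = -EPERM;	/* reject everything else */
	}

	return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
}
```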
There's no device — at least no interesting character or block device — that you can create from inside a user namespace, so if you intercept a bunch of mknod calls that would never have succeeded anyway, well, okay. (We actually hit that issue early on in LXC: we were intercepting all mknod calls, and that caused problems for things that were not character or block devices.) Exactly. But if you want to intercept syscalls that would partially succeed in unprivileged containers — succeed for some arguments but not for others — then you have a problem. You might intercept the ones you can perform in lieu of the container, the ones that would otherwise fail, but all the other ones, the ones that would normally succeed, you now have to perform as well, which is again pretty tricky in terms of assuming the right credentials and so on. So it would be really nice if we could somehow tell seccomp to just resume a given syscall: you intercept it, you inspect the arguments, and — as the user space process managing the container, you usually know whether a syscall will succeed — you say, please, kernel, go ahead with the syscall. I think I sent a mail about this to the ksummit-discuss mailing list; maybe not quite the right list, but it's a discussion to be had, because it's not trivial to do. It's something we really want. Exactly — and the point is no raised privileges: execution would just continue with the privileges of the original task, so you wouldn't need to muck around with credentials at all. That would be pretty helpful.

Another one is extended syscall filtering. That discussion has popped up quite a bit recently; I think it's even made it into a BPF thread that's currently going on. We have a bunch of syscalls that carry flag arguments, and those flags are usually passed in registers, so they're readily available for seccomp to filter — not hidden behind pointers. The traditional clone syscall and so on — pretty neat. But we also have a bunch that don't, and new syscalls may not want to pass flag arguments in registers at all, but rather inside pointer arguments such as structs. The most recent syscall we added was clone3, where the flags argument has moved into a dedicated argument struct — and we'd still like seccomp to be able to filter those flags. There's a dedicated discussion around this: Andy Lutomirski has made a proposal — you can follow the mailing list thread — on how to do it without BPF, without bringing unprivileged eBPF into the game; I think the common understanding is that we don't want unprivileged eBPF to happen just because of this feature. It's going to be a pretty difficult technical challenge, but it would be pretty neat to have.
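A small sketch of the clone3 problem just described: the flags sit in a struct behind a pointer, so a register-only seccomp filter never sees them. The local struct definition and the syscall number are fallbacks in case installed headers predate the syscall.

```c
/*
 * Sketch of why clone3 is awkward for classic seccomp: the CLONE_* flags
 * live in a struct behind a pointer, which a BPF filter cannot
 * dereference. The struct mirrors struct clone_args; 435 is the x86_64
 * syscall number.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_clone3
#define __NR_clone3 435
#endif

struct my_clone_args {		/* mirrors struct clone_args */
	uint64_t flags;		/* the flags now live here... */
	uint64_t pidfd;
	uint64_t child_tid;
	uint64_t parent_tid;
	uint64_t exit_signal;
	uint64_t stack;
	uint64_t stack_size;
	uint64_t tls;
};

int main(void)
{
	struct my_clone_args args = {
		/* ...so a register-only seccomp filter never sees them. */
		.flags = CLONE_NEWUSER,
		.exit_signal = SIGCHLD,
	};
	long pid;

	/* The filter sees only a pointer value and a size argument. */
	pid = syscall(__NR_clone3, &args, sizeof(args));
	if (pid == 0)
		_exit(0);	/* child */
	return pid > 0 ? 0 : 1;
}
```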
Okay — LSM stacking. I probably don't need to say a lot about this here; there's going to be an update on Wednesday. What we would like is this: if you have a Fedora host that runs SELinux with an SELinux policy, and you run an Ubuntu container that uses AppArmor, none of the AppArmor policies that usually confine applications on Ubuntu are currently usable, because — unless I'm wrong — you cannot stack AppArmor on SELinux or SELinux on AppArmor. You can currently stack minor LSMs with the major ones, so Tomoyo or LoadPin together with AppArmor or SELinux; that work has been done recently. But the ultimate goal, I think, is still to be able to stack the major LSMs on top of each other. That would be pretty neat, and it would unblock a lot of use cases.

Then SafeSetID, a security module I recently stumbled upon. It has been merged and will show up in 5.3. It comes from the Chrome OS folks and restricts UID transitions through the setuid family of syscalls, given a system-wide policy. It will probably be most useful for privileged containers, because you can limit a container to a limited range of UIDs and GIDs. I don't think we have a particular use case in mind for it ourselves right now.

And the mount API. This has been mentioned before; David Howells has been working on it quite a bit. The idea is to use file descriptors for mounting — for configuring and setting up mounts and so on — and to split one syscall that is heavily overloaded with multiple tasks into a bunch of syscalls. I think we have seven right now, but we need eight or nine. It's hopefully going to have a bunch of nice features, such as recursively applying mount options to a whole mount tree; at least David is working on that. It has anonymous mounts, a feature I've wanted for a long time: a mount that you have configured and set up, and that you can access files through, but that is not attached to any path in the file system. And it avoids numerous race conditions as well. You were thinking of a good example earlier, right? Well, the usual race conditions we get during attach. If you want to spawn a process inside a running container, you need to pull some information from it, and you might need access to /proc to rewrite your LSM label or do some other operations. But you can't trust /proc inside the mount namespace of the container, because root in the container might have mounted a tmpfs over it and pretended things look okay, or mounted a FUSE file system that fakes enough of /proc that you think you're writing an LSM label when you're not. (There was a CVE for this as well.) Yeah, there were a number of issues around that for a number of container managers. Exactly. The way we do it right now is with dirfds: we open /proc as a dirfd on the host, then we do the attach and perform everything relative to that dirfd. That kind of works, but it's a bit of a pain, and being able to just hold a file descriptor to a particular mount will make some of that much easier for us. One of the things David has told me he's going to work on, which I think is a pretty good idea, is being able to set up ID shifting — basically the shiftfs functionality mentioned before — at the mount API level. There's some discussion about whether that should be tied to a user namespace or not, but it would be pretty neat to be able to say: map these IDs for this mount.
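A sketch of that fd-based flow with fsopen(), fsconfig(), fsmount(), and move_mount(), including the detached "anonymous mount" stage. Raw x86_64 syscall numbers are used because libc wrappers don't exist yet.

```c
/*
 * Sketch of the fd-based mount API that landed in 5.2. The fsmount()
 * result is exactly the "anonymous mount" mentioned above: fully set up
 * and usable through the fd, but attached nowhere until move_mount().
 * Needs 5.2+ kernel headers for the FSOPEN, FSCONFIG and FSMOUNT
 * constants, and enough privilege over the current mount namespace.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	int fsfd, mfd;

	fsfd = syscall(430 /* __NR_fsopen */, "tmpfs", FSOPEN_CLOEXEC);
	if (fsfd < 0)
		return 1;

	/* Configure the superblock-to-be, one option at a time. */
	syscall(431 /* __NR_fsconfig */, fsfd, FSCONFIG_SET_STRING,
		"size", "1M", 0);
	syscall(431, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

	/* Detached ("anonymous") mount: reachable only via the fd. */
	mfd = syscall(432 /* __NR_fsmount */, fsfd, FSMOUNT_CLOEXEC, 0);
	if (mfd < 0)
		return 1;

	/* Only now does it appear at a path in the mount tree. */
	return syscall(429 /* __NR_move_mount */, mfd, "", AT_FDCWD, "/mnt",
		       MOVE_MOUNT_F_EMPTY_PATH) < 0;
}
```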
Another thing the current API lacks is setting the namespace of a mount. We often inject mounts into a container across mount namespaces, and there's a whole lot of trickery involved in actually getting that done; you can't just bind-mount something over, because of how mount namespaces work. So if you could say, given the right privileges, inject this mount into this container — into these namespaces — that would be pretty neat.

And one piece of work I should mention that I didn't put on a slide is by a good friend of mine, Aleksa. He's working on restricting path resolution on Linux by proposing a new syscall, openat2. That would be pretty neat. The idea is that you attach a set of restrictions to how a file descriptor is resolved, so it can't later be escalated to more privileges. For example, right now you can open a file descriptor read-only and then, through /proc and re-open trickery, reopen it read-write; this API would block you from doing that. You also get nice properties such as: if you have a directory file descriptor to the root directory of a container, you can never walk out of the container — resolution is always relative to that dirfd, which essentially forms the new root of your lookups. That's pretty exciting work. It's blocked on Al, let's put it like that. Hopefully it lands at some point soon; it will be a big security improvement as well.

A couple more things we've been working on, and I'll close with these. First the keyring stuff: keyring namespacing, which David has been working on as well. The ultimate goal is to make keyrings usable in unprivileged containers, so that network file systems can authenticate against a server with their own individual key and so on. A bunch of that infrastructure has landed in 5.3. I'm not sure it's completely usable at this point, but the ultimate goal is to get it working. By the way, this was also his reason for proposing containers as kernel objects at one point; apparently he has abandoned that idea.

And the last part is pidfds, which we're using for our container manager: file descriptors that refer to processes, eliminating a bunch of races that have existed on Linux for a very long time. There's been quite some work going on in this direction. Right now you can get a pidfd from clone() with CLONE_PIDFD, or from clone3() with that flag set. You can send signals to those file descriptors with pidfd_send_signal(). You can get a pidfd for an existing process with pidfd_open(). You can also poll a pidfd to get exit notifications, even for non-child processes, which is pretty handy. For us this matters because we spawn sub-daemons, and sub-daemons usually have pid files: we'd parse the pid out of the pid file and then send a signal to that process, and so on. That's all racy, obviously — if that process exits and the pid from the pid file gets reused by a respawned process, you're going to have a problem. So this hopefully eliminates a whole class of those problems, and there are more features we have planned around this API.
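A sketch of the pidfd pattern that replaces the racy pid-file dance, using pidfd_open(), pidfd_send_signal(), and poll(). The syscall numbers are x86_64 fallbacks for older headers.

```c
/*
 * Sketch of race-free signaling with pidfds, replacing the pid-file
 * pattern described above. pidfd_open() landed in 5.3,
 * pidfd_send_signal() in 5.1. The fd keeps referring to the same
 * process even if its numeric pid is later recycled.
 */
#define _GNU_SOURCE
#include <poll.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424
#endif

int main(int argc, char *argv[])
{
	if (argc < 2)
		return 1;

	int pidfd = syscall(__NR_pidfd_open, (pid_t)atoi(argv[1]), 0);
	if (pidfd < 0)
		return 1;

	/* Signal the process behind the fd: no pid-reuse race possible. */
	if (syscall(__NR_pidfd_send_signal, pidfd, SIGTERM, NULL, 0) < 0)
		return 1;

	/* Poll for exit; works for non-children too as of 5.3. */
	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
	poll(&pfd, 1, -1);
	return 0;
}
```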
And yeah, that's it for me. I think you want to do some closing words, Stéphane? Yeah, I was just going to say: despite the name of the talk, making containers safer, there's no single switch that just makes it happen. It definitely depends on what you're doing. What I would say is: try not to reinvent the wheel. A number of container managers out there have gone through the pain of all of these issues and have figured out ways of dealing with them. If you can, you can even use LXC as a library to do some of that work for you, so you don't risk hitting some of those annoying issues yourself. The one recommendation — and I think we've said it multiple times during this talk — is: do not use privileged containers. If you're not using the user namespace, you just can't make things safe. People should really come to understand that and move as far away as possible from any kind of privileged container, because these security issues will keep happening. There's really no way around it; it's not something we'll ever make safe, and there's no reason to stay there either. So yeah, that's it for us. I think we're about out of time, so I'm not sure we can really do questions at this stage. Otherwise, we still have a bunch of stickers up front if people want to come grab some afterwards.

Questions? The question is on the seccomp notify feature you're introducing: how deep can you go? Suppose there's an inode involved in the syscall — can you enumerate all the paths for that inode? Can you, I don't know, go to the superblock, get the device underneath, figure out whether that device is removable? How flexible is the user space analysis of the syscall?

It's pretty limited in that sense. What you get is the task ID and the pointers to the arguments; that's all user space gets directly. The thread itself is temporarily frozen by the kernel while you're processing, so user space can go and inspect the memory and some other properties of the process to do its analysis. You can read the pointer arguments by going through /proc/pid/mem, I think, and there's a cookie that comes with each syscall notification; you can check that the cookie is still valid — that the task is still alive and you're not operating on... Yeah, exactly. So you read the memory and do all the analysis you want, but it's pretty annoying, obviously: it's costly, it requires /proc, and so on. And the most important, most difficult part is really that if you're a privileged process performing operations on behalf of a less privileged task, you always have to assume the privileges of that less privileged task, and that's really problematic, because you need to be very, very sure that you've taken everything into account. Think about mknod: you need to make sure you respect the device cgroup, in case there's a device list, a device policy, that would block the task from creating that node. You probably also need to assume — like, if the process is chrooted, you need to attach to its mount namespace. Yeah, we had a lot of fun around chroot, cwd, and mount namespaces, getting all the right pieces in the right order so that we'd be at the right spot but would still have enough privileges to actually do something. We could have done it reasonably easily by also attaching to the user namespace, but then we'd effectively have lost our privileges and couldn't do the mknod anymore.
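Roughly what that race-free read looks like in code — read through /proc/&lt;pid&gt;/mem first, then check the cookie before trusting the data. read_syscall_arg is an illustrative helper name.

```c
/*
 * Sketch of the race-free argument read described above: pull the
 * tracee's pointer argument out of /proc/<pid>/mem, then re-validate
 * the notification cookie so we know the task didn't die and get
 * replaced by a recycled pid while we were reading. req is a
 * struct seccomp_notif already received from the notifier fd.
 */
#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

static ssize_t read_syscall_arg(int notify_fd, struct seccomp_notif *req,
				int arg, char *buf, size_t len)
{
	char path[64];
	ssize_t n;
	int memfd;

	snprintf(path, sizeof(path), "/proc/%u/mem", req->pid);
	memfd = open(path, O_RDONLY | O_CLOEXEC);
	if (memfd < 0)
		return -1;

	/* Read whatever the tracee's pointer argument points at. */
	n = pread(memfd, buf, len, (off_t)req->data.args[arg]);
	close(memfd);
	if (n < 0)
		return -1;

	/* Only trust the data if the cookie is still valid, i.e. the
	 * task is still alive and still blocked in this very syscall. */
	if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &req->id) < 0)
		return -1;

	return n;
}
```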
We even had other cases where a mount originating from inside the container's namespace is owned by root inside that namespace, and is therefore automatically marked nodev, so we can't actually create device nodes on it. Exactly — and there's a bunch of other stuff. As soon as a file system is mounted inside the new mount namespace, you can't just create the device node on the original underlying file system, because obviously something is mounted over it. And there's no way to create a usable device node on a file system mounted from inside a user namespace, because the SB_I_NODEV flag prevents that. The details don't matter; the point is that you end up needing to inject mounts into the container if you want the device to appear, and it all gets really messy. Still, it's a pretty powerful mechanism — it lets you get rid of fakeroot — but yeah.

Well, you encounter these issues with seccomp argument filtering because filtering flags is easy, but filtering pointers doesn't work — and those are exactly the same issues I encountered with Landlock. As you know, I use eBPF to try to do something with kernel pointers, to filter on an object, because in fact that's what you want to do. So it would be interesting to know whether you have some leads on filtering on an object without the eBPF stuff, like I do with Landlock — and whether you've ever thought about using or extending Landlock for your use case.

I think there have been a couple of proposals. Andy had an idea — what was it, mark a bunch of syscalls as filterable, right? But that's something Kees really didn't like. He agreed, yeah. Right now it appears to be a question of where to do it: we could either cache the arguments as they come in and test them in two places, or we could have that analysis happen deeper, at the LSM level, where you have a different view of what the arguments are. I don't know — you're familiar with this too. It's on the agenda for the kernel summit to try to nail something down. So, in essence: no concrete proposals yet. I tried to hide that fact, but no. Okay, thanks for the talk.