I'm Stéphane Graber, I work at Canonical, I'm the LXC and LXD project leader, and today we're going to be talking about system containers and kernel features.

So briefly, what are system containers? They are the oldest type of containers, really. They originated with BSD jails close to two decades ago, then Linux-VServer, Solaris Zones, then OpenVZ, LXC, and now LXD, which I'm working on. The main goal of system containers is to behave exactly like a physical system or a virtual machine. There are no special images or anything; you run a full, unmodified Linux distro inside your container and interact with it exactly as if it were a normal system. No virtualization is needed, because it's still a container; that's the whole point of containers.

Now, as for LXD itself: LXD is somewhat new. I keep saying new, but it's about three years old at this point. It's a container manager with a REST API that you can easily script, and it's got a nice, user-friendly command line. It's pretty fast, and it's secure by default. We use user namespaces for all of our containers unless you opt out of that, and you can even have per-container maps of UIDs and GIDs if you want to make them even safer. We use all of the available LSMs, so seccomp, AppArmor, and we use capabilities; we use pretty much every bit of kernel API that's available to make containers safe. And it's pretty scalable: you can use it locally for your two or three containers, or you can go full cluster and run 10,000 containers if you feel like it. It's the same API, same CLI, same user experience, and you can scale very easily. For those of you who've played with Chromebooks lately, they've got a new Linux apps feature. Well, that's LXD. LXD is shipping on all the Chromebooks these days, and it's used to run Debian containers directly on your Chromebook.

Now, for what LXD isn't. As I mentioned, it's not a virtualization technology. It does not use any CPU extensions for that; you can totally run LXD on a Raspberry Pi or whatever you feel like, and it works on just about every architecture out there. It is also not a fork of LXC. It is a Go daemon that uses the go-lxc bindings and liblxc under the hood to drive the container interactions and use all the nice container features, because doing that directly from Go is not always pleasant. And it's also not an application container manager, so we will not be running Docker containers with LXD, and we don't really have any intention of doing that anytime soon. You can totally install Docker inside an LXD container if you feel like it; that works just fine. But we really see application containers as a way of distributing a particular piece of software, whereas our focus is running an entire machine inside a container.

For the rest of the talk, we're going to be going through a bunch of new kernel features and other bits of interesting API we are using for system containers. I'll mostly be focusing on unprivileged containers, as I don't really recommend anyone run privileged containers in general. So when we say something can't be done or we need new kernel APIs, it usually means it cannot be done inside an unprivileged container. Yes, if you've got full root access within a privileged container, you can probably do it, but you can also probably break the entire system. That's something to keep in mind for the rest of this talk.
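Since the REST API came up: the usual way to script LXD from Go is the official client library, but even a raw HTTP call over the local unix socket works, because the API really is plain REST. Here's a minimal sketch, assuming a deb-style install where the socket lives at /var/lib/lxd/unix.socket (the snap puts it elsewhere); the socket path is the only assumption.

```go
// Minimal sketch: querying the LXD REST API directly over its local unix socket.
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
)

func main() {
	socket := "/var/lib/lxd/unix.socket" // assumption: adjust for snap installs

	client := &http.Client{
		Transport: &http.Transport{
			// Dial the unix socket instead of TCP; the host part of the URL is ignored.
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", socket)
			},
		},
	}

	// List containers; any other LXD API endpoint can be scripted the same way.
	resp, err := client.Get("http://lxd/1.0/containers")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```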
Now, the first thing I want to go through is devices. Why would you want devices attached to a container? Well, maybe you want a GPU. Maybe you want some USB device. Maybe you're doing HPC and you care about InfiniBand networking and RDMA. Maybe you need direct network access because you don't want to use bridging and all that stuff on your fancy 100-gigabit network card, for example. Or you just want access to some character or block device on the system: say, a USB serial link, some scientific equipment, or whatever.

Containers are a bit special in that regard. There's no such thing as a device namespace, so there's no nice way of attaching devices to a container. Containers can run udev, and system containers usually do, but they don't really get any uevents, which makes that somewhat pointless until we've got the new kernel API we're working on. Containers also cannot use devtmpfs, at least not in a very useful way, which means you need to pre-create all the device nodes a given container needs. That also gets funny when a container is running and you want to inject a new device into it, because, as I'll show you in a bit, you can't actually mknod anything useful inside an unprivileged container. So you need to use mount propagation tricks to propagate a device from the host into the running container.

So let's just show a few interesting things. The first thing I want to show is the whole mknod issue. I'm running a modern kernel, a 4.18 kernel, which has an interesting behavior. Interesting in that it broke a bunch of user space, but you can actually mknod. Oops, it already exists, so let's just delete it and create it again. I've mknod'd major 1, minor 3, the same numbers as /dev/null. So if we compare /dev/null with blah, they look kind of the same: both writable, and the major and minor numbers line up, so that should work. If I write to /dev/null, no problem whatsoever. If I try to write to blah, it doesn't work. That's the new behavior in the 4.18 kernel, which does let you mknod things, but also marks them in a way that makes them completely useless afterwards. That's slightly frustrating for any piece of software that tries to mknod and then, if that fails, do something sensible, because now it doesn't fail; it just fails to use the node later on. That's something to keep in mind. The old behavior was that mknod just wasn't allowed at all. So even though you can mknod these days, you can't mknod anything useful, so you might as well consider you don't have it.

As far as devices go, let's look at the GPU case. I've got a container. This container is kind of boring because I forgot to delete the device, sorry; it should have been empty. Let's make it empty: no /dev/dri nodes. Now let's say I want to pass in a GPU, and I want a specific one, so I'm actually going to give it a PCI address. Now we need to do the interesting logic of going through sysfs, figuring out what driver is tied to that particular address and what device nodes are tied to it. LXD does that. It uses a mount propagation trick to inject those devices into the container, and you get them, and you can actually make use of them. That's what I'm showing here, which is the Unigine Heaven benchmark running inside the container that's got the GPU access and access to the X server, and as you can see, it's running just fine. Let's close that and get back to this.
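To make the mknod demo concrete, here's a minimal sketch of the same steps, assuming it's run as root inside an unprivileged container on a 4.18-era kernel. The path is just for the demo; the point is that creating the node can succeed while actually opening or writing to it is where things fall over.

```go
// Sketch of the mknod demo: create a character device with /dev/null's
// major/minor numbers, then try to use it.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	path := "/tmp/blah" // hypothetical path for the demo node

	// Character device, major 1 / minor 3, i.e. the same numbers as /dev/null.
	dev := int(unix.Mkdev(1, 3))
	if err := unix.Mknod(path, unix.S_IFCHR|0666, dev); err != nil {
		// Older kernels refuse the mknod itself inside a user namespace.
		fmt.Println("mknod failed (old behaviour):", err)
		return
	}
	fmt.Println("mknod succeeded")

	// On newer kernels the node exists but is marked unusable, so this is
	// where the failure shows up instead.
	f, err := os.OpenFile(path, os.O_WRONLY, 0)
	if err != nil {
		fmt.Println("open failed (node marked unusable):", err)
		return
	}
	defer f.Close()
	if _, err := f.Write([]byte("test")); err != nil {
		fmt.Println("write failed (node marked unusable):", err)
	}
}
```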
So that's kind of where things are. But we also added a new kernel API recently that lets you inject uevents from user space into a particular container. So going forward, the idea is that LXD will, as it does today, listen to uevents; if they're relevant to a container, they can then be injected into that container, which means udev inside the container can react to them and do useful things. We've got people trying to run Kodi and X servers and whatnot inside containers, and they've got a bit of a problem when they plug in, say, a USB keyboard or mouse: that kind of thing just doesn't show up, and X just ignores it. You need to actually bounce X so that it notices something's been plugged in. With uevent injection, which was done by Christian Brauner on my team, we're going to be able to fix that. That's already in mainline; we just need to use it from user space.

Another thing we're looking into, well, that we need to deal with for system containers, is security modules. If you've got a full machine, you may want to protect your services. That means attaching AppArmor policies, seccomp policies, or SELinux policies directly to a bunch of services running in there. Sometimes the init system will do that for you, sometimes you do it on the side, whatever you feel like. That was a bit of a problem when we couldn't do it inside containers. We also had the issue, back in the day, where the host policies, at least for AppArmor, which is path-based, were actually leaking into the container. So if you had a policy for some binary on the host and the same binary existed in the container, the policy would just magically apply to it. It might be a completely different distro, and the profile might not be relevant at all. And when you do nesting as well, say you run Docker inside an LXD container, it kind of matters for Docker to be able to load its normal AppArmor profile, or seccomp policy, or whatever else.

I'll go into some more detail later, but that's been fixed, for AppArmor at least. It is now possible to load AppArmor profiles as an unprivileged user, effectively as root inside an unprivileged container, and have things namespaced in a way that gives the container its own set of profiles. The host profiles don't leak into the container, but the host policy still applies on top of whatever is loaded inside the container. It does get tricky for some other APIs. For example, anything based on eBPF is not suitable for that, because eBPF cannot be trusted for unprivileged users, mostly because eBPF can be used for timing attacks and effectively exploiting the Spectre bug. So ever since the Spectre/Meltdown mitigations, eBPF is no longer allowed for unprivileged users, and so it's not available to unprivileged containers anymore.

Now, for AppArmor, as I mentioned, it does support running inside containers, and that's done through AppArmor's internal namespacing and stacking support. You can create a namespace, effectively create a stack, and say that's your outer profile and that's your inner profile, and if that profile allows policy loading, then the container can load extra policies. That lets you load, unload, and list profiles, but there is one big limitation right now, which is that it's single level. You can do it in a container, but that container cannot then create a second level inside it. That's something that's being looked at, but there are a bunch of missing kernel LSM hooks that need to be sorted out first.
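Here's a rough sketch of the listening half of the uevent story: a NETLINK_KOBJECT_UEVENT socket subscribed to the kernel's uevent multicast group, which is roughly what a manager does on the host before deciding which events to forward. The injection half relies on the newer kernel interface mentioned above and isn't shown; running this typically requires appropriate privileges on the host.

```go
// Sketch: receive kernel uevents on a NETLINK_KOBJECT_UEVENT socket.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Raw netlink socket for kernel uevents.
	fd, err := unix.Socket(unix.AF_NETLINK, unix.SOCK_RAW, unix.NETLINK_KOBJECT_UEVENT)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)

	// Group 1 is the kernel uevent multicast group.
	addr := &unix.SockaddrNetlink{Family: unix.AF_NETLINK, Groups: 1}
	if err := unix.Bind(fd, addr); err != nil {
		panic(err)
	}

	buf := make([]byte, 64*1024)
	for {
		n, _, err := unix.Recvfrom(fd, buf, 0)
		if err != nil {
			panic(err)
		}
		// Each uevent is a set of NUL-separated KEY=VALUE strings.
		fmt.Printf("uevent: %q\n", string(buf[:n]))
	}
}
```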
But I can show you that part already. Let's get out of that container. So for AppArmor, I've got a basic container. Let's install something that's confined; there's that convenient hello-world snap, which, if it feels like installing, comes with a convenient AppArmor profile and a test. If we look now, AppArmor's status shows that the profile has been loaded, actually a bunch of profiles for the different subcommands in there. If we run the command itself, it's fine. If we run the evil subcommand, which tries to do something it's not supposed to be able to do, it is not able to do it. And if we go and grep the logs, we should see, yeah, the bottom one shows the denial by AppArmor, preventing it from writing to a path it was not supposed to. So that's AppArmor stacking.

Now, the real thing we want to get to is what's called LSM stacking and namespacing. With that, instead of a per-LSM solution like the AppArmor-specific one, all the major LSMs should be able to stack and run at the same time. You should be able to boot a system with both SELinux and AppArmor enabled at the same time. The idea is that you'd set the display LSM for the system, so the main LSM might be SELinux, and then when you start a container, set that container's display LSM to AppArmor if it's Debian, Ubuntu, or whatever else uses AppArmor. Inside there, they can interact with AppArmor and pretty much never know that SELinux is even a thing on the system. The SELinux host policy will still apply, so any access actually goes through the entire stack and gets validated by both. That's work that's been going on for a few years now by Casey Schaufler and John Johansen. There are patches that do work; we're still pretty far from them being merged, but that's where we're headed, and it's going to be pretty neat, because it will let us run Ubuntu and Debian containers on CentOS with both the host and the containers fully secured. Similarly, we'll be able to run Android or CentOS containers on Ubuntu and have SELinux running inside the container. So that's going to be pretty darn neat.

Now, another interesting topic is file capabilities. You may know that in some distros, things like ping, mtr, and some privileged helper tools actually use file capabilities instead of setuid, because it's much more granular and, in general, a better idea. We had a bit of a problem with containers in that it was not possible for an unprivileged user, so root inside an unprivileged container, to set a file capability. The main issue being that if you could do that, you would be able to exploit it from outside the container, so it was considered bad and therefore blocked. The v3 file capabilities support that was merged a few kernel releases ago changes that: as part of the capability record on disk, it stores what the root UID was, which lets the kernel know when to actually honor the capability.

We can see an example of that if I go into a CentOS container, if I can type, there we go, and install httpd. So I'm installing httpd, if the network wants to cooperate. Well, it failed, but then it seems to still work, okay, fine. There we go. All right, so httpd is installed. That used to fail miserably: on a kernel that doesn't support v3 caps, the unpack would just fail because cpio wouldn't be able to set the capability.
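As a small illustration of where those file capabilities actually live: they're stored in the security.capability extended attribute on the file, and the v3 on-disk format adds a root UID field so the kernel knows which user namespace the capability was set from. A minimal sketch that just dumps the raw attribute; the path is an assumption, pick any binary that carries file caps on your distro.

```go
// Sketch: read the "security.capability" xattr where file capabilities are stored.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	path := "/usr/bin/ping" // assumption: a binary that commonly carries file caps

	buf := make([]byte, 256)
	n, err := unix.Getxattr(path, "security.capability", buf)
	if err != nil {
		fmt.Fprintln(os.Stderr, "no file capability xattr:", err)
		return
	}

	// The first word encodes the magic/revision; v3 payloads additionally
	// carry the namespace root UID at the end of the structure.
	fmt.Printf("security.capability (%d bytes): % x\n", n, buf[:n])
}
```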
Now, if we check, that one file that ships with the package has got two capabilities set on it, and that works exactly as expected. And if we look, yep, Apache works just fine.

Another thing we've had issues with in the past is mounting stuff inside containers. That's been a bit of a recurring problem. Some people do want that, for things like loop-mounted files, or mounting SquashFS images, or mounting network storage, or even passing in some networked block device and wanting to mount it in the container. It is not supported in general because it's a very bad idea from a security point of view: the kernel has to parse a block device the user has complete control over, and you can then exploit very interesting kernel bugs and do a bunch of nasty things. In the case of loop devices, there's also the issue of being able to keep modifying the device after it's been mounted and confuse the kernel even more. So, yeah, that's a bit of a problem, and we do not expect file systems to really fix that, but there are some ways out of it. For virtual file systems, it's usually pretty safe, so we should be able to make things like NFS work just fine, in theory; NFS is a bit of a weird beast sometimes. One thing that has been done is FUSE: we can actually mount anything that FUSE supports. We can see that here. I've got a container, that container's got a SquashFS image, and I can mount it and it works just fine. That's the unprivileged FUSE support that was merged, I think in 4.17 or 4.18, thereabouts. We had it in Ubuntu for a long time, but upstreaming it took a while.

So that's one way of doing things. The other thing we're working on is the issue of UID and GID maps, which is a bit of a problem with mounts in general. In our case we're dealing with system containers, so we've got a full root file system per container; we don't have the whole issue of read-only images and all that stuff. But we still have the issue of containers with different maps wanting to share data between them. That's a bit of a problem in general, and right now there's no good solution for it. You can try POSIX ACLs, and that kind of works, but it's very confusing for people. And for the root file system itself, it means that when we create the container we've got to shift it, which means going through every single file in there, changing the owner UID and GID, changing any POSIX ACLs, and changing any file system capabilities. Not very fun. It's fine, it works, we've done it for years now, but we want something faster.

And that's what shiftfs gets us. It's in progress; it was written originally by James Bottomley, and I've got Seth Forshee on my team actively working on fixing a bunch of remaining issues with it. shiftfs lets you take a directory that's not mapped and tell the kernel, please mount it over there but apply this map. And you can do that multiple times for different containers with different maps, and they will all see it with their own UIDs and GIDs, and things just work. So that's pretty interesting, and hopefully we'll be there sometime next year.
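For the "shift everything" path described above, the slow fallback when shiftfs isn't available, a sketch might look like the following. It only remaps plain ownership; a real implementation like LXD's also has to rewrite POSIX ACLs and file capabilities, and the rootfs path and UID/GID offset here are just assumptions for illustration.

```go
// Sketch of shifting a rootfs: walk the tree and offset every file's UID/GID
// into the container's map.
package main

import (
	"os"
	"path/filepath"
	"syscall"
)

func shiftTree(root string, offset int) error {
	return filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		st, ok := info.Sys().(*syscall.Stat_t)
		if !ok {
			return nil
		}
		// Lchown so symlinks themselves are remapped rather than their targets.
		return os.Lchown(path, int(st.Uid)+offset, int(st.Gid)+offset)
	})
}

func main() {
	// Assumption: rootfs path and a map starting at host UID/GID 100000.
	if err := shiftTree("/var/lib/lxd/containers/c1/rootfs", 100000); err != nil {
		panic(err)
	}
}
```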
Now, the other thing we're working on is: what if you trust your users? I mean, that might be a thing. Right now there's no way in the kernel to allow those mounts; it's just not possible. But with work being done right now by Tycho Andersen, we can have seccomp intercept system calls and hand them to user space. User space can then perform whatever is needed as real root, which lets us catch mount, for example, compare the arguments against a whitelist we've got, and, if we decide we actually trust this one container and we're fine with that file system being mounted from that location, perform the action as real root and move on; you've just performed the mount and things just work. So we're pretty excited about this particular feature. Mount is one of the things we want to use it for, but there are a bunch more use cases: we do want to let you mknod things like /dev/null and not have them be useless, and that same feature will let us do that.

Another thing we're working on, well, have worked on in the past, is limits, which is something containers kind of need. I'm just going to demo it instead of talking for long, since we're running a bit short on time. So here, my laptop's uptime is two days. If I go inside the container, we can see that the uptime is a few minutes, so we've got the actual uptime of the container. We can see that the container right now reports four CPUs and 16 gigs of RAM, but we can change that: set CPU, set memory, and go through again. Uptime hasn't changed, we've not restarted the container, and now we've got two CPUs and one gig of RAM. The limits are applied through cgroups, no big surprise there, but cgroups are not reflected in the proc files. So we're using LXCFS, which is a FUSE file system we wrote a while back that's mounted on top of those proc files and gives you the output you would expect based on the limits applied to the container.

Now, cgroup v2 is obviously something we're looking at. We're currently missing a few things in it before we can switch. One of those is the freezer cgroup, or an equivalent of it: we do need to occasionally freeze all the tasks in a container in a reliable way, so that's something we still need to resolve. We also need to figure out a nicer way of managing device filters, because that's another thing we need to do, at least for unprivileged containers, and the BPF API is a bit tricky to deal with for that. Overall, cgroup v2 will get us less overhead, it's safer to use, and it's more suited for containers. We do still have a bit of an issue with legacy workloads: if you run a system container whose init system configures cgroups but doesn't know what cgroup v2 is, on a cgroup-v2-only system it's going to have a very bad time. So we need to figure out whether we can delay things long enough that it's not a case we need to care about, because those containers will be end of life, or whether we need to do some FUSE trickery to fake a cgroup v1 file system on top of cgroup v2, which we've done before. LXCFS does support faking an entire cgroup tree already, so it's not that much of a stretch for us to do that. But still, we would like to avoid doing it if at all possible.
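And since the limits demo is really just cgroups underneath, here's a minimal sketch of what applying a memory limit boils down to, assuming the cgroup v1 memory controller and a container cgroup under /sys/fs/cgroup/memory/lxc/<name>; both the path layout and the container name are assumptions. LXCFS then takes care of making /proc/meminfo inside the container reflect the new limit.

```go
// Sketch: apply a memory limit by writing to the container's cgroup v1
// memory controller file.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func setMemoryLimit(container string, bytes int64) error {
	// Assumed path: cgroup v1 memory controller, containers under "lxc/".
	path := filepath.Join("/sys/fs/cgroup/memory/lxc", container, "memory.limit_in_bytes")
	return os.WriteFile(path, []byte(fmt.Sprintf("%d", bytes)), 0644)
}

func main() {
	// Limit the hypothetical container "c1" to 1 GiB, as in the demo.
	if err := setMemoryLimit("c1", 1<<30); err != nil {
		panic(err)
	}
}
```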
And lastly, because I'm actually running out of time, another thing that's always kind of exciting is checkpoint/restore, which lets you do live migration and rollback of the state of processes inside a container. It's a very, very complex problem, because you need to serialize everything to disk, which is, yeah, a bit of a pain. It's also the biggest game of whack-a-mole in kernel town, because every time someone implements a new kernel API, they break checkpoint/restore and need to figure out a way of getting that state out of the kernel and into a file so it can be recreated. So that's a lot of fun.

Rather than go into more detail, I can just show you what it looks like when it works, if it works. So let me switch screens there, all right. I've got a container running now, and I can do a stateful stop. Please work... yes. So that worked; that container is now gone, it's not running anymore. I could now reboot my system to apply a kernel update or whatever, and then just start it again, and it's restored. If I do it again, I can show you the mess it creates on the file system. That's what happens: every single process gets a file with a dump of various kernel structures and whatnot, and when you start the container, everything is read back and all the processes are recreated. That lets you do live migration, because you can move all of that to another machine and restart the processes there. It also lets you do things like, say you've got an IRC bouncer or something where you don't like losing state: you could totally do that to your container, restart the system with a new kernel, and restore it. If you do it quickly enough, your TCP connections might not even have time to time out. The problem is that, as I said, it's the biggest game of whack-a-mole, and it only works with extremely simple workloads or very, very specific workloads you've tested beforehand.

And with that, it's the end. I don't think I actually have time for questions, so if you've got any, catch me afterwards. We've got stickers here and on the table downstairs if you want those. Thank you very much.