Okay, folks, we'll start on the final talk for the afternoon. In keeping with what seems to be the security theme of the year, containers, Stéphane Graber and Tycho Andersen from Canonical will be giving a talk on the way to safe containers. Thanks.

Thank you. Right. So, I'm Stéphane Graber, I work at Canonical as the technical lead for LXD, which is our current main container project, but I'm also the project leader for LXC, LXD, LXCFS, all of those things. I've got Tycho right here, who works on my team and mostly does the crazy checkpoint/restore stuff that some people have mentioned earlier today, plus other LXD work.

Right. So briefly, for anyone who doesn't know: what's LXD? It's a container management tool, built on top of LXC. It's not some fancy new from-scratch thing; we use the LXC library. It's designed to be quite simple, yet quite comprehensive in what it covers. If you've ever used the LXC command line tools and their configuration, it wasn't particularly obvious how to do things; it's massively better with LXD. It also runs as a daemon, as root, which gives us a bunch of nice features compared to the per-user, fork-and-exec model plain LXC used to have. It offers a REST API, which is great for scripting, and the new command line tool, as I mentioned, is much nicer. I don't have to tell you why containers are faster than virtual machines; you know that already. And we consider it to be safe, which is basically the subject of this talk: we use just about every single kernel feature we can get our hands on to make things safe. That includes the user namespace, which is the main one, but we also use all the others, and I'll go through those in a bit. And it's very scalable, in the sense that it works just as well on your laptop as it does on 50 nodes, so with the exact same tools you can test on a single machine, go to a small business kind of setup, or even use an OpenStack plugin and go to thousands of compute nodes as part of an existing OpenStack deployment.

So that's what deployments look like: usually you have a bunch of physical hosts, with the Linux kernel on there. If you care about live migration in any way, you want the same Linux kernel and vaguely the same hardware underneath, otherwise things might go pretty badly. Then you've got LXC, as in the LXC library, and LXD sits on top of that, using the LXC library to create containers and manage all that stuff. Then you get the REST API, and a bunch of clients on top of it: we write the command line tool, we also write the OpenStack plugin called nova-lxd, and you can just write your own or even use curl, as I've occasionally been demonstrating in talks (a quick example of that follows below).

So while it feels a lot like VM technology, it's container based. Also, we focus on full system containers, which is what James was describing earlier: all of our containers have a PID 1, they're usually a clean distro image, and we support most of the Linux distributions inside our containers. And that's pretty much it for what it is.
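[Editor's note: as a quick illustration of that REST API, here is roughly what the curl approach looks like. Endpoint paths are as in LXD 2.0; the socket path below is the usual Ubuntu location, so adjust as needed.]

    # Ask the local LXD daemon about itself over its Unix socket
    # (curl 7.40+ for --unix-socket; the "lxd" hostname is a dummy):
    curl --unix-socket /var/lib/lxd/unix.socket lxd/1.0
    # List the containers it manages:
    curl --unix-socket /var/lib/lxd/unix.socket lxd/1.0/containers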
We don't have any interest in application containers; we're perfectly fine with people using Docker or Rocket. In fact, we even support running Docker inside an unprivileged LXC container: if you've got some kind of concern there, or if you've got problems with the UIDs and GIDs inside images, you can just use an unprivileged LXD container, then run Docker in privileged mode inside it, and that's fine.

So, security bits: what do we use? The past two speakers have gone through all the namespaces, so I don't have to do it again; we use all of them. We've been spending a bit of development time over the past year on the cgroup namespace, which was originally done by Google for the cgroup v2 unified hierarchy, but we needed it to work for cgroup v1, so Serge Hallyn did that for us, and we've been using it ever since. As far as LSMs, LXC itself supports both SELinux and AppArmor; LXD itself only cares about AppArmor right now, because that's what ships with Ubuntu, and we haven't had contributions from other distros to support something else. So we generate custom AppArmor profiles and that kind of stuff for every container. We do drop some capabilities, but again, we're running full system containers, not a tiny app thing, so we can't really know what the application inside a container is going to be, and we really don't care. So we only drop capabilities that you will never need: things like CAP_MAC_OVERRIDE, CAP_MAC_ADMIN, CAP_SYS_TIME, CAP_SYS_MODULE, that kind of thing. Everything else we basically need to keep; we need CAP_SYS_ADMIN in there, we need CAP_NET_ADMIN in there, otherwise your container just won't boot. And we use cgroups, and a fair amount of this talk is going to be about cgroups, why they're great and why they're really bad, and hopefully what we can do to try and make them better.

So, on the cgroup side: resource limits. That's something I spent quite a bit of time on over the past year in LXD. We needed a user-friendly way of defining what kind of limits you want on your containers and having them applied. What we do support is things like CPU limits: you can give the number of cores you want, and LXD does the load balancing for you and does the CPU pinning with the cpuset controller. Or you can limit CPU time, either as a percentage, which translates to CPU shares, or as a time budget, say 10 milliseconds out of every 150 milliseconds, which is then set as a CFS quota through the CPU controller. For memory, we support the usual set. We don't expose things like kernel memory limits directly to our users, because nobody can figure out what the right value for those is. But for the normal memory used by user space applications, we let you set quotas either as a percentage or as a fixed amount, and you can turn swapping on and off per container. If swapping is on, you can even set a swap priority to specify which container gets swapped out first. We do that through the memory cgroup controller, using swappiness and the limit_in_bytes kind of knobs. (A sketch of what this looks like on the command line follows.)
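[Editor's note: for illustration, here is roughly what those limits look like with LXD 2.0's configuration keys. The container name c1 is made up; all of these apply live to a running container.]

    # Give the container 3 CPUs; LXD handles pinning/balancing (cpuset):
    lxc config set c1 limits.cpu 3
    # Or a time budget: 10ms out of every 150ms (set as a CFS quota):
    lxc config set c1 limits.cpu.allowance 10ms/150ms
    # A fixed memory quota, with swap turned off for this container:
    lxc config set c1 limits.memory 256MB
    lxc config set c1 limits.memory.swap false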
We also do disk space. That one isn't actually tied to cgroups or anything like that; basically, if your file system supports quotas in a way that's vaguely useful for containers, we'll set them for you. Right now that means btrfs and ZFS, because we can use a subvolume or dataset per container and set quotas on those. Everything else, not so much. If you use the LVM plus XFS storage backend in LXD, in theory we could do quotas, but shrinking a live file system doesn't work so well, so the user experience isn't great and we haven't spent much time on it. We also do network I/O limits; those basically end up setting qdisc and tc kind of rules to slow down incoming and outgoing network traffic for containers. Not particularly interesting either. And for CPU, for block I/O and for network, we also let you set an overall priority, so if your host is under load you can choose which containers are going to win as far as the scheduler is concerned. That's pretty useful: it's a simple score the user can set from 1 to 10, and we set the right magic underneath.

The last one, and as far as giving random people access to containers the most important one, is kernel resources. Right now, and this is a bit annoying, but hopefully we can make some improvements there, the only cgroup related to that is the PID cgroup, which was introduced reasonably recently, in 4.2 or something like that. It's great for preventing a straightforward fork bomb, not so great if you're trying to stop someone running the kernel out of any of the other shared resources. And there are a bunch of those that we know of, which people have usually had to work around, because you can actually hit them by accident, not just when someone's trying to attack you.

Inotify handles have been a bit of a problem, because you only get 512 per user, which basically means 512 to share with all your containers. That turns out not to be quite enough when you've got an init system that really likes to use them, like systemd. We were getting to a point where you would only be able to start about 15 containers before running out of inotify handles. Well-written software should fall back to polling; systemd doesn't, it just fails, so you have init running but nothing underneath it. So most people have to bump that limit right now. This one is actually being worked on: I believe there's a patch set to tie those to the user namespace, because they don't have a real reason for being a global limit; it's more of a safety net than anything else. Tying them to the user namespace will give us that kind of limit per container instead, and that's going to solve that one.

There are a bunch of network tables with the same kind of problem. As I was saying in a talk yesterday, I run a security contest, a CTF kind of event, every year in Montreal. Part of that involves running a really unreasonable number of containers, 15,000 or so, to simulate the internet, and we end up with about 3.3 million routing table entries or so, which doesn't fit in the default limits. It means an unprivileged user, a complete nobody user on the system, can completely fill the neighbour table on the host, to the point where you cannot open a new IPv6 socket. And there are some other network-related tables that are similarly shared, not tied to a network namespace or to a particular interface, which you can similarly fill completely, to the point where the system doesn't work so well. So definitely some room for improvement there.

PTYs are kind of similar: there's a global limit on the system, and if people use more than that, you're going to run out and things won't work so well anymore. (The sketch below shows the kind of global knobs people end up bumping today as a workaround.)
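[Editor's note: for reference, these are the kinds of global knobs administrators end up bumping today to work around the shared limits described above. Defaults vary by kernel; the values below are just examples.]

    # More inotify instances per user (a handful of systemd-based
    # containers exhausts the default):
    sysctl fs.inotify.max_user_instances=1024
    # Bigger IPv4/IPv6 neighbour tables, so one container can't fill them:
    sysctl net.ipv4.neigh.default.gc_thresh3=8192
    sysctl net.ipv6.neigh.default.gc_thresh3=8192
    # Raise the global PTY limit:
    sysctl kernel.pty.max=8192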
Ulimits are also a bit problematic right now. Unless you use a different UID map for every container, which, if you want to be vaguely POSIX-compliant, means 65536 UIDs and GIDs per container, you have the problem that when a container uses resources as one of the UIDs inside it, those count towards the limit of the container next to it, to the point where you can run out of file descriptors and that kind of thing, despite never having used those FDs in your own container, just because someone in another container happens to be using the same UID you do. One of the main problems we ran into was related to Avahi: you could basically only ever run a single container, because it was hitting its rlimit and then nothing worked. It would be nice if we could get those namespaced somehow, but it's not completely obvious what we could tie them to.

So those are the main issues we've got. Everything else, on a recent, fully up-to-date kernel and so on, there are no ways of escaping the container that we are aware of. But you can still DoS the host, which is a bit problematic when you consider giving root access to random people on that kind of system.

I said I would also briefly mention what we use as far as security goes. Right now the user namespace is our main security mechanism; we use it by default for our containers. You can turn it off if you really want to, but we strongly recommend you don't. On top of that, we have an AppArmor profile which blocks some things: it basically tags every container with its own profile, and then prevents cross-profile signal sending and ptracing and all that stuff. So it's basically a safety net: in case you do actually manage to escape your container, you'll still have that kind of confinement applied to you. We also block a couple of syscalls by default, using a seccomp blacklist for the really bad syscalls you should never have access to. But because we run a full distro in there, we have no idea what people are going to be using, so we can't go the seccomp whitelist way, which is what we would like to be doing. (A sketch of what such a blacklist policy file looks like follows.)
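[Editor's note: to give an idea, here is a sketch of a version-2 LXC seccomp blacklist policy of the kind described, wired up through the lxc.seccomp configuration key. The exact set of syscalls LXD blocks may differ; treat this list as illustrative.]

    2
    blacklist
    [all]
    kexec_load errno 1
    open_by_handle_at errno 1
    init_module errno 1
    finit_module errno 1
    delete_module errno 1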
And now, over to Tycho.

All right, cool. Hi. So one of the things that LXD does is checkpoint/restore, and in doing that you run across a lot of various oddities. I'm just going to go through one of them that maybe somebody in this room knows how to fix. I don't think there's a security problem here, but somebody who knows better can look at it. Basically, if you're trying to write to sysctls, the experience you get from user space is kind of strange. For the sysctls that control net-namespace-related knobs, it's whichever task opened the sysctl file that matters: when you write to that file, it changes the values for that task's network namespace. But for the IPC and UTS namespaces, it's whichever task writes to the file that the change applies to. The problem is that nobody can open the IPC sysctls besides real root.

So what you end up doing, when you're trying to reconstruct the state of a process tree that is your container as you migrate it, is you have this daemon that opens the file and sends the FD across, and then a task in the container's namespace writes to that IPC or UTS sysctl. Which is kind of awkward, because you could actually just do that with sethostname() anyway. So it just doesn't all quite line up. I don't know, maybe somebody knows how to fix this. I have a few other examples like this, but that's one I thought I'd mention before going into more stuff.

So, checkpoint/restore is sort of the antithesis of security, because you need to do a lot of things that are privileged operations. One of the reasons Kees is laughing as I switch to this slide: last year at this conference, at this time, I was implementing an option called PTRACE_O_SUSPEND_SECCOMP. Basically, when you try to checkpoint a container, since not all of a process's state is visible from the outside world (even as a root process, you can't look at some particular values for a process), the checkpointing tool injects some code into that process's address space and runs it, so it can scrape all that information and send it back out. The problem with this, of course, is that if you've used Minijail or some other container tool to apply a seccomp policy that potentially prevents open or read or whatever, then that task is immediately killed when you suspend it and inject this little bit of code to scrape its state. So what you need to do is temporarily disable all of these security mechanisms so that you can inject the code and scrape the state, which is why everybody's laughing. So we have this for seccomp, and I've been kicking the same problem down the road for the other security features, but recently we had a report in the wild of another application's AppArmor policy blocking things. So we need a similar option to temporarily turn off the LSM if we want to be able to checkpoint and restore these kinds of things. So please don't get mad if I send a patch that does something like this in the future; that's why I'm doing it.

So what does this little bit of injected code need to do? One of the things it needs is an open file handle to a valid /proc, so it can inspect things from inside the container. In the worst case, if the /proc in the container doesn't match the PID namespace it's in, you need to mount proc. Of course, if the container is unprivileged, you can't do that anyway, so there are some cases you just can't handle, but that's the worst case. We also need to create and connect to Unix sockets so we can send all this information back out of the container. So there are some things that are mostly reasonable that you might want to do from this little blob of injected code. And this last bullet point is Kees, when I think we were finally close to agreeing on this; this is what he said: "This feature gives me the creeps." So I'm sorry, but it's pretty cool. The end result, not the middle part.

And there's some other work, also in the security realm, to allow checkpoint and restore of nested namespaces; in particular, for nested user namespaces you need a tree to be able to inspect what that hierarchy looks like. That's also in progress by some folks at Virtuozzo.
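[Editor's note: the checkpoint/restore engine underneath all of this is CRIU. Outside of LXD, a minimal dump/restore cycle looks roughly like the following; the paths and target PID are made up.]

    # Checkpoint the process tree rooted at $PID into an images directory:
    mkdir -p /tmp/ckpt
    criu dump -t $PID -D /tmp/ckpt --shell-job
    # Restore it later, possibly on another machine with a matching kernel:
    criu restore -D /tmp/ckpt --shell-job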
But that's all I have, so please don't kill me.

So, since we have a bit of time, let's go with demos. A few different things; I need to switch to the right terminal over there. There we go. Okay, so first thing, I've got an LXD system right here. You can see it's running a bunch of containers, different distros: Alpine Linux, CentOS, Debian, a demo container that's Ubuntu-based, I think, and a streamer container that we'll play with a bit later. If I look at that particular machine, I've got 18 gigs of RAM in there, and some amount of disk space, 220 gigs or so. All of those are unprivileged containers; I can prove that real quick if we go look at the processes. Come on... there we go. We see they're all running as UID 100000, shown as a bunch of zeros because the column is truncating the first digit, apparently. But they're all running as 100000 in there. And if I enter one of those containers, we see the ZFS pool I'm using right there, 50 gigs or so, which is different from the host's numbers, apparently. And I've got all the memory of the host visible, and no kind of limit, so if I were to run, say, a fork bomb in there, it would be kind of bad right now.

But we can easily fix that. We've got a bunch of configuration keys; that's the abstraction we built on top of cgroups, because people can't figure out cgroups. Let's start with the first interesting one: if I look at the CPU info, we see we've got eight CPUs. Now if I set limits.cpu to say I just want three of them, and go back in there, I'm down to three CPUs. The container is still running; we apply all of these things live, no restart or anything needed. So that's our limits.cpu. Now if I want to limit memory a bit: 412 megs of memory, and that's it, it's done.

Now, I've honestly got no idea whether this next part is going to work, because that's running back home and my home internet is not super happy right now. Let me just find my usual test URL for downloads, because that's the only way I can really show you the network part of this. That's the usual one, and let's go back to having terminals everywhere. There we go. Let's see if that thing will download. Oh, okay, let's take an Ubuntu container, which does have wget installed by default, and... whoa, okay, so that's working, sweet. All right, so what I'll do is just run that download in a loop. There we go, start another shell, and... this one is actually using a profile. A profile is an aggregation of configuration options that you can then apply to a bunch of containers to make your life easier. I can either set the limit on the command line, like I did for the first one, or I can just edit the profile with a text editor, which apparently is nano on this machine. No idea why. So, if we go back there, we see it's downloading pretty quickly, about 500 megabits a second, which is reasonable. Now, let's do 10 megabits instead. Okay, I save that, and if I go back here, we see it going down and down and down, and it stabilizes at just about 10 megabits. And if I go back into my profile and bump that up a bit, to 100 megabits, we go back up to, what, 100 megabits. So that's pretty convenient. (Non-interactively, this boils down to the device properties shown below.)
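[Editor's note: the profile edit in the demo is just a device property; done from the command line it would look something like this, assuming the containers' NIC is called eth0 in the default profile.]

    # Cap ingress and egress at 10Mbit for everything using the profile:
    lxc profile device set default eth0 limits.ingress 10Mbit
    lxc profile device set default eth0 limits.egress 10Mbit
    # Bump it back up to 100Mbit:
    lxc profile device set default eth0 limits.ingress 100Mbit
    lxc profile device set default eth0 limits.egress 100Mbit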
If some container is misbehaving, you can just slam it down, and you don't have to figure out how to get tc to do what you want, because otherwise the container will be done doing whatever it's doing before you can figure out the actual command line to make it stop. So, that was pretty straightforward and useful. Now, I can also apply disk limits. Let's take the Debian container: it's got a device called root, which is its root file system, and that has a size property. I can set it to 10 gigabytes, and if I go in there, we're down to 10 gigabytes. Sweet.

So that's all nice and everything. Now, as I mentioned, LXD works across the network, so I've got a second host here. There's an 'lxc remote' command: you can list all the remotes you've got, and then run whatever command you want against a remote host just by prefixing names with the remote's name followed by a colon. We see there's nothing on that particular host. But I do have a bunch of containers here. Let's take that Alpine container there, which is doing absolutely nothing right now. How about I move it? Done. And that was actually done with live migration: the container wasn't stopped, it was just serialized, transferred, and restored with CRIU on the other side.

The same can be done with state saved to disk; let me show you with another container. Let's take the CentOS one. I can stop the CentOS container, saying I want to record its state. There we go: now the container is stopped, and if we look, there are no processes actually running from it anymore. And if I start it again, it's back. So it saved its state to disk and restored it from there. Pretty convenient if you want to do a quick kernel update: you can stop all your containers, serializing them to disk, do a kexec to reboot really quickly into the new fixed kernel, and then restore your containers. You get maybe five seconds of downtime, which most people will be fine with.

We can also create snapshots of containers. Again, take the CentOS container: I can create this new snapshot called blah for that container. If I go and look at it, we see it's got a snapshot at the bottom, marked as stateless. Now I can do the same command again with blah1 and pass the stateful flag. There we go. And now I can restore that particular container to one of those snapshots. Taking a bit longer than expected... what's going on? No, there we go. (For reference, the commands behind this part of the demo are summarized below.)

But none of that was particularly visual, because for all you know, that container might have been stopped and restarted every time. So let's do something slightly more interesting. Hold on, what's going on? Did my home internet just decide to die now? That would be really annoying. Let me just check real quick. Oh, it's just my laptop's Wi-Fi that died. Okay, so that I can deal with; my home internet dying would be way worse. Let me just bounce the Wi-Fi. Come on, NetworkManager, you can do it. Oh, okay, so that might be a tiny bit of a problem then. Well, let's see, it's reconnecting. Attempting to reconnect now. Scanning. It looks like it just reconnected, but the applet decided to crash, so I just need to restart it. There we go, and now the VPN, and hopefully I'll be back in business. VPN connected. All right, let's see if that session survived. Yay, all right. That output is not as nice as I thought it would be, but it's kind of working.
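[Editor's note: for reference, the commands behind the disk-quota, migration and snapshot parts of the demo are approximately the following. Container and remote names are illustrative; syntax as in LXD 2.0.]

    # Set a 10GB quota on the container's root disk (btrfs/ZFS backends):
    lxc config device set debian1 root size 10GB
    # Live-migrate a running container to another LXD host:
    lxc move alpine1 host2:alpine1
    # Stop a container while saving its runtime state, then resume it:
    lxc stop centos1 --stateful
    lxc start centos1
    # Stateless and stateful snapshots, then roll back:
    lxc snapshot centos1 blah
    lxc snapshot centos1 blah1 --stateful
    lxc restore centos1 blah1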
Okay, so I've got a container. All it does is run a tiny service that sends a number every second to another IP; the receiver side of it is possibly just netcat. That container is what you see here as streamer. So first, let's do a stop: streamer, stateful. Okay, so it stopped counting, as you'd expect. And I can start streamer again, and it starts exactly where it was. Now, let's take a stateful snapshot of that thing. There we go, the snapshot is done. We're at, from what you see on the screen, about 56 or 58, say, thereabouts. Now, let's restore that snapshot: we just went back in time. And lastly, let's just move that thing out of the way. The container is currently on this host; let's move it to the other one. That's going to take a little while, because the file system is pretty big. It first starts transferring the file system and such in the background; it doesn't actually stop the tasks or anything. Once it's done with the file system, it briefly freezes the tasks to transfer their state, and then we just wait for the other side. We should see it happen now. Okay, so now it's freezing... and restoring. And if I go back here, we see it's gone from this machine and it's on the other one. So that's why we need all those crazy kernel interfaces: so we can do that kind of stuff. Now back to slides. We've got 15 minutes; that's pretty good.

All right, so just to recap what I covered. Our position is that unprivileged containers are reasonably safe; the design makes them safe in that you cannot escape them. On top of that, there are a bunch of LSMs, as you know, that we can use as an extra safety net, and we make use of them wherever it makes sense. In the case of AppArmor, we're even working to get AppArmor stacking working, so that you can use AppArmor profiles against processes inside the container and have those confined in there too. That's going to be pretty useful for people who use these containers exactly like they use VMs, which is exactly what we're going for with LXD.

It's still much easier to DoS the kernel than we would like. As I mentioned, you can run the kernel out of PTYs or network interfaces or routing table entries, that kind of stuff, really easily, and that then affects everyone on that host. Cgroups do a pretty good job of letting you limit process count, memory, CPU, et cetera. For everything else, the go-to answer is kernel memory limits, and the problem is that not all kernel structures are the same size, so it's not particularly useful: if you set a tiny limit, it might prevent you from DoS'ing anything, but it will also prevent you from running anything useful. So we need something that's a bit more fine-grained there, whether that's tying more resources to namespaces where it makes sense, or adding more cgroup controllers for kernel resources that need to stay global but that we really want to be able to limit for a set of processes. (A sketch of the process-count limiting that does work today follows.)
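[Editor's note: since process count is the one kernel resource cgroups handle well today, here is roughly what that limit looks like in practice. The container name and cgroup path are illustrative; the pids controller needs a 4.3+ kernel.]

    # Through LXD's abstraction:
    lxc config set c1 limits.processes 500
    # Or directly against the pids cgroup controller
    # (the exact path depends on your cgroup layout):
    echo 500 > /sys/fs/cgroup/pids/lxc/c1/pids.max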
We also get a bunch of additional requests, because we run these unprivileged containers and we tell everyone to use them, and really not to use privileged containers unless you know exactly what you're doing and you're basically the only one ever running in there, or anyone you give access to that container also has root on the host anyway. So we do get some interesting requests, and that's why Seth Forshee has been working on things like FUSE inside the container.

We do have people who want to mount file systems inside unprivileged containers. When they use containers like they use virtual machines in the cloud, they like to attach a block device and mount it. We do have a patch set to allow that for ext4, but obviously it's behind a flag that the user has to switch on, which basically means: yes, everything in there is written by the same company that owns the host, and yes, we know it's not particularly safe, but we really, really want to do it. So we've got a flag for that. FUSE, on the other hand, should be safe by design; I mean, it was designed for unprivileged users to run file systems to begin with. So doing that inside an unprivileged container makes sense, and that's what Seth's work makes possible. We can then run squashfuse, or the FUSE equivalent of ext4, whatever file system, in there. We don't need block devices or anything; we can just mount whatever we want. It's not as fast, sure, but yes, you can do it.

There are some requests we're getting that are just not reasonable, though. There are good reasons why unprivileged users are not allowed to bump some particular sysctls, and having people request that we let them do that inside an unprivileged container is just not reasonable. We could add extra semantics in LXD itself to have LXD, which is root on the host, bump those for you when it makes sense, but not have users able to do whatever they want.

And as Tycho and I covered, checkpoint/restore is pretty hard. It's working, but only in some setups. That system I demoed was pretty standard: all those containers were a clean LXD installation on two machines with matching CPU flags, so that just works. But some file systems being mounted here and there cause problems with live migration; if you're using external resources, you've got some problems there; there are some network device types we can't serialize right now, et cetera. There's a whole bunch of things where we're basically getting the early users to file bugs and then prioritizing what we need to handle and serialize next for their particular use case.

And that's it. I've got a bunch of LXD stickers if people are interested, and our contact info and website are up there. As I mentioned, my home internet doesn't work so well right now; my IPv6 connectivity is kind of dead and my ISP is trying to figure it out. But when it does work, on our website you can click "demo" and you get a root shell inside an LXD container, with nesting support enabled so that you can start containers and stuff. It's meant for people to try the LXD experience, not for people to try to find zero-days in the kernel, not that I can really stop them. There's also a file hidden in /root on the host that you might be able to read if you find a way out. It's been working pretty well: so far about 20,000 people have played with it online, and nobody has managed to crack the server in any way. We do get a few fork bombs a day; not a huge deal, because we do set a limit, I think of 500 processes, in those environments. They use up their 500 processes and that's it: they just DoS themselves. Congratulations, you can't use that container anymore. But it's not a big problem for us, because you can still spawn as many other containers as you want on the side, and they work perfectly fine.
Since we pin the CPU, and we pin CPU time as well, they can't actually run us out of resources so easily. What we have noticed quite a bit is that when they try that kind of stuff, what usually happens is they eventually trigger the out-of-memory killer and it just kills everything in their container, whatever. It's not a huge deal for that particular use case. It might be a bit more of a problem if you run a VPS company, but for the demo website kind of thing, it works great for us. Network and everything else is extremely restricted in there, so they can't do anything too stupid. They still try, but yeah, they're not particularly successful.

And yeah, that's pretty much it. LXD is available in a bunch of distros, mostly Ubuntu because we work for Canonical, but someone is actively trying to figure out how to get it into Debian, it is packaged in CentOS and Arch Linux, and I'm sure some other distros have picked it up by now. It is written in Go, but all the container part actually goes through liblxc, which is a C library. So we use cgo to do all the interesting bits in C and just leave the running-a-multi-threaded-complex-web-server parts to Go.

Something I didn't show, but which is also pretty cool, is that we support a whole bunch of device pass-through. You can say: whenever I plug in a USB device with this vendor ID and this product ID, or just this vendor ID, pass it into the container. What happens is that LXD, through a uevent handler, notices the device showing up, creates the /dev/bus/usb path inside the container, and sets up the devices cgroup so that the container can write to it, and there you go. I gave a demo two days ago where I was doing ADB, the Android debug utility, from inside a container, because I didn't want to run a static binary that Google built for me on the actual machine.

And that's it. I think we've got about seven minutes for questions if there are any; otherwise, I guess we can just wrap up already. Any questions? No? Guess that's it then. Thank you.