Good morning. So, as Pierre David just said, I'm Stéphane Graber. I work at Canonical on the LXC and LXD team. I'm the project leader for all of those projects and have been for a number of years now. Today we're going to be talking about containers, and container security specifically. First things first: what are containers, really? A container is, effectively, system-level virtualization. There's no hardware enablement for containers. The host kernel does all of the virtualization bits. Effectively, the host kernel will give containers their own view of the system. Now, what that means can vary quite a bit. The containers can share some resources with the host. It's pick and choose for a lot of it. You can make a container that doesn't have its own network stack but shares it with the rest of the system. You can have two containers share their network stack. You can have two containers share a file system. You can really pick and choose. And that's then used differently in different container runtimes. On top of that, you can apply resource limits, which effectively lets you slice your physical machine into a number of containers, making it easier for planning in general but also giving you a bit more of that virtual machine feel, where you can dedicate actual resources. A bit of history of containers in general. It's not a new concept. Containers have been around way before Docker was a thing. Way before that. In fact, the LXC project that I'm leading is celebrating its 10-year anniversary this year. We've been doing containers in mainline Linux for a decade now. And containers had been a thing for another decade before that. The concept of containers effectively started in about 1999, when the folks from Virtuozzo started looking into the general idea of doing containers on a computer, because at that point nobody had containers yet. Then BSD jails were the first implementation of the concept just a year later.
Linux-VServer was the first implementation of the concept of containers on the Linux kernel; it was released in 2001. That was then followed by Solaris implementing their own concept of containers, called Solaris Zones. Those were introduced in 2004. And then the Virtuozzo developers, who had started all the way back in 1999, released OpenVZ in 2005 as, at the time, one of the best implementations of containers on Linux. One thing worth mentioning there: Linux-VServer and OpenVZ were both rather significant patch sets on top of Linux. You couldn't just take your normal Linux system and then run containers on it. You had to apply potentially hundreds of thousands of lines of changes on top of the Linux kernel to get containers. That was a bit of a problem. And that's why, around 2007, there were a number of discussions about implementing containers on Linux in a clean way, in the actual mainline kernel. Anyone who's been dealing with the Linux kernel community will know just how much of a challenge that can be. The leadership of the Linux community doesn't do kernel-wide design. That's not a thing. So nobody decided, hey, we're going to do containers on Linux, and then we need to change those 50 different subsystems. That's not how things work. Instead, there was a push for individual subsystems to implement bits and pieces of containers in a large number of places in Linux, which you could then piece together to create something that's like a container. The co-leader of the LXC project keeps saying that containers are effectively an illusion. It's not a thing. The Linux kernel doesn't know what a container is. There's no such concept. There are like 50 different pieces that user space, like Docker or LXC or whatever else, can piece together, and that gets you what you actually call a container.
That makes for a pretty loose definition of what a container is on Linux as a result, but that's kind of what happened. So the LXC project was started in 2008 as the reference user space implementation of containers on Linux. It's been evolving ever since. Every time a new kernel feature would land that could be used to support containers, we were adding that to LXC, and that's how the project grew. Docker started in 2013, so five years after that, effectively as, at the time, pretty much a wrapper script around LXC, giving it the layering and image workflow people are now used to, but using LXC in the background. They've since switched to their own implementation of the actual runtime, written in Go. And the project I mostly work on these days is LXD, which started in 2014 as a way of doing system containers differently, effectively. Now, for the somewhat more technical part of this, the different pieces: we've got a bunch of different technologies in Linux that, when you piece them together, effectively get you containers. The main defining feature would be namespaces. That effectively is what you use to give a process a different view of the system than the other processes on the system. We've got a namespace for the mount table, which lets you use a different file system for the container. We've got a whole namespace just for the host name, the UTS namespace; that's the only thing it does. It's just so that you can have a different hostname than the host. There's IPC, which separates the shared memory handling between containers. The PID namespace is what you use so that you get an actual init process and a process tree that looks separate from what the host has. There's the network namespace, obviously used to give you a separate network stack with your own eth0, or whatever device, and your own loopback. Then there's the user namespace, which, other than LXD, not many people actually use, sadly. That's the big security namespace.
It's what you use to effectively prevent just about every attack against your containers. More on that later. And lastly, the last one we implemented a couple of years ago is the cgroup namespace, which allows the use of resource limits inside the containers themselves instead of just to protect the containers. Then, for resource limits, we've got cgroups. Cgroups let you do things like limiting the amount of CPU time, limiting memory usage, and doing some amount of priority-type operations on block devices and network devices. There's also a cgroup that limits the number of processes that a container can run, which is effectively a fork bomb prevention mechanism, because otherwise a very easy way to completely kill your host was to run a fork bomb in a container. If the pids cgroup wasn't configured, the entire system would lock up. So that takes care of that. Then we've got LSMs. LSMs are the Linux Security Modules. They're used as an extra security mechanism to prevent things like file accesses, access to specific syscalls, or access to some specific privileged operations. The ones that handle file system accesses are typically AppArmor or SELinux. They work differently, but they effectively do the same thing. They let you mark a number of files in a container as "you cannot write to those things". That's how you effectively block the most dangerous files from privileged containers. Seccomp lets you filter syscalls. So any syscall that you think might be dangerous, you can completely block from the container. And lastly, capabilities let you turn off a whole bunch of kernel APIs in one shot, in a rather badly defined way. You've got a fixed list of capabilities that exist, but what they actually do is not necessarily well defined.
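To make the fork-bomb point concrete, here's a minimal sketch of the pids cgroup doing its job. It assumes cgroup v2 mounted at /sys/fs/cgroup; the group name "demo" is made up, and the whole thing needs root and the pids controller enabled, so it skips politely otherwise.

```shell
#!/bin/sh
# Sketch: the fork-bomb prevention described above is just the pids cgroup.
# Assumes cgroup v2 at /sys/fs/cgroup; "demo" is a hypothetical group name.
CG=/sys/fs/cgroup/demo
if [ "$(id -u)" -eq 0 ] && [ -f /sys/fs/cgroup/cgroup.controllers ] \
   && mkdir -p "$CG" 2>/dev/null && echo 100 > "$CG/pids.max" 2>/dev/null; then
    # move this shell in: it and all its children now max out at 100 processes,
    # so a fork bomb run from here hits the wall instead of taking the host down
    echo $$ > "$CG/cgroup.procs" 2>/dev/null || true
    cat "$CG/pids.max"
else
    echo "pids cgroup not configurable here; skipping"
fi
```

A runtime does exactly this for each container before starting its init, just with a real limit and a per-container group.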
So you can drop a bunch of them, and you've got to hope that the next person who implements a new syscall will actually do the right capability check as well, or the new syscall might just bypass it, and then, well, so much for blocking it. And lastly, we also have process limits and user limits. Those are also concepts in Linux that play a bit of a role in there. You can set process limits for your container's process tree to limit the number of files that the container can open, for example, or limit the size of shared memory and that kind of stuff. User limits are mostly an issue more than a feature. We'll get to that a bit later as well. But they let you set limits for a specific user ID, unfortunately across container boundaries. All right. So that was quite a bit of talking. The first thing we can do is just create a container from scratch, without a container runtime. All right. It's unfortunate that I don't have another screen to see what you're seeing. You're seeing the right thing. Good. So I've got a folder on my desktop which is effectively a root file system. In this case, it is a Debian buster root file system. Obviously, my host itself is running Ubuntu. So that's different enough. The first thing you'd want to do for a container, because it's a different root file system, would be creating a mount namespace. So we do that. We now have a new mount namespace. That's great. Then it's useful to have a mount, even if you're just bind-mounting the directory onto itself, because once you do that... I can actually show you that. You should have your own mount tree in there. Yep. The last line shows that you've got your own mount tree now. Now that you've got the mount tree for your container, you can mark that as the root file system for it. Oops. There we go. It's kind of counterintuitive that until you cd into your own directory, you're not actually in the new mount tree, and so pivot_root doesn't work.
So you need to cd into your current directory. Once that's done, I've technically pivoted. My shell doesn't really know about it yet, but now if I look at what we had before, I've got that /etc/debian_version file here. But if I cd / and look again, /etc/debian_version is there. So with that particular namespace now pivoted, the / of that container is the Debian tree. Now, that's all nice and fun, but it's kind of useless because it's going to be missing a bunch of file systems, like, in this case, proc. So we should mount those. So let's mount proc. Let's mount sysfs. There we go. Having a pts device is always a nice thing to have, if you like having a terminal. Okay, so we've got a new mount namespace, the root file system set up, and the basic file systems mounted, effectively. Now, if we look at the flags, I'm going to do pid, which is -p, and I'm going to ask it... that wasn't the one I wanted, was it? Oh yeah, I wanted fork, -f. So I want a new PID namespace and I want to fork into it. The reason why I need to fork is because when you create a new PID namespace, your current process itself still has a PID in the parent namespace instead of in your new namespace. You need to fork. Once you do that, I should see, yeah, a PID 1. So now I've got a new PID namespace and I'm the only PID inside it, so I become PID 1, as if I were the init process. At that point, you should mount proc again, because it's no longer showing the right view. Once you've done that, you can look at the process tree and you'll see that you just have yourself as PID 1. If we look at the network devices, we still see a whole bunch, including the wireless device I'm using right now. We can fix that and unshare a network namespace this time. And they're all gone. And lastly, because it's fun to have a hostname, we can unshare a UTS namespace, at which point we can set one. And if we spawn a new shell inside it, we've got a new hostname.
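The interactive steps above can be condensed into one sketch. The $ROOTFS path is an assumption (any extracted root file system, e.g. from debootstrap, would do), it needs root, and it skips if either is missing.

```shell
#!/bin/sh
# The manual container demo in one go: new mount/UTS/IPC/net/PID namespaces,
# pivot_root into a prepared rootfs, then mount the basic file systems.
# ROOTFS is a hypothetical path; adjust to wherever your rootfs lives.
ROOTFS=${ROOTFS:-/root/debian-rootfs}
if [ "$(id -u)" -eq 0 ] && [ -d "$ROOTFS" ]; then
    unshare --mount --uts --ipc --net --pid --fork \
        env ROOTFS="$ROOTFS" sh -c '
        mount --make-rprivate /              # keep our mounts out of the host
        mount --bind "$ROOTFS" "$ROOTFS"     # pivot_root needs a mount point
        cd "$ROOTFS"
        mkdir -p .old_root
        pivot_root . .old_root               # swap / for the new rootfs
        cd /
        mount -t proc proc /proc             # fresh proc for the new PID ns
        mount -t sysfs sysfs /sys
        umount -l /.old_root                 # drop the reference to the old root
        hostname container1                  # works: we have our own UTS ns
        echo "hostname: $(hostname)"; ls /
    '
else
    echo "needs root and a rootfs at $ROOTFS; skipping"
fi
```

That's the same sequence as the live demo, minus the detours.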
So that's pretty much how to set up a container using just the basic utilities that ship with most Linux distributions these days, without using Docker, LXC, or any of the container runtimes. And that kind of shows you all of the namespaces. The ones I didn't use here are IPC, because we wouldn't actually notice it anyway; the cgroup namespace, because, same thing, you wouldn't actually notice it; and the last one I didn't use is the user namespace, because it is, let's say, tricky to set up. We've got container runtimes that do that for us, and there's a good reason for it: it's complex. And back there. There are different types of containers, because obviously it wouldn't be fun if there weren't. On Linux, we mostly see three different types of containers. There are system containers, which is what I mostly work on. Those are effectively virtual machine alternatives. They don't require any change to what's run inside them. You can effectively take any physical system or any virtual machine or anything you've got and just stuff it in there, and it's going to run exactly as if it were a virtual machine. Obviously it's sharing the host kernel, but that's not actually a problem so long as it's Linux. The other type of container, which a lot of you are probably familiar with, is what we call application containers. Those are things like Docker. They're effectively a means of distributing applications. And because of the way they're usually managed, and to be a bit more ephemeral, they tend to be ideally stateless. Whereas in your system containers you will normally run your normal system management tools, monitoring, your updates, like security updates, all that stuff, exactly like a virtual machine. And the third type of container that's around, which most of you are probably using right now and have possibly never noticed, is embedded containers.
So those are effectively normal applications which are making use of some of the container features to protect themselves against someone attacking them. So let's look into some of that. Switching back to the demo. There we go. All right. So the first thing we can look at is a system that's running a Docker container. We're seeing it's a Nextcloud container that's running. If you look about halfway through the screen, you're going to see dockerd running, then containerd-shim, then effectively Apache running inside it. That's Nextcloud running for you. That's it. There's no actual init process in there. There's no syslog. There's no cron. There's none of any of those things. So that's an application container. It typically runs one or maybe two processes. You don't usually go in and apply updates with the package manager. You just replace it with a new copy of the image. The other type would be a system container. So let's create one of those. This one's going to be created using Ubuntu 16.04 as the image. There we go. And if we look at that, it's starting now. So it's the tree at the bottom. And it looks a lot more like what you're going to have on your laptop or a normal server. You can see it's running an init process. It's running journald. It's running udev. It's running a DHCP client. It's running a syslog server. It's running cron, all those things. You can enter that container. You can apply package updates, install packages as you normally would. You can run your configuration manager. It's effectively exactly like a virtual machine, except it's extremely cheap compared to a virtual machine. And the last type of container would be if I look at my own laptop. And oops, this one is a little bit messed up. Let's see now. Yeah, there we go. Okay. So Chrome, for example, let's look at one of the processes from Chrome. There's one thing you can do on Linux, though it's not always super easy to compare.
If you look at /proc/self/ns, it's going to give you all your namespaces and an ID for each of them. Now let's do the same thing with the process from Chrome. There we go. And if you start comparing them, you'll notice that Chrome's got a different network namespace. It's got a different PID namespace. It's got a different pid_for_children namespace, which comes with the PID namespace. It's got a different user namespace, and it shares the UTS namespace with the host. The reason why they do that is sandboxing. This thing is effectively a rendering process for Chrome, which means it's got no reason to ever connect to the internet. So even if you try to attack the rendering process, you won't be able to get out. The only thing that this process will see, if we go attach, yeah, we want the PID namespace from whatever process it was. Okay. I just need to unshare a mount namespace so that I can mount a new copy of proc in there. Okay. So that's what the Chrome process would see. It only sees itself running as the user. It doesn't see any other processes. It cannot try to attach or do anything to any other processes, because it just doesn't see them. Now let's do the same thing with network. What can it see as far as network? Well, an unconfigured loopback device, that's what it sees. It doesn't have any kind of network access whatsoever. That also means it can't connect to any abstract UNIX sockets either. So there's no way for it to connect to the network. And the last thing they do, and we can try and look at that too: they set up a user namespace. And what they do in there is map through your own user into their namespace, and they don't map anything else. That means even if the process has some way of trying to escalate to root, it can't, because there's no root.
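The comparison done by eye in the demo can be scripted: two processes share a namespace exactly when their /proc/&lt;pid&gt;/ns symlinks resolve to the same inode string. A small sketch, comparing this shell against PID 1 (reading PID 1's links usually needs root, so mismatches may just mean "unreadable"):

```shell
#!/bin/sh
# Compare the namespaces of two processes via their /proc/<pid>/ns links.
# Identical link targets (e.g. "net:[4026531992]") mean a shared namespace.
pid_a=$$    # this shell
pid_b=1     # init, as a comparison point
for ns in mnt uts ipc net pid user; do
    a=$(readlink /proc/$pid_a/ns/$ns 2>/dev/null)
    b=$(readlink /proc/$pid_b/ns/$ns 2>/dev/null)
    if [ -n "$a" ] && [ "$a" = "$b" ]; then
        echo "$ns: shared ($a)"
    else
        echo "$ns: different or unreadable"
    fi
done
```

Run the same loop against a Chrome renderer's PID and you'll see exactly the mix described above: shared UTS, separate net, pid, and user.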
There is literally no UID 0 in the Chrome sandbox, so it cannot switch to root, which eliminates an entire other class of potential problems. So now that we've seen the different types of containers, let's go to the next step, which is going through some of the confinement options. The Docker container I showed earlier, for example, is what we call a privileged container. That means that if you look at who owns what, and I'll show you that again in a tiny bit, you'll notice that some of the processes in there will run as actual root. Some of the processes in there will run as, like, UID 100 or something like that. Those are privileged containers. They are not root safe. They are a security nightmare waiting to happen. Because if you get to root in there, you are real root. And if you find any of a number of misconfigurations, or bits that are not up to date, or missing security confinement in the kernel, you can escape the container, and then you've got root on the host, which is a bit of a problem. Then you've got unprivileged containers. That's what LXD will do for you by default. That's why we created LXD in the first place. We wanted to use the newer security features available in the kernel, by default, in something that users can understand and use. Unprivileged containers mean that we take a big chunk, usually 65,536 UIDs and GIDs, completely outside of the range of normal IDs in use on your system, and we map to that. So UID 0 in your container is actually going to be, say, UID 1,000,000 on your host, meaning that, first of all, root in your container is only root against the things it owns itself. So even if it can see something from outside the container, it has no control over it. Second, it means that if you can somehow escape the container and get on the host, you're going to have just as many rights as a nobody user.
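The machinery behind that mapping is visible in /proc/&lt;pid&gt;/uid_map. A minimal sketch: enter a new user namespace mapping container-UID 0 to our own real UID (what `unshare --map-root-user` does), then print the map. This needs unprivileged user namespaces enabled on the host, so it skips if they aren't; a real runtime like LXD writes a much bigger map (e.g. a 65,536-ID range starting at 1,000,000) via newuidmap instead.

```shell
#!/bin/sh
# Show the UID mapping that makes a container's "root" not real root.
if unshare --user --map-root-user true 2>/dev/null; then
    unshare --user --map-root-user sh -c '
        echo "uid inside the namespace: $(id -u)"   # 0, i.e. root in here only
        # columns: inside-uid  outside-uid  count
        cat /proc/self/uid_map
    '
else
    echo "unprivileged user namespaces unavailable; skipping"
fi
```

With LXD's default map, that file would read something like `0 1000000 65536`: UID 0 inside is UID 1,000,000 outside.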
That was the real fix in the Linux community to make containers effectively root safe. That's why we really push for people to use those containers. Then you've got a third step, which is what we call isolated containers in LXD. That means that not only do we use the mapping technique, but we use a completely distinct map per container, which completely excludes one container being able to attack another, or one container being able to DoS another, because they do not share any UIDs or GIDs whatsoever. For any kind of environment where you allow completely random strangers to access containers and potentially get root inside containers, that's pretty much the only thing that's safe. Otherwise, they will be safe in the sense that they will not be able to attack your host, but they will sure be able to DoS each other. The other thing that's worth mentioning here, for those using Docker: there's a --privileged flag. Don't think that because you don't pass it, your containers are unprivileged. That's not true. Docker is privileged by default. That flag means it's so privileged that it's a security nightmare. It effectively turns off a number of default confinement options on top of their privileged containers. So the limited mitigations they had in place to keep privileged containers kind of sane are effectively turned off if you pass that flag. There is a way of running Docker unprivileged. My understanding is that a lot of the layers and images effectively don't really work well with Docker if you use that. So that's a bit of a problem. One alternative that works is actually running Docker inside an LXD container, because then it thinks it's privileged, but it's not. And things tend to work quite a bit better that way, actually. At least it works a lot better than running Docker unprivileged itself. So, a few things about why privileged containers are a terrible idea.
This is kind of your grocery list of stuff you need to address if you write a container runtime to make it vaguely sane. The first thing is you need to be using AppArmor or SELinux to restrict access to /proc and /sys. The reason for that is there are a number of files in /proc, for example the uevent helper or the core dump handler, that let you set a path that, if you do some specific actions, will be executed by the kernel as real root, outside the container, without any confinement. So if you can write to those files, you're screwed. You just trigger the action and you run as real root on the host. So those files must be blocked at all costs. The other one that's always kind of funny is sysrq-trigger, which, if you don't block it, lets the user do things like a memory dump, or reboot your system, or other potentially interesting things like that. That one you usually want to block as well. But there are a whole bunch more of those. A privileged container that's not properly restricted in /proc will be able to reconfigure memory mappings. There are probably 2,000 files in /proc/sys that are a terrible idea just thinking about it. /sys is kind of similar. It does allow triggering udev, which, combined with the /proc issue I just mentioned, lets you run any code you want as root on the host. But it also allows you direct PCI access, which means you can effectively DMA your way out of the container. So you need to block /sys as well. Then we've got some interesting system calls, let's say legacy system calls, that have been causing issues for people in the past. There was an exploit a while back called Shocker, which was effectively a way of getting out of the container. It was using a little-used syscall called open_by_handle_at. It's pretty interesting, because if your container is based on a bind mount, you can get a handle on a directory inode. And then the kernel will very nicely let you ask for a relative path to that.
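One common way runtimes neutralize those /proc entries, besides LSM deny rules, is to mask them: bind-mount /dev/null over each one before the container's init starts, so writes go nowhere. A sketch of that idea, with a short (non-exhaustive) list of the files just discussed; it needs root, and it does its mounts inside a throwaway mount namespace so the host is untouched.

```shell
#!/bin/sh
# Mask the most dangerous /proc entries the way a runtime would for a
# privileged container: writes to a masked file land in /dev/null.
dangerous="/proc/sys/kernel/hotplug /proc/sys/kernel/core_pattern /proc/sysrq-trigger"
if [ "$(id -u)" -eq 0 ]; then
    unshare --mount sh -c '
        for f in '"$dangerous"'; do
            [ -e "$f" ] && mount --bind /dev/null "$f" && echo "masked $f"
        done
        true
    '
else
    echo "needs root; would mask: $dangerous"
fi
```

A real runtime masks a longer list and pairs it with AppArmor/SELinux rules, since masking alone doesn't cover new files that show up in later kernels.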
So say you're in your container and you open /: you get the inode, and then you use open_by_handle_at to ask for ../../../.. relative to that. And what you get is actually a handle onto the real / of your system, which then lets you chroot into that, and you're out of the container, as real root, outside the container. So obviously that syscall is now blocked everywhere. Thankfully, we've got Seccomp, so unless you're using a completely new container runtime that someone wrote over the weekend, you're probably fine. Most of the big container runtimes have it fixed. There was a CVE assigned for that, and it's been fixed for quite a while. The other obvious attack is that if you don't configure cgroups properly, your privileged container will totally be able to mknod /dev/sda and then write directly to your disk. So that's bad. And same thing with TTY devices. The container could mknod /dev/tty1 or something, and then directly read and write to any of the TTYs on your system, making it very easy to snoop on you, get your password, inject whatever they want into your shell. It's great. So obviously you need to configure that properly as well. Then there are two kind of big hammers you can use to block a bunch of things. Privileged containers by themselves should be fine if no process runs as root in the container and you don't have any setuid binary that lets you get back to root. At that point, you don't actually have any privileges. You can still do some amount of DoSing by using ulimits, but at least you can't escalate back to root. So that's good. That's the approach that's effectively used for applications on Android. Each application runs as its own UID and GID, and that's how they do most of the privilege separation. The other hammer I mentioned earlier is capabilities. You can drop a bunch of them, like CAP_NET_ADMIN or CAP_SYS_ADMIN, and that will block a lot of stuff, except when it doesn't.
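The mknod attack is what the (v1) devices cgroup is for: even if the container manages to create /dev/sda or /dev/tty1, a deny rule makes the node unusable. A sketch with an allowlist approach, denying everything and then allowing only a couple of harmless character devices; "demo" is a made-up group name, and it needs root plus the v1 devices controller, so it skips otherwise.

```shell
#!/bin/sh
# Deny-by-default device access, the way runtimes confine privileged containers.
CG=/sys/fs/cgroup/devices/demo   # hypothetical cgroup name
if [ "$(id -u)" -eq 0 ] && [ -d /sys/fs/cgroup/devices ]; then
    mkdir -p "$CG"
    echo 'a *:* rwm' > "$CG/devices.deny"    # deny everything...
    echo 'c 1:3 rwm' > "$CG/devices.allow"   # ...then allow /dev/null (char 1:3)
    echo 'c 5:2 rwm' > "$CG/devices.allow"   # and /dev/ptmx (char 5:2), for terminals
    echo $$ > "$CG/tasks" 2>/dev/null || true  # confine this shell
    echo "device allowlist applied"
else
    echo "needs root and the v1 devices cgroup; skipping"
fi
```

On cgroup v2 the same policy is expressed with a BPF device program instead of these files, but the allowlist idea is identical.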
So it really depends on everyone in the Linux community doing their homework whenever they add a new kernel feature and adding the right capable() check in the kernel. If someone forgets one, then there's no capability check and the new feature bypasses it. It's a useful tool, but not something you want to entirely rely on, because there are going to be a number of cases where it's just not going to do anything for you. And then some of the other issues, which we effectively have fixes for. The first one, as I mentioned, is that you can set a ulimit in a container. That limit will cross container boundaries, because it's tied to a given UID. So you could have two containers, each running the same workload. If you've got control of one of them, you can set a ulimit on your current UID, which is going to be the same UID in the other container, and you could set a ulimit that says "I can only spawn one process", and that prevents the other one from ever spawning a sub-process. The kernel just ties it to a global UID, and it crosses into any container. So you can have a container affect the host, effectively, or you can have a container affect another, unless they are isolated containers, meaning unprivileged with non-overlapping UIDs and GIDs, in which case you don't have that issue. There is also a category of attacks, fixed somewhat recently, that was effectively a timing attack against executing processes inside a container. The way a process is injected inside a container is by attaching it to its namespaces, then dropping a number of privileges, applying the security labeling, and then, as the last thing, changing to the right user in the container and exec'ing the task you want. The problem you've got is that between the time it enters the container and the time it execs, it is technically in the namespaces of the container but still runs as real root.
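The per-UID process limit in that cross-container DoS is just RLIMIT_NPROC. A tiny sketch of the mechanism: lower it in a subshell and read it back. The point is that the kernel counts processes per UID system-wide, so every process with that UID, in any container that shares the UID, counts toward the same number; the 200 here is an arbitrary illustration value.

```shell
#!/bin/sh
# RLIMIT_NPROC, the per-UID process cap that crosses container boundaries.
(
    ulimit -u 200   # this UID may now have at most 200 processes, kernel-wide
    ulimit -u       # read the effective limit back
)
```

Isolated containers dodge this entirely: with non-overlapping maps there is simply no shared UID for the limit to bite on.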
And because of the way security checks were done in ptrace, a process inside the container could see that process appearing as real root for a few milliseconds, and if it could ptrace it immediately, it would be able to do code injection into it while it was still running as real root. That's now been fixed, so that root inside an unprivileged container cannot ptrace a process that's got a user that's outside of its namespace. So that's been resolved for unprivileged containers. It has not been resolved for privileged containers. So if you're root inside a privileged container, you can still do that and get to inject code inside a process that still has a number of references to the host. The other thing, which I believe is fixed in most of the runtimes by now, though we keep finding new ones every month or so that don't have it, is TTY pushback. Unless the container runtime effectively creates a pair of console devices, one inside the container and one outside of it, and then mirrors traffic between the two, stripping escape sequences, a process inside the container can write escape sequences that inject commands into the prompt of the attaching process. So the container could effectively send the escape sequence for: background the task, run rm -rf /. And as soon as you exec a task in the container, that backgrounds your shell, runs that as your current user, and you're done. LXC has had fixes for that for the past three or four years now, but we noticed runc didn't have a fix for it as of two months ago. So it's still something that every person who writes a container runtime has got to think about, because these are not kernel security issues; they are things that user space must do right. And there are a lot of those. All right. So, another quick demo.
I just want to show what privileged, unprivileged, and isolated containers look like. So, switching to that system. Right now, we've got one LXD container running. It's c1. If we go see its process tree, which I think is at the bottom, we see that its init process is running as UID 1,000,000. So it is an unprivileged container. It is running with a static map. So if I launch a second one called c2, it's going to use the exact same map as the first one. And there we go. We can see the beginning of c2 and c1, both running as 1,000,000. So that's not so great, because c2 can technically DoS c1. That's why we've got a specific flag to say, please give me a non-overlapping map. And we can see we've now shifted by 65,536 UIDs for that last container at the bottom. So now that one cannot affect c1 or c2, and c1 and c2 cannot affect it. The last category would be a privileged container, which we can also spawn. We obviously don't recommend it, but there are a few use cases where you might need that. So now we've got a c4 container at the bottom, which we can see is running as real root. So that's your three container types running there. Now, virtual machines, and virtual machines-ish. So, effectively, why would you use containers instead of VMs? It's something that keeps coming up, especially when you consider things like system containers that run the same thing as a VM. Well, I'll get into some more details of that, but density is a thing. You might be able to run, I don't know, 200, 300 VMs on a system. You might be able to run 20,000, 30,000 containers on that system. Now, we're talking mostly idle workloads: if you spawn a container that runs at 100% CPU, that's purely CPU bound, it will perform the exact same way as a virtual machine. But if you're looking at doing disk access, network access, any type of I/O, you get much better density.
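For reference, the three flavours from that demo map onto LXD configuration like this. security.idmap.isolated and security.privileged are real LXD config keys, but the container names and the ubuntu:16.04 image alias are just the ones used in the demo; the sketch skips if no reachable LXD daemon is around.

```shell
#!/bin/sh
# Launch the three container flavours shown in the demo with LXD.
if command -v lxc >/dev/null 2>&1 && lxc info >/dev/null 2>&1; then
    lxc launch ubuntu:16.04 c1                                   # unprivileged, default (shared) map
    lxc launch ubuntu:16.04 c3 -c security.idmap.isolated=true   # isolated: its own non-overlapping UID/GID range
    lxc launch ubuntu:16.04 c4 -c security.privileged=true       # privileged: real root, avoid if you can
else
    echo "no reachable LXD daemon; skipping"
fi
```

Comparing the owner of each container's init in `ps` afterwards shows the same 1,000,000 / shifted-range / real-root split as in the demo.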
The other place where you get massively better density is that containers, when they're idle, do not get interrupts. Those are handled by the host kernel in general. So you don't need to keep scheduling a tiny bit of time for every one of your VMs just to go see what that last interrupt on the network card was, or whatever. That lets you run a ton of them, especially idle ones, which, for a lot of people doing things like web hosting where, you know, it's your family website and it might get one visitor a day or something, means you're going to be able to run a lot of those, compared to dedicating even a small virtual machine to a single one of them. The other thing that's become a thing somewhat recently is effectively running container images inside a virtual machine. That's changed names a few times. It used to be Clear Containers by Intel; they've changed the name a couple of times since. Anyway, the idea is: hey, we've got that virtualization hardware stuff, how about we use that for security? Which, fine, you've got the hardware. But hypervisor attacks exist. There have been CVEs against QEMU and the other hypervisors reasonably often, specifically for things like virtual devices, which you need for a lot of that stuff. So saying that it fixes all your security problems is not quite true. In one case, you have to trust the Linux kernel. In the other case, you've got to trust the user space hypervisor, which runs as root and also lets you attack the kernel. The main downside is obviously that you lose the density, because you'll be running virtual machines. So all of the normal trade-offs of a virtual machine apply. But you also get the constraints of containers. You will not be able to easily see the entire process tree like I did. You're going to see a single process running.
You also won't be able to do things like sharing a GPU very easily between multiple containers. You won't be able to pass through USB devices, because for all of that stuff you're dealing with a virtual machine, so you're going to have to deal with virtual PCI buses and all of those things. And after doing all of that, you're still running a container image on the same kernel, just inside a VM, so you still can't really run other operating systems. You're not really getting the usual gains of running a virtual machine. So it seems like a bit of a bad compromise, I think. Do virtual machines if they make sense, or do containers if they make sense; don't try to have that in-between thing. But I understand that for some companies it lets them get the right checkboxes on security audits and the like, and that gets them to run containers anyway. So, that's a thing. I just want to go through a couple of real use cases of container security done right, using all the security bits I mentioned before. One of them is NorthSec. In here, you're just using Wi-Fi, that's fine. In the other room that we are setting up, we are running the CTF over the weekend, and we run a lot of containers. We effectively simulate a full internet for every single one of the teams; we're going to have 50 this year. Actually, I have the numbers here from last year: we were running a total of 272 containers per team, including 250 or so BGP routers per team, and 12,500-and-something containers total, running on four physical systems. We were running millions of routes, a very complex network using BGP all over the place. We were also running a number of virtual machines last year because of some Windows workloads, so we were running those off-site and connected them to our containers.
The thing that's worth mentioning is that because it's a CTF, and the entire point is to, you know, try to win it at all costs pretty much, we can't assume that people are going to play nice. So we do have one container per team that's obviously using its own UID/GID map, as I mentioned earlier. Inside that container, we've got 270 or so subcontainers for all of our routers and for the actual challenges in there. And all of them also have their own UID and GID map, just in case. I think we assigned something like 100 million UIDs and GIDs per team, so we've got plenty; we're fine. We also obviously limit the memory, the number of processes, and the CPU. We do actual CPU pinning for every team so that they can't hog the CPU and affect another team. We don't overcommit in general; we just don't. We've split the physical systems equally, and yes, you're welcome to try to run a bunch of fork bombs or try to use up the CPU, and the only people you're going to affect are yourselves. So have fun, but it's going to be really slow. You're going to manage to make yourself go offline, but the team next to you is not really going to notice. Here's what that actually looks like: we've got four physical systems, and each of those physical systems we've then divided into four virtual machines. The reason for that is mostly so that if the kernel blows up, we only lose a 16th of everything instead of a quarter. And then on top of that, we run between four and five simulations per VM, and in there, we've got about 300 containers each. So that's what the container listing kind of looked like last year: we had 244 BGP routers, we had 32-ish challenges, and we had an Ubuntu template we could use to create new challenges. That makes it pretty easy for us to copy systems. We can do any Linux distribution that you want.
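The per-team caps described above map onto LXD's `limits.*` configuration keys. A minimal sketch, assuming LXD is installed; the container name and the specific values are illustrative, not NorthSec's real settings:

```shell
# Illustrative per-team resource caps using LXD config keys
# (limits.cpu, limits.memory, limits.processes are real LXD options;
# the name "team01" and the values are made up for this example).
if command -v lxc >/dev/null 2>&1; then
    TEAM=team01
    # Pin the team's container to specific cores, so a CPU hog
    # only slows that team down, not its neighbours.
    lxc config set "$TEAM" limits.cpu 0-3
    # Cap memory and process count: a fork bomb hits these limits
    # and the OOM killer, instead of the whole host.
    lxc config set "$TEAM" limits.memory 4GB
    lxc config set "$TEAM" limits.processes 10000
fi
```

With no overcommit, each team's slice is fixed, which is what makes "you're welcome to fork-bomb yourself" a safe policy.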
We can run a challenge in a distro that's not been updated for the past decade if we need to; we've successfully run CentOS 3 on my laptop inside a container. That's a thing. We can really run some pretty crazy stuff, and we definitely do. Obviously, we very much don't want anyone to be able to get real root, so we try to make sure that doesn't happen, and it never has. We've been running containers for NorthSec since before NorthSec was called NorthSec, and we've never had any problem with that. The other thing that we run, or that I run, that's somewhat interesting, is that we give root shells to everyone on the Internet. That's on the LXD website: we've got a way for you to try it online, so you just click a button and you get a root terminal session in your web browser over JavaScript. And in there, we actually run a LXD daemon that you can then create containers in. So you're effectively in a container, and you create subcontainers. We've had about 10,000 people do that every month since October 2015, and we've not had a single person manage to escape it. At the beginning, we had a few people try to DoS it, unsuccessfully. The only thing that happens, if you try to run out of memory or run out of processes, is that the kernel's out-of-memory killer will be very happy to kill your entire container, and you're just out. But everyone else is perfectly fine. We run up to 32 users at the same time on the system. We give them a single CPU, 256 megs of RAM, I think five gigs of disk space or something like that, and I think we cap it at 200 processes. So if you hit 200 processes, you just don't get any new ones. And if you exceed your memory allocation, random stuff is going to get killed by the kernel for you; that random stuff might be your terminal. All the containers are obviously isolated from each other. The machine itself is thrown away and redeployed every day, just in case.
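Running a LXD daemon inside a container, so that visitors can create subcontainers, relies on LXD's nesting support. Here is a rough sketch of how such a session container could be configured, using real LXD keys but with an assumed image and name, and with the limit values quoted in the talk:

```shell
# Sketch of the try-it-online session container: nesting enabled so the
# visitor can run their own LXD inside, with tight resource limits.
# Image alias and container name are assumptions; the limits are from the talk.
if command -v lxc >/dev/null 2>&1; then
    lxc launch ubuntu:18.04 tryit \
        -c security.nesting=true \
        -c limits.cpu=1 \
        -c limits.memory=256MB \
        -c limits.processes=200
fi
```

Past those limits, the kernel's OOM killer and the process cap do the enforcement, which is exactly the "random stuff gets killed, maybe your terminal" behaviour described above.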
We've not had any problem with that, except in January when we took the service down for a few weeks for obvious reasons. Now, just a quick recap before we take questions. First thing: not all containers are created equal. You do have to keep in mind that you've got privileged, unprivileged and isolated containers, and that not all container runtimes are designed for the same things. With privileged containers, there's a big variety there too. You might have a privileged container like the ones that we run in LXD, for example, where, to the best of our knowledge right now, there is no way of escaping them. If you run it on Ubuntu with AppArmor and all those bits enabled, which we do by default, there is no way of getting root on the host that we are aware of at this point. That's not to say there won't be. And if there is, there's no guarantee that we'll be able to fix it, which is why we don't call privileged containers root-safe: it is not safe by design, it is an unsafe design that we've patched as much as we can to make safe. That's also what Docker does by default. So there may be something you can do there. And that's really the best-case scenario. If a container runtime forgot to block open_by_handle_at, then you're out of the container. If one of those runtimes didn't set up AppArmor, or the sysadmin turned it off on your system somehow, then you're out of the container. Same thing with cgroups, same thing with any of those pieces. If any one of the mechanisms I showed earlier is somehow disabled on the system, and you manage to get root inside that container, you're out. So those are privileged containers. For unprivileged containers, as I said, the main attack you're going to find is being able to DoS either a process on the host or another container next to it. That's still a problem.
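Syscalls like `open_by_handle_at` are typically blocked with a per-container seccomp policy. LXD denies dangerous syscalls by default; the sketch below shows how an admin could deny additional ones per container. Note this is an assumption about the exact key name: older LXD releases used `security.syscalls.blacklist`, which was later renamed `security.syscalls.deny`; the container name is also made up.

```shell
# Illustrative only: explicitly denying a syscall for one container.
# LXD's default seccomp policy already blocks open_by_handle_at; this
# shows the mechanism for denying extra syscalls (errno 38 = ENOSYS).
if command -v lxc >/dev/null 2>&1; then
    lxc config set c1 security.syscalls.blacklist "open_by_handle_at errno 38"
fi
```

The point of the recap stands either way: if any one layer (seccomp, AppArmor, cgroups) is missing or disabled, root in a privileged container is root on the host.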
And you can still do quite a bit of damage with that, but it's not a straight path to root on the host. And then for isolated containers, there are some edge-case kernel limits you can try to attack, like, I don't know, some TCP buffers, that kind of stuff. You can maybe try to DoS part of the network stack. We don't necessarily have all the right knobs in memory limits; not everything is accounted for perfectly. That said, these days I've not actually seen anyone able to DoS a 4.15 kernel; the memory cgroup has gotten a lot better and actually accounts for most of that stuff. So that's the attacking side. Effectively, if you find yourself inside a container, don't assume that because you're in a container, it's safe and there's nothing you can do. There's a good chance there's something you can do, and you can go further and try to get access to the host, which will then get you access to all the other containers. That might be interesting. If you're the sysadmin, you might want to audit your things so that someone can't do that. Just a thought. And if you're running production containers, well, really, kernel live patching is your friend. You really should consider it, because if you don't, and a nasty namespace-related CVE shows up in the kernel, you're going to have to reboot your systems immediately. You might have a workaround of blocking the specific syscall that's being attacked, but oftentimes those syscalls are going to be rather useful. If you're live patching, then there effectively will not be any downtime, or you're just going to have to block your users for a few minutes until the live patch lands, once the CVE is no longer under embargo. If you don't have any such mechanism, you should effectively have all your systems check for security kernel updates every 15 minutes or so, and if there's one, have a post-update trigger that reboots the system immediately.
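The "check every 15 minutes and reboot" fallback can be sketched as a small shell script. This assumes a Debian/Ubuntu host, where applying a kernel update leaves a `/var/run/reboot-required` marker; the function name and the cron wiring are illustrative:

```shell
# Sketch: detect whether a pending (e.g. kernel security) update needs a
# reboot. On Debian/Ubuntu, package hooks create /var/run/reboot-required.
check_reboot_needed() {
    if [ -f /var/run/reboot-required ]; then
        echo "reboot required"
    else
        echo "ok"
    fi
}

check_reboot_needed

# Example cron entry (illustrative) that would wire this up for real,
# rebooting as soon as a security update lands:
# */15 * * * * root [ -f /var/run/reboot-required ] && systemctl reboot
```

Live patching is still preferable: this fallback trades a window of exposure (and a reboot) for not having livepatch infrastructure.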
Otherwise, you can be attacked pretty easily. The other way is obviously to never allow anyone you don't trust inside a container, but that's not always practical, because people like to run web servers and whatnot in there. And those can usually be attacked easily enough, unless you perfectly trust all the PHP code that everyone you're hosting is running, but that seems somewhat unlikely. All right. So, I think we've got about 10 minutes for questions. There's also the link to that website I mentioned earlier if you want to try it online. We also have a bucket of LXD stickers at the entrance, on the tables, and at the bar, so if you want to grab some of those afterwards, you can do that. And there's my contact information.