Let's start again. So we're going to talk today about containers. Sorry about the screen. We're talking about container security. And as I said, I originally wrote a coloring book last year to explain what SELinux is. This year we wrote a coloring book to explain containers. But really one of the key things about the container coloring book is how we go through and explain the security of containers. So chapter one. Grab a coloring book when you come in — grab both coloring books. So chapter one, what we're covering here: often when people look to go to containers, one of the first questions from a security point of view they ask me is, should you use containers versus VMs? So really what we're trying to explain in the first section of the coloring book is the difference in security between different ways of running applications. One other thing, when we look at the — I'll get to that in a minute. Chapter two, we're going to cover what platform should host your containers. So looking at what is the host operating system you want to run your containers on. Number three is container separation. Most of you probably came to this talk thinking that we're going to mainly cover container separation, and we're going to spend the bulk of the time on that. And then chapter four is the thing that most people don't think about, but which is critically important: what content are you going to put in your containers? So let's get started. So we have the container coloring book, and this is the whole glossary for the container coloring book. The pig equals an application or a service. So when you have multiple pigs, there might be multiple services running inside of a single application. So think about one pig being an Apache web server front end and another pig being a database back end containing your credit card data.
It's usually an example I use because when you're talking to big retail companies, they're really concerned about this now. I did hit the button. Okay, so just think about it: every time we talk about pigs, we're talking about different services running on a system. So basically now we're going to look at where you should run your containers — where the pigs should live. The traditional way, the most secure way you can run applications, is the way we've been running applications for years and years, and that's on individual servers. So you run one application, the Apache front end, on one server, and you run your MySQL or MariaDB database on a separate server. If the first application, the web server, gets hacked into, do you have to reinstall the second machine? I hope not, because that would mean that any company that ever gets hacked into would have to reinstall every single machine that touches their network. So we have a strong belief that if one machine gets hacked into, you don't have to reinstall the other physical machines. So this is the traditional, most secure way. Nothing will ever be as secure as this as we move towards virtualization. As soon as you go into virtualization, we're giving up some security — we're trading security for usability. So the second way, which we've been working on for many, many years, is virtualization. Now you're running each application in its own VM on the same physical piece of hardware. In this case, this is the second most secure way of running applications, and as we get into talking about container separation, the way a container is going to break through its separation is by attacking the kernel. So a lot of what we're going to be talking about when we get to container separation is about how to shrink the attack surface on the kernel.
How can I stop a process from interacting with the kernel in such a way that it could activate a bug in the kernel and break out of its confinement? When I'm running in a VM environment, we have a very small attack surface on the host kernel, because each application has its own individual kernel. So say you were hacking into your Apache web server: first you'd have to break into the Apache web server that's running in the VM. Next you'd have to attack the guest kernel — break through SELinux or discretionary access control — to get root inside the guest. Now I'd have to attack KVM. The KVM interface to the host kernel is very, very small, so I'd have to find a vulnerability in the KVM interface in order to get to the point where I'm a process running on the host operating system. Now I have the SELinux wrapper on that, so I'm running as svirt_t. Now I'd have to break through SELinux, and the SELinux policy on that QEMU process that's running your VM is very tight — it doesn't allow much access — so you have an even tighter grip on that. QEMU also has other security features built into it. Only at that point do you finally get to where you can actually talk to the host kernel. So that's the process for breaking through a VM. When we go to containers, you're already talking to the host kernel. So no matter how secure we ever get with containers, they're always going to be a little less secure than running applications in individual VMs. In the coloring book we drew containers as a duplex, but we're changing the analogy a little bit: the closest analogy to running containers is actually an apartment building or a hotel, in that there's a single point of failure. The single point of failure in your apartment building is the front desk.
So if I can attack the front desk — attack the kernel — and get hold of all the keys, now I can break into every single one of the apartments. That's the analogy here. And similarly, in a container environment, since I'm talking to the host kernel, if I can figure out a way to trigger a bug in that host kernel, I'm able to take over the machine. Another way to think about this: if you've ever looked at security, they're always talking about local exploits versus remote exploits, and everybody worries about remote exploits a lot more than they worry about local exploits. Well, in a container environment, you are already local. You've already broken through that. Not only that, in a lot of cases you're already running as root — you already have root processes in the container. So the next traditional way that we run applications on server hardware is we run them on the same machine at the same time with no container separation. I've become famous for saying containers don't contain. That's kind of hyperbole, pushing it too far — I'll bring that up later in this presentation. But containers actually contain better than what we traditionally did, where we would run the Apache web server and the database on the same machine at the same time. Of course you have SELinux involved, but basically that's similar to a hostel: you have some limited separation between the applications. And then we get to the final one, which is setenforce 0 — SELinux disabled — which is the equivalent of the pigs sleeping in a park. So as we go down the list, the degree of separation determines, if one application gets hacked into, how well the others are protected. Everybody understand that? Now, one of the big advantages of containers over VMs — well, I think we've seen a few of these shirts around. If you go to stopdisablingselinux.com, then at least you won't have your pigs sleeping in a park. Isn't that your website?
No, it's not my website. Major Hayden did that — he's from Rackspace. He's not here right now; he already saw this presentation at the Red Hat Summit. So since we're doing a container talk, really what it comes down to is a tradeoff of security versus usability: getting more out of your machines and making them easier to manage. Containers end up being a sweet spot where you're getting better security than traditionally running multiple services on the same machine, but you're still not going to get as good security as running in VMs. Sometimes when I talk about this stuff, I explain to people that the best security for running containers is actually to run them on VMs, and then separate the containers out by what they do and what kind of data they contain. So for example, if we had multiple front ends for back-end databases with sensitive data in them, you'd still want to run your front-end containers in a bunch of VMs out in your demilitarized zone, but take your databases and run them inside your private network, also on multiple VMs. So you still separate your containers onto different VMs depending on what kind of data they need access to. So we're going to look at the apartment building. We've decided that the pigs are going to live in an apartment building: the best combination of resource sharing, ease of maintenance, and security. So here we get to the traditional story. What kind of apartment building do you want to live in? What kind of apartment building do you want to put your pigs into? Really what this comes down to is what kind of host operating system you want to run your containers on. And in the story of the three pigs, the first house is straw.
So one of the things that's going on right now in the Docker community is people are building their own host operating systems — boot2docker, for instance. People are throwing together things like CoreOS; people are building random operating systems out there to run their containers. And they really want to get to the point of having a minimal operating system to run them. The problem with building your own — with running containers on a do-it-yourself platform — is that you are in charge of all the security of the containers. Does your platform have SELinux? Who's writing the SELinux policy for it? How do you know it's secure? Who's doing updates? The most important thing in a container environment is making sure your kernel is secure. You've got to have good security updates. You can't be taking the latest upstream kernel; you've got to make sure that the kernel is stable and secure, and when there's a vulnerability, it gets updated quickly. So if you're running on a platform that you built yourself, you end up taking on responsibility for all the security updates to that machine. The second best one — the house of sticks — is where Fedora is, and that's community platforms. If you're going to run your containers, the second best option in my opinion is to run them on a community platform, and that's one of the reasons we're talking about Fedora Atomic Host, which is the platform we'd like you to run on, at least as a community platform. The nice thing about a community platform: you get better security updates, a more stable system, a more managed system. If anything happens, you get updates you can rely on — say, the Fedora community updating it for the limited amount of time the Fedora community supports your host operating system. And then obviously, since I work for Red Hat, brick is RHEL.
So when you get down to it, if you're running a company in production — bet-your-business — you still want to think about the host operating system. You want a company that is putting its resources into maintaining and managing the host kernel. One of the interesting things about all these companies that are building host platforms right now: I always ask how many kernel engineers they have working on it. At Red Hat right now there are probably over 100 kernel engineers. We are putting the investment into maintaining the most stable Linux kernel there is, and we support it for multiple years. So the last house: if you want it supported, run it on a RHEL platform. If you look at anybody else building container platforms right now, they're just taking the stock upstream Linux kernel and running with it. They have no kernel engineers supporting it, no fixes being backported, no performance work, nothing like that. So think about that — again, think about the host operating system. Okay, so now let's get to pig separation. This is probably the real meat and the most interesting part of containers. We start out with the idea that these container processes, for the most part, are going to have root access on the system, and what we're going to try to do is take advantage of things that the host kernel, the Linux kernel, provides us to give better separation between the containers. And again, the wolf has broken into container one. He's broken through the web server pig's compartment, and now he's trying to attack the database that has the credit card data. So this is where the line I mentioned earlier comes in: containers don't contain.
And again, I'm stepping back from that a little bit: containers do not contain as well as VMs, and containers do not contain as well as separate servers, but containers do contain better than individual services running on the same system at the same time, and obviously better than with SELinux disabled. The reason I talked about "containers don't contain" is I see a trend of people running containers and making bad decisions. As people jump into containers, they start forgetting about all the security practices that they've learned over the last 15 to 20 years of using Linux systems. They're starting to do things like running more processes as root and downloading random crap from the internet — the famous quote, which I'll get to in a second; I use it often, actually. So what I want to tell you is: even though containers don't contain as well as VMs, if you have good security practices, you probably should not panic. Treat container services just like regular services. You want your services to continue to drop privileges. If you're running Apache inside of a container, it still should run as the apache user, a non-privileged UID — you shouldn't be running your Apache service as root just because it's running inside of a container. You should only run services as root for as short an amount of time as possible. Treat root inside of a container the same as root outside of a container. Don't make any security assumptions about root inside of a container. As for who really should worry about containerized platforms: if you're running traditional services inside of your containers, you're fine. It's the multi-tenant services — people like OpenShift, and anybody that's running lots of untrusted applications — that have to worry a little bit more about this. But pretty much what I'm saying here is always treat them the same.
Whether you're going to run Apache on your host system or in a container, it should be treated with the same security concerns. And here's where the quote comes in — he's in the back of the room; I don't think he said "crap," but I had to clean it up for PG-13: "Docker is about running random crap from the internet as root on your machine." I'm going to go heavily into that when we get to the last section. But basically what's happening with Docker is people are going out to docker.io and saying, "I need a version of Hadoop," and they're just pulling down a random version of Hadoop and running it as root on their machine. They have no idea who created it. They have no idea if there have been security updates applied to that container. You just go out and grab a random thing. So think about that when you're running your containers. That is basically a bad idea — a very bad idea. Again, you're running it privileged and you're expecting the container technology to protect you, and in a lot of cases it won't. So only run containers from trusted parties. If you've traditionally gotten your software from Oracle, get your Oracle container from Oracle — don't grab a random Oracle container off the internet. If you get content from Red Hat, grab it from Red Hat. If you get it from Fedora, grab it from Fedora. Don't go to some random website where someone says, "hey, I've got a great new version of XYZ, come download it." There's no way you can trust that application. So why don't containers contain? Well, I've explained a little bit about the attack surface on the kernel. When we talk about containers — if you read through the container coloring book, you'll notice this concept of namespaces. What namespaces do is give you sort of that virtualization feel. There are things like the PID namespace, so you only see the processes inside of your container, you know what I'm saying?
With the network namespace, you have your own network that's separate. But the real problem is that not everything in Linux is namespaced. Not everything in Linux is virtualized — if you virtualized everything in Linux, it would become KVM. Containers are not comprehensive like KVM. Things like /sys, /proc/sys, cgroups, SELinux, devices, kernel modules: you don't have your own kernel, so you're not isolated from changes. If one process can change the kernel, then you're not isolated from it. All this stuff is not containerized, so what we need to do inside of our container is start to block access to some of these things. The first thing we look at when we're trying to contain the process inside the container: one way that you interact with the Linux kernel is you write to kernel file systems — /proc, /sys, /sys/fs/cgroup, /sys/fs/selinux; there are hundreds of these files. In a lot of cases those file systems have to be inside the container in order for the processes to work. Processes expect to be able to read /proc, /proc/sys, things like that. So what we can do is mount these all read-only. We can mount /sys, /proc/sys, /proc/sysrq-trigger, /proc/irq, and /proc/bus all read-only. And by mounting them read-only, for the most part you can still do almost everything you want to do inside of a container, but we've blocked certain access. Most applications work perfectly fine with these mounts read-only. The next thing — and I'm going to try to race through some of this since we're running out of time — is capabilities. Linux capabilities were an effort over the last 15 years to take apart the power of root. Everybody understands root can do anything it wants on a system. Well, that's not really true: what root is, is a process that has all capabilities.
So the idea of capabilities was to limit the amount of control that a process has on the system to only certain capabilities. A quick history lesson. Right now there's a rule that an unprivileged process can't bind to ports less than 1024. Anybody know why that is? It's historical — but why did they do it? The idea was that the port number could be used to identify who set up the service: any process that was able to bind to a port less than 1024 had been set up by an administrator that you trusted. Back in the early 1970s, when the internet was first coming on and there were only a handful of computers in the world on it, all from universities, it was thought that whoever could bind to a port less than 1024 was trusted. So you could basically say, oh, I can trust that web service — well, web servers didn't exist yet — but any service on the internet below port 1024 came from the trusted administrators. Anything above 1024 was from those untrusted students. And we have carried that tradition forward from the early UNIX era ever since. It makes absolutely no sense in the modern world, but we still have that rule that you can't bind below 1024. So when we broke apart the power of root, one of the capabilities was CAP_NET_BIND_SERVICE, which allows you to bind to ports less than 1024. So you could give an application just the ability to bind to low ports. Another capability is the ability to open raw IP sockets — the ping command only needs that power of root, so we broke that out too. Originally the kernel developers allocated 32 bits, 32 different capabilities, and started to break apart the power of root into those 32 capabilities. The problem is that the capabilities quickly grew and filled up the numbers. So what we do in containers is drop a whole bunch of capabilities from the processes.
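You can see that port rule from any unprivileged process. A quick sketch — nothing container-specific here, just the kernel's sub-1024 check:

```python
import socket

def try_bind(port):
    """Try to bind a TCP socket on localhost and report what the kernel says."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
        return "bound"    # we're root, or we hold CAP_NET_BIND_SERVICE
    except PermissionError:
        return "denied"   # unprivileged processes can't bind below 1024
    except OSError:
        return "in use"   # something else already owns the port
    finally:
        s.close()

# An unprivileged process typically sees "denied" for port 80 but
# "bound" for a high port; root (or CAP_NET_BIND_SERVICE) sees "bound".
print(try_bind(80), try_bind(8080))
```

With capabilities you can grant exactly this one power — for example `setcap cap_net_bind_service=+ep` on a binary, or `--cap-add NET_BIND_SERVICE` on a container — without handing out the rest of root.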
So this is a quick list of some of the capabilities that are not available by default to processes inside of a container. We're really trying to take that power of root and drop it down as much as possible. One of the problems with capabilities was that as the kernel engineers started to run out of capability numbers, they started to pile more and more responsibility onto single capabilities. The big one is CAP_SYS_ADMIN, which basically became root: a kernel engineer would say, "this requires a lot of privilege, I'll just check for CAP_SYS_ADMIN." So CAP_SYS_ADMIN became this incredibly powerful capability. And the nice thing in Docker right now is we can actually remove CAP_SYS_ADMIN. We also remove CAP_NET_ADMIN, which allows you to configure things like iptables rules, firewalls, stuff like that — all of which has to be set up before the container runs. So here's CAP_SYS_ADMIN; this gives you an idea of everything that's gated by CAP_SYS_ADMIN, and the really interesting one here is mount. Remember I talked earlier about mounting file systems inside the container read-only? Well, if I have the ability to mount, all I have to do is remount read-write and now I can write to them. So you can't run the mount command inside of the container — it's blocked. A whole bunch of other stuff here — loading kernel modules, things like that — all required CAP_SYS_ADMIN. So by removing CAP_SYS_ADMIN and CAP_NET_ADMIN, we dropped the power of root down quite a bit inside of the container. I could show a list of all the capabilities we still allow, which is probably about 10 capabilities — things like setuid: I want to be able to change from root to the apache user, so I still need CAP_SETUID. So certain capabilities are still allowed.
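To see how much of root's power a process actually has left, you can decode the capability bitmask the kernel reports in `/proc/<pid>/status` (the `CapEff`/`CapBnd` lines). A minimal sketch; the name table is the original 32 capabilities in kernel bit order, and the example mask is hypothetical:

```python
# The original 32 Linux capabilities, in bit order (bit 0 = CAP_CHOWN).
CAP_NAMES = [
    "chown", "dac_override", "dac_read_search", "fowner", "fsetid",
    "kill", "setgid", "setuid", "setpcap", "linux_immutable",
    "net_bind_service", "net_broadcast", "net_admin", "net_raw",
    "ipc_lock", "ipc_owner", "sys_module", "sys_rawio", "sys_chroot",
    "sys_ptrace", "sys_pacct", "sys_admin", "sys_boot", "sys_nice",
    "sys_resource", "sys_time", "sys_tty_config", "mknod", "lease",
    "audit_write", "audit_control", "setfcap",
]

def decode_caps(mask):
    """Turn a CapEff-style bitmask into the list of capability names it grants."""
    return [name for bit, name in enumerate(CAP_NAMES) if mask >> bit & 1]

# A hypothetical container that kept only setgid/setuid and net_bind_service:
mask = (1 << 6) | (1 << 7) | (1 << 10)
print(decode_caps(mask))  # ['setgid', 'setuid', 'net_bind_service']
```

On a real system you'd read the hex mask out of `grep Cap /proc/self/status` and decode it the same way; note how `sys_admin` is bit 21, the one the talk singles out.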
I mentioned namespaces earlier; usually when I talk about namespaces, I tell people namespaces are not really a security tool — they just change your view of the world. But you could consider namespaces security in that things like the PID namespace eliminate your ability to see the other processes on the system: you only see the processes inside the container. It gets hard to attack other processes on the system if you can't even tell they exist. Similarly, the network namespace puts you on a separate network, different from the host network. So you can take containers and, for example, have one container that can only see the internet, another container that can only see the intranet, and maybe a third container that sees no networking at all but communicates between the two. You can basically isolate how a container's networking deals with the world. Cgroups: so we've talked a little bit about eliminating the ability to attack the kernel via capabilities, and about the ability to attack the kernel through file systems. Another way you can attack the kernel is through device nodes. So we want to eliminate the ability for containers to create random device nodes. If I could create device nodes — if I could get to /dev/sda1 or something like that — I could attack the host file systems. So if you run inside a container, you only see these devices: /dev/console, /dev/null, /dev/zero, /dev/full, the PTYs, /dev/random and /dev/urandom. Those are the only devices that you can see inside the container, and the only ones you're allowed to create. That's actually controlled by a cgroup — I believe it should have been a namespace, but this is the way the kernel guys designed it.
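The allow-list lives in the devices cgroup. As a sketch of the idea, a container's `devices.list` ends up looking something like the fragment below — character-device major:minor pairs with read/write/mknod permissions. The exact entries vary by Docker version, and the comments are added here for readability (the real file has none):

```text
c 1:3 rwm    # /dev/null
c 1:5 rwm    # /dev/zero
c 1:7 rwm    # /dev/full
c 1:8 rwm    # /dev/random
c 1:9 rwm    # /dev/urandom
c 5:1 rwm    # /dev/console
c 136:* rwm  # /dev/pts/* (the PTYs)
```

Anything not on the list — a disk like /dev/sda1, for instance — can be neither opened nor created from inside the container.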
So you can create a device cgroup that lists the only devices you're allowed to interact with from a container. We also mount all images nodev, which means that if someone installed an image with a device node on it, the kernel wouldn't allow it to be used. Now we get into SELinux. So this is where I usually have everybody stand up, because this is the thing no one seems to understand. So let's start — repeat after me. Actually, I'm going to have you guys just read it; I'm not going to say it. Go ahead. SELinux is a labeling system. Every process has a label. Every file, directory, and system object has a label. Policy rules control access between labeled processes and labeled objects. The kernel enforces the rules. Okay, sit down. Now you know SELinux. That's all SELinux is: I have a process label and I have a file label, and I'm controlling access between the labels. I'm saying that the process label can only read and write certain file labels. If that isn't simple enough, now you can grab your textbook — the SELinux coloring book. I don't know why the screen keeps flipping; must be me moving around. So let's look at it. We're going to dumb this down as far as humanly possible. We have two processes running on the system: I've got a cat and I've got a dog. You can think of these as being two containers. Then I define content on the system: I have cat chow and I have dog chow. And then I write a rule into the kernel. The rule in the kernel says I allow the cat process to eat cat chow — food is the class of object — and I allow the dog process to eat dog chow. So they're each able to eat their own food. When the dog tries to eat the cat food, it's denied — the kernel steps in and denies it, because everything by default is denied. This is type enforcement, as simply as I can put it.
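In SELinux policy language, the cat-and-dog story is literally just a pair of allow rules; anything without a matching rule is denied. A toy sketch — not real policy, real rules use types like `httpd_t` and object classes like `file`:

```text
allow cat_t  cat_chow_t : food eat;
allow dog_t  dog_chow_t : food eat;
# There is no "allow dog_t cat_chow_t : food eat" rule,
# so when the dog tries to eat cat chow, the kernel denies it by default.
```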
And if you look at containers, they all run with the same label — the same type, which we define as a container type. So in type enforcement, I protect the host from the container processes. Type enforcement, as I just explained with the cats and dogs, is only protecting the host from the container processes. Container processes can only read and execute /usr files: the policy basically says that if I mount /usr into the container, then the processes in the container are allowed to read and execute those files. All processes inside the container run with one type, svirt_lxc_net_t, and then we have svirt_sandbox_file_t: every piece of content inside of the container by default is going to be labeled svirt_sandbox_file_t, and that's the only type that container processes are allowed to read and write. So we set the policy around those types. What Docker does is change the labels of the image's content to svirt_sandbox_file_t, and then it launches the process as svirt_lxc_net_t. So if a process were able to break out of the container, it's going to be blocked by SELinux from touching /var, blocked from touching /home and /root. But using type enforcement, we're only protecting the host from the containers. If all my containers are running with the svirt_lxc_net_t label and all the content is labeled svirt_sandbox_file_t, what does that mean? It means that containers can attack each other. SELinux type enforcement doesn't block containers from attacking each other. What does block that in SELinux is a thing called MCS. All right — you might have heard the term sVirt; it comes from virtualization.
So sVirt was developed — we invented sVirt probably about eight years ago now — to contain virtual machines. What sVirt does is take advantage of another part of the SELinux label, the last part: usually when you look at SELinux labels, you'll see it as s0. What we can do is change that to a somewhat random category. So Docker picks out a random number and associates it with the content and the process label — I'm going to show that in a second. It's Multi-Category Security, based on MLS, but let me go through the example. In the first example we had cats and dogs, but what happens if I want to separate all my dogs from each other? In Docker, everything's a container of the same type. So here we have Fido and Spot, and we define another part of the label — the :Fido and :Spot here. So now we have dog chow that's Fido's and dog chow that's Spot's. We allow dog Fido to eat his food, but we're now preventing him from eating Spot's. So this uses MCS labels, and it allows us to protect containers from each other. Container processes can only read and write their own files. What Docker does is pick this random part at the end of the SELinux label and change it to something that looks like that. So if you were running 10 different containers on your system and you did a ps command on the host operating system, you'd see a whole bunch of processes, each one having a slightly different MCS label. Docker then assigns that MCS label to all of a container's content and processes: it takes the MCS label, assigns it to the process, assigns it to the content, and picks a different one for the next container. So Docker is guaranteeing that each container gets a different MCS label.
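What Docker does at container start amounts to drawing a random, unordered pair of MCS categories and tacking them onto both the process label and the file labels. A rough sketch of that selection — the real implementation lives in Docker's SELinux support code; the 0–1023 range here is an assumption about the conventional category space:

```python
import random

def random_mcs_label(sensitivity="s0", categories=1024):
    """Pick two distinct random categories, lowest first, e.g. 's0:c12,c619'."""
    c1, c2 = sorted(random.sample(range(categories), 2))
    return f"{sensitivity}:c{c1},c{c2}"

# Each container gets its own pair; its processes and files share it,
# so a process from one container can't touch another container's files.
print(random_mcs_label())  # e.g. 's0:c402,c871'
```

Two random categories out of 1024 gives on the order of half a million distinct pairs, which is why collisions between containers on one host are not a practical worry.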
This is the same technology that we used for separating OpenShift instances — in traditional OpenShift, each instance got the same treatment — and it's the same separation we used for secure virtualization. We've also been using it for the SELinux sandbox. So basically I discovered a technology, it became my hammer, and all I do is look around for nails. Really what I'm looking for is how to separate processes that are in basically the same security realm but that we want separated from each other — multi-tenancy, I guess, would be the correct terminology. So: Docker without SELinux is like Tupperware without the burp. Anybody old enough in the room understands the old commercial; everyone else goes, yeah, I have no idea what this is. Okay, so the future. That's basically everything that's in Docker as of today; in the very close future, there are two new technologies coming. Seccomp: seccomp basically allows you to shrink the attack surface on the kernel even further, this time by eliminating syscalls. Another way you can attack the Linux kernel is through the syscall table. On an x86_64 machine, you have about 650 syscalls. If a bug can be activated in the kernel via any one of those syscalls, you could potentially take over the kernel. But if I can eliminate a large number of those syscalls, and the bug exists in a syscall that the container process doesn't have access to, I've shrunk the attack surface. So seccomp allows you to shrink the number of syscalls available to processes inside the container. It was actually originally developed by Google for the Chromium browser: what they wanted to do was take the plugin infrastructure and cut the syscall table way down to only the syscalls that are needed for plugins.
Seccomp is actually also used inside systemd right now, and it's used in QEMU for running virtual machines. Here's a really nice thing: if you look at the size of the syscall table on x86_64, it's about 650 entries; on a 32-bit machine, it's about 325. See the symmetry there? The reason is that 64-bit machines can run 32-bit code. If I'm not going to run 32-bit code inside my containers, and probably nobody does, I can instantly cut the attack surface of the Linux kernel in half with seccomp. Now, if there's a bug in any one of the 32-bit syscalls, it's simply no longer available to the process. We're also looking at dropping other syscalls. People have been investigating which syscalls can be eliminated: the ability to load kernel modules, for instance. open_by_handle_at is a really interesting one. Docker had a vulnerability very early on: you were able to figure out the inode of the root of the filesystem, and with open_by_handle_at, if you can specify the physical inode, you can open the file and walk the tree directly, even if that file is not mounted into your container. Docker had a breakout caused by that. So we're eliminating a lot of these syscalls, most of them around loading kernel modules and things like that, and we can block the 32-bit syscalls. Another thing we can block is old, weird network protocols. How many people in the room are using DECnet? AppleTalk? IPX? NetWare? Okay, so there are all these weird old protocols; most are turned off in modern kernels, but it'd be really nice to just say: I'm not allowing any of these networks, in case they're enabled in the host kernel. So we're going to block all the weird networks. Now, seccomp support has been held up. We've actually had most of the seccomp work done for six to nine months; it's been held up because of the lawyers.
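To make the "eliminate syscalls" idea concrete, here is a sketch of a Docker-style seccomp profile, built in Python and dumped as JSON. The exact JSON schema changed across Docker releases (for example `name` versus later `names`), so treat the field names as illustrative rather than authoritative:

```python
import json

# Default-allow, but return an error for the dangerous syscalls discussed
# above: module loading and open_by_handle_at (the one behind the Docker
# breakout). A stricter profile would instead default-deny and allowlist.
BLOCKED = ["init_module", "finit_module", "delete_module", "open_by_handle_at"]

profile = {
    "defaultAction": "SCMP_ACT_ALLOW",
    "syscalls": [{"name": name, "action": "SCMP_ACT_ERRNO"} for name in BLOCKED],
}

profile_json = json.dumps(profile, indent=2)
```

Once seccomp support landed in Docker, a profile file like this could be handed to the daemon per container (via a `--security-opt` seccomp option); the kernel then rejects the listed syscalls before any kernel code that might contain the bug ever runs.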
It's a licensing problem. Docker is built in Go, and Go builds static binaries. How do you ship something under the Apache license that statically links a library under a different license? It combined a lot of licenses, and that held it up. But it looks like for Docker 1.9 we'll hopefully have libseccomp support: it's just about to be merged into one of Docker's sub-packages, libcontainer, and then it should get rolled into Docker. The next one is user namespaces. How many people have heard of user namespaces? Okay, how many think user namespaces are going to be the great savior of the world? Because I've been fighting that perception for a while. User namespaces are sort of the holy grail of the way people understand separation. What a user namespace is supposed to do, and will be able to do, is let you run root processes inside the container that are not root outside the container. It's a UID mapping: you can basically say, I'm going to map UID 5000 on the host to UID 0 inside the container. Now I can create all my content in the container owned by UID 5000, and the kernel does the substitution. If you're in the container looking at the processes, whoami says you're root. If you go out and ask the host operating system about the same process, it says: oh, that's UID 5000. The idea is that you can give some capabilities back to the process inside the container while having them totally eliminated outside the container. The biggest problem with user namespaces right now, well, there are two huge problems. First, there's no filesystem support. If you want to share the same container image between 10 different containers, each container has a different idea of what the root UID is. Now I've got to go through all my containers and chown every single file owned by UID 0.
You also want ranges of UIDs per container. If you still want separation between root and non-root inside the container, say root and Apache running as UID 60, you have to set up a range: you'd say zero to 1000 inside is 5000 to 6000 outside. As you scale the number of containers, the management of that, Simo's shaking his head in the back of the room, I know, you'll get over it; we can talk later about the problems here. Anyway, this is why people think it's the holy grail of separation, but the biggest problem with it is management. You have all the problems you've always had with managing multiple UID ranges and different identities: NFS mounted into multiple servers, things like that. And you have the filesystem problem. We have some ideas we're going to be fooling around with over the next few weeks to try to improve the filesystem situation, but we'll see; it's still real primitive. What Docker is rolling into 1.9 is a very simplified version of user namespaces where every UID maps to itself: one maps to one, two to two, three to three, all the way up, and only UID zero maps to something else, shared across all containers. That provides protection of the host from the container, roughly equivalent to what SELinux does with type enforcement. That's the current state of user namespaces. A couple more things about user namespaces. Since it's using UID separation, anything world-readable in any one of the containers can be read by any of the containers. One of the problems they found doing SELinux on Android, SE Android, was that lots and lots of applications are built stupidly. So if my database happens to be in my container, my database holds credit card data, and it's world-readable, user namespace separation is not going to protect you.
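The range mapping described above is exactly the arithmetic the kernel applies from `/proc/<pid>/uid_map`, where each entry is (UID inside the namespace, UID outside, length of the range). A small sketch of that translation, with the talk's numbers:

```python
def to_host_uid(container_uid, uid_map):
    """Translate a container UID to a host UID using uid_map-style
    entries: (uid-inside, uid-outside, count).

    A sketch of the mapping arithmetic described above, not kernel code.
    """
    for inside, outside, count in uid_map:
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    return 65534  # unmapped UIDs appear as the overflow UID ("nobody")

# The example from the talk: container UIDs 0-999 live at host UIDs 5000-5999,
# so root inside the container is an unprivileged UID outside it.
uid_map = [(0, 5000, 1000)]
```

So `to_host_uid(0, uid_map)` gives 5000: root inside, nobody special outside. That single table per container is also the management headache: every container needs its own non-overlapping range.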
If a file is world-readable, that means any process, even a non-root one, can read your content. MCS would catch that; anyway, my goal with all of these technologies is that they layer: you can rip certain ones out and put others in, and each has its own potential gaps like this. Okay, chapter four of the book, and I don't know how I'm doing on time; I think I've got two minutes, so I'll race through it. This, to me, is one of the most important sections: how do you furnish the pig's apartment? This is where the content really comes in. If I create my own content by just grabbing random stuff off the internet, well, that's basically the equivalent of picking random stuff off the internet. I like to use this analogy: when I first started on Linux back in 1999, the way everybody found applications was, where did you get your software? You went to yahoo.com or AltaVista and searched for it. Google didn't even exist yet, or maybe it barely existed at a university. You'd find your content: say you wanted to run the latest version of MySQL or PostgreSQL, you'd find it on rpmfind, download it, and install it. The next thing you know, you'd hear about a security vulnerability in zlib. Your security admins would come in and say, how many zlib vulnerabilities do we have? And you'd say: I have no clue. Okay? This is what Docker is doing right now. Nobody cares about security; just go grab random stuff off the internet and install it on your machine. Docker had a very embarrassing thing happen a few weeks ago. A security analysis team went out; I think Docker Hub has some huge number of images available, 40,000 or so. And they found that 30% of the images had security vulnerabilities in them.
I will tell you right now, that has really nothing to do with Docker; it's not Docker's problem. When we look at the future of containers, everybody's heard the expression DevOps? All right, so there are two people involved here: the developers and the operators. Most of the people in this room are developers, I'm pretty sure. Developers, what do we like to do? We like to build something, see it work, be happy that it works, and move on to the next project. Operators are in charge of security. They have to run your piece of crap for the next six, seven, eight years. So what happens is, the developer went up to Docker Hub and said: I've got this really cool piece of software, it looks great, please try it out. And then they disappeared. Now that image sits there and it rots. It slowly rots: Shellshock comes out, GHOST comes out, all these vulnerabilities come out, they're all bundled into those container images, and they're forgotten. Whose job is it to update a container in the new world of DevOps? Is it the operator's job or the developer's job? "The same guy"? Well, they're not going to be the same guy. Okay? It's always been the operator's job to do security fixes; now updating the software in the image is the developer's job. So if you don't have a company, or a distribution, willing to do security fixes standing behind these containers, they're going to rot. Right now, 30% of the containers on Docker Hub are rotting from the core. What's going to happen in your company as you go forward? Once you have thousands of containers sitting out there, who's going to look at those containers and make sure they're up to date? Remember, just because you update the software on your 50 or 5,000 host machines, the containers never get updated unless somebody goes out to the container platform, updates the software, and fixes the security issues. You've got to think about that.
Red Hat Enterprise Linux solved this problem back in 1999 with certified software and supported platforms. If you're going to grab your containers from random people, you're going to have to rely on those random people to update your software. If you get it from the Fedora community, it's probably going to be pretty good. Right? We're pretty proud, we're going to maintain our containers, right? Except we just announced that after 13 months, we're not maintaining them anymore. And again, even in the case of a distribution, the distribution is not providing the application. People are going to take the Fedora base image, bundle software on top of it, and ship it, and then it's up to them. How often do we have updates in Fedora? Seems like just about every day; I'm a bad judge, I have updates constantly, that's why I have a lovely Firefox running right now. But basically, there are updates probably once a week, and at least once a month there are major security updates. How often do those affect the container image? How often will we update the container image? And if I built an application on top of an older container image, how do I know my container image is out of date? This is why, if you're going to roll out to a supported platform, you need something like RHEL to manage it. I've already talked about the DevOps part; I jumped ahead. Okay, I'm not going to show this video right now, but it's a famous one that anybody over the age of 40 will probably be familiar with. If I start it up, you probably won't be able to hear it. This is Ronald Reagan here, okay? Back when I was young, we used to be scared that the Soviets were going to destroy the United States, mutually assured destruction, and this is Gorbachev.
Back around the early 1980s, they signed an agreement to try to eliminate lots and lots of nuclear weapons, and the famous quote that came out of it was "trust, but verify." You can trust your applications, but you need to be able to verify that the security updates are in them. It's a little funny at the end, but since we have no sound in here, we'll skip ahead. What happens when the next Shellshock happens? We've already covered getting security-certified images. We're also building tools right now to examine images. We're using SCAP content. Probably in the next month or so, we're going to release an atomic scan tool, which will let you scan your images and figure out whether or not they're vulnerable. Right now, the only people providing content for scanning images are Red Hat, with RHEL content. If Fedora provided content, the tool should be able to use it, but I don't know if Fedora provides any SCAP-type content at this time. So, this is the last slide. You won't see this picture in the container coloring book; Mo Duffy, who did the artwork, said it was too violent for little kids, but I actually wanted it to be more violent: I wanted the wolf on a spit, rotating. Okay, let's go to questions. Sorry, I tried to race through it, but it still took me a full 45 minutes. Anybody have any questions? I think we have a few minutes before the next session. [Audience: lunch is next.] Oh, we're going to go right through lunch. [Question: the filesystem support, was that for seccomp?] No, that was filesystem support for user namespaces. Right now, user namespaces only change the processes' view: processes look at each other, call getuid, and see root; everything works. The problem is that the filesystems don't support it.
What we want to happen is about the base image. The current idea is to let the base image, which is basically /usr inside your container, be owned by the real UID zero on up, but then have a translation table in the filesystem layer that says zero, in this case, is actually 5000. It gets a little complicated, because we need the kernel to translate it twice: we want the VFS layer to translate it and say UID zero is 5000, and then when the process looks at it, the kernel says, oh, that's 5000, and the process's view says that means zero. So it's kind of kludgy, and I can imagine Al Viro is probably going to hunt me down and kill me, but it's the only reasonable way to get containers to work with user namespaces. And when a container writes its own content, you want it to write out its real UID, 5000. That's why the base image, to me, has to be read-only in that case. One of the big uses for this would be OpenShift. If you wanted a bunch of multi-tenant environments where you have, say, your home directory mounted in: while I'm in the container, I want to write with my own UID. You can tell me it's zero, but it's really my UID. So your home directory wouldn't be translated, but the image would be. You can ask me about user namespaces afterwards. [Question about auditing in containers: is the audit log a shared service, or confined per container?] There's nothing right now for auditing of containers that's different from auditing any other processes, so there's no separation. There's an effort going on to indicate which namespace you're in when you create an audit event; I don't know if it's ever been fully settled which namespace that should be. The idea with auditing: right now, you audit the host.
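The "translate it twice" idea above is just the reverse of the uid_map arithmetic: the image on disk stores real host UIDs (5000 and up), and the VFS hands the process back the in-namespace view. A sketch of that reverse direction, assuming the same (inside, outside, count) entries as before:

```python
def to_container_uid(host_uid, uid_map):
    """The reverse mapping: what a process inside the user namespace sees
    when it stats a file owned by a given host UID. This sketches the
    double translation described above; it is an illustration of the
    arithmetic, not an implementation of the proposed VFS support.
    """
    for inside, outside, count in uid_map:
        if outside <= host_uid < outside + count:
            return inside + (host_uid - outside)
    return 65534  # files owned outside the range show up as "nobody"

uid_map = [(0, 5000, 1000)]
# A file in the shared base image stored as host UID 5000 stats as UID 0
# (root) inside the container; host UID 4999 is outside the range, so it
# appears as the overflow UID.
```

This is also why the base image wants to be read-only in this scheme: the on-disk ownership stays in host terms (5000), while every container sees it as root through its own table.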
Auditing records things like who wrote this file, who read this file. What people basically want to be able to say is: I'm root reading and writing this file, but I'm actually root inside a container, versus real root on the host. So we need some way to identify the namespace. I'll go to someone else; I know you have full access to me all the time, so I'm not going to hurt you. Anybody else? [Question: are those SELinux MCS labels on by default?] They're on everything Red Hat ships. Every Red Hat upstream and downstream has SELinux turned on by default. So Docker, and I haven't tried every Docker package, but yes, it's on everywhere. The nice thing about Docker versus standard SELinux: if you're running, say, an Atomic Host platform or just containers in your environment, SELinux should not get in the way at all, because Docker is taking care of all the labeling for you. The only case where SELinux can be a pain is if you volume-mount into the container, and that's the reason for the :Z qualifier: it takes any directory you mount in and changes it to the label of the container. It actually just walks the filesystem and changes the SELinux labels. So SELinux with sVirt, to me, should be on everywhere. SELinux on containers should be on everywhere. And if you have trouble running other services on the same box, get those services the hell off. Containers should be running on a container platform; virtual machines should be running on a virtual machine platform. Obviously I'm talking about server environments; for a developer environment I don't really care, but in a server environment you should only be running containers on a container platform, or only VMs. Yep. [Question about shared filesystems.] Okay, so right now, for sharing filesystems into containers: NFS is currently supported.
There is a boolean to allow Docker containers to read and write NFS. If you're using labeled NFS, it should all work; if you're not using labeled NFS, it's the boolean that allows your containers to write to the NFS share. But that's only for volumes mounted into the container. If you've used Docker, there's the idea of a volume: really just bind mounts, directories mounted into the container, usually writable. There's no shared back-end support for Docker images at this point. But labeled NFS should work if you set it up. We also have problems with Ceph and Gluster. [Question: what actually guarantees separation between containers, and what about Kubernetes?] Right. The only thing guaranteeing separation between containers right now is the Docker daemon, and obviously the Docker daemon has no idea what's running above it. With Kubernetes, one thing Docker has is that you can override Docker setting the SELinux labels. So if you're wrapping Docker with Kubernetes and orchestrating multiple services from outside, and this is what OpenShift is doing right now, OpenShift goes in and specifies the MCS labels for each of the containers. It sits at a higher level and understands the separation. So it's not that you can't do it; you just need a higher-level orchestration tool. [Question about atomic scan: does it launch the image and then scan it?] No. Inside Project Atomic, on the Atomic Host platform, we've built a new tool called atomic. It's available in regular Fedora, so you can do a yum install atomic. Right now atomic is mostly a wrapper around Docker, but it also adds a lot of functionality.
If you come to my super-privileged containers talk later today, I'll explain all the technology that's in the atomic command. One of the things we're adding to the atomic command is atomic scan. What atomic scan can do is take the content that identifies vulnerabilities, mount up the container images, and examine them. The traditional way, if you wanted to examine a container or an image right now, you'd have to run a process inside the container. So he's asking: am I running a container on top of the image and then examining it? If I wanted to run rpm -qa against a container, I'd have to run the container. (And no, I'm not stopping; I'm going right through lunch. These people all want to lose weight by listening to me drone on. Obviously, droning on hasn't stopped me from eating.) What we've built instead is tooling to mount the image at a directory and examine it without launching it. We're able to use tools from outside the container to look at a sort of chrooted environment. And it uses standard SCAP; the OpenSCAP team at Red Hat is the one designing the scanning tools. Anybody else? Now Paul's told me you guys are all missing a lovely lunch. All right, good. Anybody that didn't get a coloring book, take one; there are two different ones. [Session ends; a hallway question follows.] I have a question.
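A toy version of the idea behind atomic scan, to show why launching the container isn't needed: once the image's filesystem is mounted, you can extract its package list from the outside and compare it against vulnerability data. Real scanning uses OpenSCAP with CVE/OVAL content; the advisory format and the naive string version comparison here are purely illustrative (real tools use proper RPM version comparison):

```python
def scan_packages(installed, advisories):
    """Compare an image's package list against advisory data.

    `installed` maps package name -> version, as extracted from the
    mounted (not running) image. Each advisory names a package and the
    first fixed version. Returns the findings: packages older than the
    fix. NOTE: string comparison of versions is a simplification.
    """
    findings = []
    for adv in advisories:
        ver = installed.get(adv["package"])
        if ver is not None and ver < adv["fixed_in"]:
            findings.append((adv["cve"], adv["package"], ver))
    return findings
```

So a Docker Hub image that bundled a pre-Shellshock bash would be flagged without ever being started, which is the whole point: you can audit thousands of rotting images without executing any of their code.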
[Hallway question:] So I was writing a RUN label, and I needed to use OPT1, OPT2, and OPT3, and it struck me that these aren't the clearest things to have; the names don't convey what OPT1 is supposed to mean. I was thinking it might be useful to introduce a way to pass generic, named flags instead. [Answer:] Right, but you can effectively name them yourself: OPT1, OPT2, and OPT3 are just placeholders in the label's command line, and whatever command options you put there are what get substituted in.