All right, is it okay if we start? Okay. Hi, I'm Scott Moser, and this is Serge Hallyn. We work for Canonical on the Ubuntu Server team. Canonical is the company that funds Ubuntu development and the development of lots of other technologies. This session is "Using containers without risking your ass"; if you're looking for something else, you can step out. So we'll go ahead and get started. Just as a foreword: I'll give a quick introduction to containers, then Serge is going to take over and go more in depth on containers and specifically user namespaces, then I'll give a demo-like walkthrough of user namespaces, and then we'll have some time at the end for questions.

I'm sure most of you are familiar with what a container — the idea of a Linux container — is, or at least have a vague idea. It's been popularized by Docker, Parallels, and LXC. Generally, it is the ability to run multiple Linux-based operating systems on the same system and get isolation and confinement for them. You can think of it the way it's often described: chroot on steroids, or BSD jails, or Solaris zones; one way or another, you're probably familiar with the technology. From inside the container, the system generally looks like a full system, and from outside the container, it generally looks like just a bunch of processes. But there is no single thing that is a Linux container; the kernel doesn't really know what a container is. It's a user-space fiction built on a bunch of different Linux kernel facilities. If I've not confused you enough, I'll hand the floor over to Serge.

Okay. So the different kernel features that you can use to make up containers have come into the kernel over time. The first one, arguably, was the mount namespace, which Al Viro introduced in the year 2000. Before that, there was a single global mount namespace, and any time you mounted something, all processes saw the same thing. With this change, each task had its own mount namespace. Normally, when you clone a task, the new task gets a reference to the parent's namespace, but if you asked for CLONE_NEWNS during a clone operation, it would give you a copy of the mount namespace. There was no unshare system call at the time, and there was no mount propagation, so from that point on, any mount done in the one task would not be seen by the other task, and vice versa. The idea was to do what Plan 9 does, where unprivileged users can all manipulate their own mount namespaces. There were two problems there, resulting from the fact that Linux is different from Plan 9. One is that in Plan 9, everything is a file, so you can start a new task and bind-mount a file into place from another host to forward all your traffic to the other host or to change your display; you can't do that in Linux anyway. The other problem is that in Linux, we have this thing called setuid, and we have root superuser privilege. In Plan 9, you can manipulate your namespaces and do what you want, and then if you want to, say, become a different user, you talk to a factotum service that still has its own clean context, and it does the authorization for you. Whereas in Linux, you run, say, a setuid binary which is going to check /etc/shadow for your password to validate it.
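(To make the danger concrete before Serge continues: below is a rough, hypothetical sketch of the kind of bind-mount trick he's about to describe, written against today's API. The file /tmp/fake-shadow is a made-up attacker-controlled path, and on Linux this requires root — CAP_SYS_ADMIN — precisely because of the problem being discussed; it shows why unprivileged mount namespaces were unsafe.)

```c
/* Hypothetical sketch: why unprivileged mount namespaces were unsafe.
 * Requires root (CAP_SYS_ADMIN) on Linux for exactly this reason. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Get a private copy of the mount table. */
    if (unshare(CLONE_NEWNS) < 0) { perror("unshare"); return 1; }

    /* Keep our mounts from propagating back to the host's namespace. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
        perror("make-private");

    /* Shadow the real /etc/shadow with a file we control.
     * "/tmp/fake-shadow" is a made-up attacker-controlled file. */
    if (mount("/tmp/fake-shadow", "/etc/shadow", NULL, MS_BIND, NULL) < 0) {
        perror("bind-mount");
        return 1;
    }

    /* Any setuid binary exec'd from here would now validate passwords
     * against our file instead of the real /etc/shadow. */
    return 0;
}
```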
Well, if you can change your mount namespace and then bind-mount your own file over /etc/shadow, that's going to completely corrupt the authentication there. That is one thing that is actually solved, eventually, by user namespaces, as we'll see in a bit. So that showed up in the year 2000. The first thing that looked like a container that you could do upstream was in 2006; that was the PID namespace. There were a couple of teams, from OpenVZ and IBM, that all worked together to get container stuff upstream, and they had different reasons for doing so. OpenVZ, of course, wanted containers. The team I was with at IBM wanted checkpoint/restart. All we wanted was this: if you have two tasks that make up an application, and one is waiting on the other, it does that using a process ID. But if you checkpoint it, kill it, and restart it, the process ID may have already been taken by another task. So you want to be able to guarantee that that process ID will be available. We did this with a dirty hack, a virtual PID. I think it was Alan Cox who came back with the suggestion that the mount namespace was actually a very good analogy for what we wanted with process IDs. So in 2006 we quickly pushed the namespace proxy (nsproxy), the UTS namespace, and the PID namespace, and other namespaces followed.

So what is a namespace? When we talk about kernel namespaces: any time you ask the kernel to do something for you, you pass it some identifier that is your handle in user space — for instance a path name, a file descriptor, or a PID — and you say "act on it," and the kernel translates that into some kernel resource that it acts on. When we namespace something, like process IDs, that means that when you create a new task, you can have a clean namespace, or a clone of a namespace, in which your identifiers refer to different objects than they do for another task. You can see that this can be used, to an extent, for isolation: if I create a task with a clean mount namespace and manipulate that namespace so that there is no path leading to the host's /etc/shadow, then that container, in theory, cannot read or write the host's /etc/shadow. So you can try to get some security guarantees out of that. The problem is that Linux is a very baroque thing with lots of ways to do things. Some examples: you may create a container with a fresh mount namespace that doesn't have access to /etc/shadow, but if you don't have a fresh PID namespace too, then you can sit and wait for a host process to log in or change its password, and then you can actually access /etc/shadow through /proc/<pid>/fd/ and the file descriptor number. So namespaces by themselves are necessary, but they're certainly not sufficient for getting security. And there are a lot of ways to do things similar to this: even if you clone the PID namespace, in some cases you can still signal tasks by writing to files that will trigger a kill of a certain task.

The next feature used for containers is control groups, which showed up in 2007. What they do is group tasks together, and then on those groups you can account for resource usage and limit resource usage. And there are some special cgroup types, like the devices cgroup, which lets you say: this group cannot access the device /dev/sdc1, but it can access other devices.
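(As an aside, a minimal sketch of what such a devices-cgroup rule looks like, assuming the cgroup v1 devices controller is mounted at /sys/fs/cgroup/devices and we're running as root; "demo" is a made-up group name, and 8:33 is the usual major:minor pair for /dev/sdc1:)

```c
/* Sketch: deny a group access to /dev/sdc1 via the v1 devices cgroup.
 * Assumes /sys/fs/cgroup/devices is mounted and we are root. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_str(const char *path, const char *s)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) { perror(path); return; }
    if (write(fd, s, strlen(s)) < 0)
        perror(path);
    close(fd);
}

int main(void)
{
    char pid[16];

    /* Create the group ("demo" is hypothetical). */
    mkdir("/sys/fs/cgroup/devices/demo", 0755);

    /* b = block device, major 8 minor 33; deny read/write/mknod. */
    write_str("/sys/fs/cgroup/devices/demo/devices.deny", "b 8:33 rwm");

    /* Move ourselves (and thus our future children) into the group. */
    snprintf(pid, sizeof(pid), "%d", getpid());
    write_str("/sys/fs/cgroup/devices/demo/tasks", pid);
    return 0;
}
```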
So you can see where that can be useful for constraining a container, but again, without a user namespace and without other ways of limiting it, root in the container is root on the host, and if it can remount the cgroup filesystem that you're using to constrain it, it can simply move itself back into a control group where it isn't constrained the way you want.

Another thing that's very commonly used in containers — or was — is the capability bounding set, which was introduced in 2008. What that does is break the root superuser's privileges down into pieces, like the ability to manipulate the network, or the ability to use files that aren't your own as if you owned them. You can take those pieces and say: this task, and all its children from now on, will not be able to use that privilege. That can be useful, but there are a few problems with it. The first is that these capabilities are rather coarse; they're not fine-grained, so anything useful you want to take away from root in the container also takes away other things that would be safe for it to use. Another problem is that root in the container, being root on the host, doesn't need any capabilities to access root-owned files on the host, because it owns the files; if it can get access to a root-owned host file, it can do what it wants with it. And then, due simply to the way the capabilities are broken up, there's a paper here that describes how, with just one or two capabilities, you can get all the other capabilities back. So again, by itself, the bounding set was not sufficient. I'm going to generally skip this slide; it just goes over how LSMs and seccomp can also be used to further the security.

So the crux of the problem is that without a user namespace, root in the container is root on the host, and so there is almost inevitably some way for it to find a foothold back onto the host, where it can then say "I'm root" and get out of the container. For a long time, the answer was: wait for user namespaces, they will solve this. Now, actually, the first user namespace patch was introduced in 2007. But all it did was give you separate accounting for the same user ID in multiple namespaces; it didn't provide any isolation or any security guarantees. If you had a file owned by user ID 1000, then user ID 1000 in any user namespace owned that file. So you couldn't segregate containers that way. We spent a lot of years and did a lot of prototypes trying to find a clean way to actually provide the isolation, and finally — according to my git tree, in December 2012 — the first version of the final design of user namespaces was accepted.

So the design criteria were these. First, the same user ID in multiple namespaces must be completely segregated: a file owned by user ID 500 in one container must not be owned by user ID 500 in another container, and likewise, if one container should find access to a process running as user ID 500 in another container, it shouldn't be able to kill it. They should be completely separate. Second, root in a container must be privileged over the container: it has to be able to kill tasks belonging to other user IDs in the container, it has to be able to configure its network, and all that stuff we normally associate with root on the host. But at the same time, it must not have any privilege outside of the container.
So the goal is that root in a container is basically the same thing as an unprivileged user on the host. That allows us to make containers safe for use by unprivileged users, so that they don't need any root privilege at all to create or run a container. And lastly, we want them to be nestable: if you create a user namespace underneath the initial user namespace, you want to be able to create another user namespace under there which has the same isolation guarantees from its parent namespace as that one had from the initial one.

Okay, so the design in the end is as follows. User IDs in the kernel are now known as kernel UIDs, kuids, and they are distinct from what's in user space. Any time user space makes a request to the kernel, the user ID that user space passes gets translated from its namespace into the kernel's. When you first boot your machine, you have a one-to-one translation for the full range — from zero to -1, the whole 32-bit space — from kuids to UIDs on the host. Then, as you create a new user namespace, you can take ranges from the parent namespace's allocated range and map those onto your own user IDs. If anything shows up, like a file, owned by a user ID that isn't mapped into your namespace, it shows up as being owned by user ID -1 (the overflow UID, typically 65534, "nobody"), and your access to it is as if you were just user "other": if it has world-read permissions, you'll be allowed to read it; otherwise you won't.

Now, an unprivileged user can create a new user namespace, but by default he can only take a user ID in that new namespace and map it to his own user ID on the host. If I'm user ID 1000, I can either leave nothing mapped in, or I can map, say, user ID zero in the container to user ID 1000 on the host. Then when I look at a file's ownership, it's owned by user ID 1000 on the host but user ID zero in my container. Of course, to create containers, we want to be able to map more than one user ID, and the next slide will show how that's done.

The next thing is that all other namespace types, like the PID namespace and the mount namespace, are owned by a user namespace. So when you first clone your new user namespace, your network namespace hasn't changed; it is still owned by the initial user namespace, so you can't configure networking. But after you've cloned your user namespace, you can clone a new network namespace, which is empty, and now you can configure that, because your user namespace owns it. That's also how we solve the mount namespace problem, because anything you now have privilege over is something that you owned anyway as user ID 1000 on the host.

The last thing is delegation. Like I say, user ID 1000 by default can only map user ID 1000 into a new namespace. So the administrator on the host can delegate sub-UIDs — just a portion of the host's UIDs — and say: this user may use these sub-UIDs. By keeping those segregated, he can keep containers from affecting each other or affecting the host. The allocation is done through the shadow package, the shadow tree; /etc/subuid and /etc/subgid are the files where the allocations are stored, and you can just use the usermod program to grant or revoke sub-UID allocations; they get written to those two files. Now, as an unprivileged user, you still can't take user ID 100,000, for instance, and map it into your new user namespace yourself.
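(Here's a minimal sketch of that default, single-UID case — no helpers and no privilege, just unshare(2) and writes to the proc files. The setgroups write is required on kernels from 3.19 on before an unprivileged gid_map write:)

```c
/* Map UID 0 in a new user namespace to our own UID, unprivileged. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void write_file(const char *path, const char *buf)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) { perror(path); return; }
    if (write(fd, buf, strlen(buf)) < 0)
        perror(path);
    close(fd);
}

int main(void)
{
    uid_t uid = getuid();   /* our UID in the parent namespace */
    gid_t gid = getgid();
    char buf[64];

    if (unshare(CLONE_NEWUSER) < 0) { perror("unshare"); return 1; }

    /* Each map line is: inside-ID  outside-ID  count.
     * Map container UID 0 to our single host UID. */
    snprintf(buf, sizeof(buf), "0 %u 1\n", (unsigned)uid);
    write_file("/proc/self/uid_map", buf);

    /* Kernels >= 3.19 require this before an unprivileged gid_map write. */
    write_file("/proc/self/setgroups", "deny");
    snprintf(buf, sizeof(buf), "0 %u 1\n", (unsigned)gid);
    write_file("/proc/self/gid_map", buf);

    printf("uid in new namespace: %u\n", (unsigned)getuid());  /* prints 0 */
    return 0;
}
```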
For the delegated ranges, you use two setuid-root helper programs called newuidmap and newgidmap, which will look at the allocations and say, yes, he's allowed to use 100,000, and then do the write to the proc file for you, mapping those user IDs into your namespace. That way, each unprivileged user can be delegated sets of sub-UIDs and sub-GIDs to use in their containers. They're kept as sanely separate as possible by usermod: when you create a new user, it will automatically find an unused range to assign. And that's it for that.

Okay. So I'll take over and do somewhat of a demo. These are just canned screenshots I've made, but I actually walked through all of this on an Ubuntu 14.10 instance, running the 3.16 kernel. The general setup: the host in the next slides is named lxc-host, and there are two users on the system, Elsa and Anna, each configured to run unprivileged user-namespace containers. There are two programs I use in the slides. One is called mywait; it does nothing but touch a file, print its PID and UID, and sleep, so you can identify who it thinks it is. And showinfo is just a front end to ps and a few other things so you can see things more clearly.

So this is the host. Among the things highlighted here, you can see /sbin/init is owned by UID and GID zero — root — and is PID 1. You can also see, in that section up there, the highlighted "user NS" values; that's the inode number of the user namespace, indicating that those four processes are in the host's user namespace, and the rest of them are not. Here you can see Elsa is running mywait with PID 13109, and Anna is running mywait with PID 1182. Okay, I'll go to the next slide.

So this is how we set up LXC on the system. As Serge said, there are /etc/subuid and /etc/subgid, and those files are configured to give Anna and Elsa a million UIDs and GIDs each. Generally, most Linuxes come with the user "nobody", which has a UID and GID of 65534. If you don't map that, you somehow have to account for it, or your nobody user isn't going to work. So you can either map a smaller range and also map 65534 separately to a different host ID when you create a container, or you can just deal with it and give each container 65,536 IDs. The trade-off of doing that is that a UID is a 32-bit integer, so if you hand them out 65,536 at a time, you only have 65,536 ranges to hand out; it's a limited space.

So, back up here: the config for Anna's c1 is configured, as it shows up there, with a user ID range starting at 2,000,000 and running for the next 65,536. And down here, Elsa and Anna are each running the mywait program, and their PIDs are shown. Let's see — this slide is focused on Anna's first container. Anna is running two containers, and one is c1, and its /sbin/init believes, inside the container, that its PID is... let's see, I've lost my place. Sorry.
So Anna's mywait program is PID 592 inside the container, but from outside the container, the host sees it as PID 12177, and the user ownership you see there is 2,000,000 on each. The init PID — init thinks it's PID 1, as usual — is PID 6900 when viewed from the outside. And up at the top, showinfo is showing the file that mywait touched: from the host, the file's permissions are seen as UID 2,000,000 and GID 2,000,000.

This is Anna's second container. The only thing to point out here is that if you compare its init to her c1 init, they have different UIDs, because it was mapped to a different range: her second container's range starts at 2,100,000, and the first one started at 2,000,000. So you can see the files are owned differently than the first container's, and the processes are running as a different user as well.

And this is Elsa's c1. The difference here is that instead of running mywait as root, I changed user inside the container to the ubuntu user, which is UID 1000 inside there, and ran mywait as that. So inside, it thinks it's UID 1000 and GID 1000, but if we look at the process from outside, you see that's 3,001,000 — 3,000,000 plus 1,000. You can also notice that that user is different from root inside the container: the init process is running as 3,000,000, and the ubuntu process is running as 3,001,000. And the "user NS" inode identifies them, and there's the range that was given. This is just Elsa's second container, showing her processes in the same way.

Generally, the thing to notice is that from the outside, the user IDs inside the container are not zero. Inside, I believe I'm UID zero, but outside, the kernel actually sees me as, say, 3,100,000. And the key thing is that if you break out of that container — if somehow Elsa, or this process, escaped from the container — it's just UID 3,100,000 on the host. So even if it were to get out and have access to the host filesystem, it can only read files that are world-readable. And even outside, it's no closer to becoming Elsa, at UID 1002, than it is to becoming Anna or reading Anna's files. So the user namespace alone provides you with pretty much the level of security that you're used to on a multi-user system, right? We're generally accepting of running multi-user systems, where each user is not privileged to see other users' files, and we trust the kernel not to allow me to kill your processes or read your files, and vice versa. This is exactly that: each of these containers is no more privileged than a user on a multi-user system.
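(You can see this double bookkeeping for yourself: every process's mapping is visible from the host in /proc/<pid>/uid_map. A tiny reader, taking a PID as its argument, might look like this:)

```c
/* Print a process's UID mapping, e.g. that of a container's init.
 * Usage: ./uidmap <pid>   (defaults to "self") */
#include <stdio.h>

int main(int argc, char **argv)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/uid_map",
             argc > 1 ? argv[1] : "self");

    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return 1; }

    char line[256];
    /* Columns: first UID inside the namespace, first UID in the parent
     * namespace, number of UIDs mapped — e.g. "0 3000000 65536". */
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}
```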
So inside that container, you still have a local user on the host, right? And the Linux kernel has plenty of CVEs; they come out, there are known exploits against the Linux kernel, and your distribution does its best to patch those and get you kernel updates as quickly as it can. But in 2014 there were 101 kernel CVEs, and 14 of them were, or could be used for, local privilege escalation — not all with exploits shown, but still. So if you're running containers and you have untrusted users in those containers, you definitely need to be aware of CVEs, address them, and take your kernel updates. And a lot of times you'll see comparisons between the security of a container and of KVM: you need to be aware of CVEs at all points in time and keep your kernel patched in both situations, and I'd point out that some of these exploits would have affected KVM as well. So there is no silver bullet, per se, for security; be aware of your CVEs and apply fixes.

So here I tried to put together how you can use user namespaces and how you can best run Linux containers. A 3.8 kernel is the minimum, but realistically I think 3.10 is what you should call the minimum to be running user-namespace containers. If you're using the current version of any of the distros listed there — in Ubuntu's case, 14.04 LTS — you should have support in your kernel for user namespaces, but you won't have it in current-minus-one. I mean, Red Hat has it in 7 but not in 6, to my knowledge, and the same is true of 12.04 versus 14.04 on Ubuntu. And then the tools that can take advantage of this: if you've got LXC 1.0 — and there are some improvements in 1.1 — you can use user namespaces there. Libvirt has support for running user namespaces, and in Juno the Nova libvirt-lxc driver got the ability to use user namespaces; you can look at the documentation for that. Nova-compute-flex is a project we've been working on at Canonical to take advantage of LXC more directly than through libvirt, and by default we're doing that with user namespaces, so each container runs in a user namespace rather than as a privileged user. And I understand this is on Parallels' roadmap for early 2015 — okay, I got a nod that that's correct.

Okay, so this is generally what we have. The slides are available there at that URL; you're welcome to email myself or Serge Hallyn, or ping us on IRC — the handles are up there. There are also several articles listed there, three that are really good reads if you want to learn more about user namespaces or about containers in general. Actually, in general, the LWN articles there are fabulous; LWN does a wonderful job of writing up kernel information. That's what I have; we have some time now, so I'll take questions if there are any.

[Question inaudible.] Okay. Yeah, those are used by the setuid-root programs, newuidmap and newgidmap. And the way you actually create and populate a user namespace: you do a clone or an unshare of the user namespace, and then /proc/self/uid_map and /proc/self/gid_map are the files that you write to.
And you write — I forget the exact order, because I don't usually write it as a file by hand — I think it's the namespace's first UID, then the parent namespace's first UID, and then the number of UIDs you want to map. Root can do that with any values that are valid in its namespace; you can only do it with your own user ID. So newuidmap and newgidmap do it on your behalf, subject to the /etc/subuid and /etc/subgid files.

[Question inaudible.] Well, all other namespaces are owned by a user namespace, yeah. And in the kernel code itself, what used to be just a call to capable(CAP_NET_ADMIN) was changed to ns_capable(net->user_ns, CAP_NET_ADMIN), so it checks for CAP_NET_ADMIN relative to the user namespace that owns the network namespace the device belongs to. Right. And so in container managers, what you have to do is provide some way for admins to delegate. In LXC, there's a file, /etc/lxc/lxc-usernet, which says this user can create so many network interfaces on this bridge. The admin has to carefully decide which bridge that's safe on; you don't want to do that on a bridge with eth0, because then they can read your traffic.

Yes? [Question inaudible.] You can, yeah. Root on the host would have to pass it in, but you can pass the device in and then use it; it'll be owned by your namespace then. So you can do that.

Yes? [Question about the UID allocations.] I set those up manually — the 2,000,000 and 2,100,000 ranges. On Ubuntu, adduser automatically sets up 65,536 of them per user — I think, is that right? Yeah. So there is a default, but since I wanted to do two different containers and not deal with cutting up the ID map, I just gave them a million instead of 65,536. But the point is, you're not having to manually manage that yourself and keep track of the ID spaces. It depends on how flexible you need to be: you can use the defaults, but he manually allocated them because he wanted more than the default allocations. Right; and in nova-compute-flex, the idea was just to give the user that's running nova-compute-flex that entire range, and then it cuts containers into it; it has basically the whole range and manages the carve-ups itself. Not to my knowledge. Question? Do you want to answer that? But yeah, essentially, with containers and user namespaces: if you would give someone unprivileged shell access, then you can give them unprivileged container access. It's that level of security.

Yes. [Question about resource limits.] Yeah, that's through cgroups; through cgroups you can set memory limits. The Nova libvirt driver does this, and so does nova-compute-flex: your instance type says two gigabytes of memory, and they get two gigabytes of memory.

[Question about LSMs.] LXC uses AppArmor — you can use both; both are integrated. In Ubuntu, AppArmor is installed by default, so we're integrated with that, with policies, in Ubuntu. Oracle ships some SELinux policy and is working on further refining that. So both work, and policies are available for both. SMACK support is not there yet, but one day it might come.

He did, he did. No, I didn't — I thought the c1 container was nested. She could have created as many containers as she wanted underneath there. Yep, and then from outside, you'd probably be able to figure out the topology.
There aren't a lot of good tools that will just display that for you. Actually, I was impressed: ps is able to show the namespace inode number, and that was very useful. And it can do that for all of the kernel namespace types, so that's nice. Thank you. Okay, well, thank you. Thank you.