Hi, my name is Liz Rice. I run the open source engineering team at Aqua Security, where we build tools to help enterprises secure their cloud native deployments, and I'm also chair of the Technical Oversight Committee for the CNCF. I've been interested in containers and container security for quite a long time now, and I often find myself talking about the dangers of running containers as root. So I'm really interested in rootless containers, which I'm going to explain to you today, as they bring a lot of security benefits because we will no longer be running containers as root.

In the past, I've done a talk where I've shown what a container is by building one in about 60 lines of Go code. I'm going to do something pretty similar today, but this time focusing on how to make a container rootless. I'm not going to assume that you've seen that other talk, but I won't have time today to go into all the details that I have covered in other talks. So I'll share a link at the end in case you want to explore some of those details that I might skip over a bit today.

So I am going to be coding a container today, and it's going to be rootless. Let's start by exploring the motivation behind rootless containers. By default, generally speaking, containers run as root. If you've been running with Docker or Kubernetes, unless you specify a user, you're going to run as root, and not just root inside the container: it's the same root as on the host. So let me show what I mean here. I have two windows onto the same virtual machine. Let's run an Alpine container. If I find out what ID I'm running under, it tells me I'm root: I have the ID zero and the name root. But you might be thinking, that's inside the container, so what does that matter? Well, let's take a look from the host's perspective. I'm going to run a long sleep process in the container and then look for that sleep on the host.
There are a couple of sleep processes running, but here's the one that I just started with 1,000 seconds. As you can see, it is running as root from the host's perspective, not just from the container's perspective. They are one and the same thing. We could override that by specifying a user, either with the --user parameter when we invoke Docker or by putting a USER instruction in the Dockerfile when we build the image. But by default, we're going to be root. Incidentally, this isn't the case for all container runtimes. For example, if you're using Red Hat's Podman, it doesn't run as root by default.

Now, you might also have noticed that I invoked the Docker command here as a non-root user. I'm vagrant here, so let's just exit out of this container and check my ID. Indeed, vagrant has the user ID 1000. So I'm not a root user, but I'm able to run the Docker command. If you check out the Docker documentation, you'll find a warning about this. If you're added to the docker group so that you can run Docker commands, you can run containers, and by default they run as root, the same root as on the host. So by being a member of the docker group, you effectively have root privileges.

This is a real problem in a multi-tenancy environment. For example, imagine a university scenario where you have a shared machine and lots of students who can SSH into that same machine. The administrator for that machine is not going to let those students run Docker, because they could very easily get root access to the whole machine. This is one of the big motivations behind rootless containers: being able to run containers without having to be root. So the first aspect of rootless containers is this ability to start a container as a non-root user. As you can see here, Docker supports this in rootless mode, mentioned here as an experimental feature.
There is a second aspect of rootless containers, which is that we can fool the process inside the container: it looks like root from the container's perspective, but it's not the same root as on the host. But before we can show what rootless containers are and how we create one, let's talk a little bit about one of the main mechanisms that we use to create what we know as a container, and that is the kernel feature called namespaces. A namespace limits what a process can see. For example, take the process ID namespace. If we give a container its own process ID namespace, it can only see its own processes and their IDs, and it can't see the processes or the IDs from the rest of the machine. Similarly, you can have the Unix Time-Sharing System (UTS) namespace. It sounds very fancy. In practice, the interesting thing that you can do with the UTS namespace is change the hostname, so we can give a container its own hostname that is independent of the hostname of the host. There are several types of these namespaces. One of them is the user namespace, and this is what we're gonna use to create rootless containers. But let's start by creating something that looks a little bit like a container by using the UTS namespace.

Right, so here's the beginning of my Go program, and it just has a little convenience function for panicking in the event of an error. I'm gonna write a main function. When I invoke this, I'm gonna do it by running go run main.go, which builds and then runs the code that I have in this file. That is roughly equivalent to running the Docker executable. If we were running Docker, we'd do something like docker run, an image and a command, maybe some arguments to that command, and I'm gonna do something very similar in my executable here. I am gonna have a run command.
I'm not gonna specify the image, but I am going to let the user pass in the command they want to execute and some arbitrary list of arguments. So the first thing I'm gonna do is look at argument one, which we hope is going to be the keyword run. If it is, we're going to call a function called run, and if it's anything else, I am going to collapse in a big heap: I'm confused, as I expected to get run. I need a run function, and for now let's just print out what we've got so far. We want this to run the command, and possibly arguments, specified by arguments two and onwards. While we're here, let's also log the user ID and the process ID, which we can get from Geteuid and Getpid.

Okay, so I'm really not doing anything yet with this program, but let's just run it to make sure it builds. So go run main.go, we need to specify run, and we'd put in some commands that we want to run. I'll put in a newline to make things easier to read, but we can see it tells us what it's been invoked with, in this case the "some commands" that I typed in, and it tells us that it's running as user ID 1000, which we know is vagrant, and it's got some high-numbered process ID.

Okay, so let's make it actually run whatever arbitrary command we've been passed. We can do this by setting up a command structure. The first argument to exec.Command is the name of the executable we want to run, which is in argument two, and then we pass in any subsequent arguments that may or may not be there. We also need to connect up stdin, stdout and stderr. So this gives us this cmd structure, which describes the command that we want to execute in a new process. I'm gonna use my little must function to make sure we catch any errors. We call the Run method on this command, and that will create a process to run whatever command we passed in. I spelled args wrong. Okay, let's try it.
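For readers following along without the screen, the program at this point might look roughly like the sketch below. It is a reconstruction from the description, not the code shown in the talk: the helper name newCommand, the usage message and the argument-count guard are assumptions added so that the example is safe to run with no arguments.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// must panics on any error; the "little convenience function" from the talk.
func must(err error) {
	if err != nil {
		panic(err)
	}
}

// newCommand (a hypothetical helper name) builds the command described by
// args (the executable followed by its arguments) and connects it to this
// process's stdin, stdout and stderr.
func newCommand(args []string) *exec.Cmd {
	cmd := exec.Command(args[0], args[1:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd
}

// run logs what it was asked to do, then executes the arbitrary command
// in a new child process.
func run(args []string) {
	fmt.Printf("Running %v as UID %d, PID %d\n", args, os.Geteuid(), os.Getpid())
	must(newCommand(args).Run())
}

func main() {
	// Assumption: print a usage line instead of crashing when arguments are missing.
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: main run <command> [args...]")
		return
	}
	switch os.Args[1] {
	case "run":
		run(os.Args[2:])
	default:
		panic("expected 'run'")
	}
}
```

Invoked as go run main.go run echo hello, this should print the debugging line followed by the output of the command. Nothing here is containerized yet.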
"Some commands" isn't legitimate Linux, so let's do something that is, like echo hello. We see our debugging line that tells us what it is we're gonna run: echo hello. Again, we are the vagrant user with ID 1000, and we actually see the command being run: we see hello being echoed to the screen. So I haven't done anything very special here. It's not containerized in any form yet, but I've got something that can execute arbitrary commands.

The next step is to start creating namespaces for the child process in which this executable runs, and we can do this with the SysProcAttr field, which is a syscall.SysProcAttr. We say, here are some flags that we want you to use when you clone the new process, and we want you to create a new namespace. We said we'd do the UTS namespace, which is the one that allows you to change the hostname independently of the host's hostname. All right, so let's try running this, and it is not allowed. I get "operation not permitted", and that's because I am not allowed to create the UTS namespace without the privileges to do so. If I use sudo to make myself root, it works fine. So as user zero, I have been able to execute this echo hello command. I could not just echo something, I could run a shell. This isn't gonna return until I quit out of the shell. So now I'm running a shell as the root user. Let's confirm the ID.

And now let's confirm whether or not that UTS namespace has worked. If I run hostname here, it's host. We'll just confirm that that is true from the host's perspective. Now inside the container, I'm gonna change the hostname to container and check that that's stuck. So inside the container, the hostname is container now, but from the host's perspective, it's still host. This is the first step down the road to containerization. I've been able to change the hostname inside the container without affecting the host, through the magic of namespaces. But I could only do that as the root user.
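The UTS namespace step could be sketched like this. It is a minimal, Linux-only example rather than the exact code from the talk; the helper name utsAttr and the error handling in main are assumptions. Run as an ordinary user, it is expected to fail with "operation not permitted", as described above; run with sudo, the child gets its own hostname.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// utsAttr (a hypothetical helper name) asks the kernel to clone the child
// into a new UTS namespace, so that hostname changes made by the child
// stay inside that namespace. Linux only.
func utsAttr() *syscall.SysProcAttr {
	return &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS,
	}
}

func main() {
	if len(os.Args) < 3 || os.Args[1] != "run" {
		fmt.Fprintln(os.Stderr, "usage: main run <command> [args...]")
		return
	}
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = utsAttr()
	if err := cmd.Run(); err != nil {
		// Creating a UTS namespace on its own needs privileges, so an
		// unprivileged run normally ends up here.
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```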
When I tried to run this as a non-root user, it wouldn't let me. This is where user namespaces are gonna help us out. If we look at the man page for user namespaces, we can see that unprivileged processes are allowed to create them. So I don't have to be root to create a user namespace. A child process created in that new user namespace gets the full set of capabilities, so it can effectively act as root in this new user namespace. So I can start as an unprivileged user, create the user namespace, create a child process in that user namespace and have full privileges inside it. If I want to create some other namespaces as well, the user namespace gets created first, and that means the process can have privileges in all of the namespaces that get created. In other words, where my program failed because I didn't have privileges to create the UTS namespace, it's gonna create the new user namespace first if I add that into the clone flags, and having created the user namespace, it's now allowed to create any other namespaces that I specify.

So let's try it. We add in that we want to create not just the UTS namespace but also a user namespace. We quit out of the running container and we run it. I'll run it as a non-root user, and it works fine. Let's run a shell as a non-root user. I'm still running as user 1000. Let's see if I can change the hostname: hostname container. It says I must be root to change the hostname, but I thought I had privileges. Well, let's check the ID. I am nobody inside this container, and the reason for this is the ID mapping that goes on when you create a user namespace. I have a little picture to illustrate this. When we create a new user namespace, we can also map user IDs between the host and the new namespace. So if I want to be root inside the new namespace, I need to set up a mapping from some user on the host. Let's say we use 1000, the vagrant user that I've been using.
The mapping covers a set of user IDs starting at that point, of length size. SysProcAttr actually gives us a convenient field for setting up these UID mappings. I think it's actually a list of, let's check, yeah, a list of syscall.SysProcIDMap. We get ContainerID: inside the container, I want to be root, which is zero. And HostID: I want to map zero to 1000. I only need to map one user ID, so we'll set a Size of one. What have I done wrong here? Maybe I'm missing a comma. There we go. There is an equivalent group ID mapping, which I may as well also set up while we're doing this, because it's just copy, paste, and change one character.

Right, so now let's see who I am. Having set up this mapping, 1000 on the host became zero in the container, and I'm now running as root. So let's see if I can change the hostname. It didn't complain. I've changed the hostname inside the container, but I've not affected it on the host. So as an unprivileged user, because I started as user 1000, I've been able to create the beginnings of a container that has its own isolated hostname and appears to be running as root.

That's okay as far as it goes, but it's not really a full container, and it's missing a lot of the characteristics of a container. First of all, right now I can see everything in the host's file system. And secondly, I can see all of the processes running on the host. Neither of those things is true in a real container, right? So let's add in some more namespaces so that we can make this look a bit more like a real container. I'm gonna add in CLONE_NEWNS, which is the mount namespace, and we also want a new process ID namespace. Okay, that looks easy enough. Let's try it. So what if I run ps? It looks exactly the same as it did before. I can still see all the processes from the host.
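Putting the user namespace, the ID mappings and the extra namespaces together, the process attributes at this stage might look like the sketch below (Linux only). The helper name rootlessAttr is an assumption, and so is mapping whichever user runs the program via os.Getuid and os.Getgid; the talk maps the vagrant user, ID 1000, directly.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// rootlessAttr (a hypothetical helper name) requests new user, UTS, mount
// and PID namespaces, and maps a single host user and group to ID 0 (root)
// inside the new user namespace.
func rootlessAttr(hostUID, hostGID int) *syscall.SysProcAttr {
	return &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUSER | syscall.CLONE_NEWUTS |
			syscall.CLONE_NEWNS | syscall.CLONE_NEWPID,
		UidMappings: []syscall.SysProcIDMap{
			{ContainerID: 0, HostID: hostUID, Size: 1},
		},
		GidMappings: []syscall.SysProcIDMap{
			{ContainerID: 0, HostID: hostGID, Size: 1},
		},
	}
}

func main() {
	if len(os.Args) < 3 || os.Args[1] != "run" {
		fmt.Fprintln(os.Stderr, "usage: main run <command> [args...]")
		return
	}
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	// Assumption: map the invoking user rather than a hard-coded 1000.
	cmd.SysProcAttr = rootlessAttr(os.Getuid(), os.Getgid())
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```

With this in place, running a shell as an unprivileged user and typing id should report root inside the new namespace, although ps still shows the host's processes for the reason explained next.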
The reason for this is that when I run ps to look at processes, it's actually looking in a pseudo directory, /proc. If I look at /proc, there is a directory in here for every currently running process on the machine. In fact, that's where the user ID mapping lives as well, so let's find that sleep process like we did before. Okay, and if we look inside the /proc directory, we find the user ID, sorry, process ID for one of the processes inside the container. I'm gonna use 39995. And if we look at its uid_map, there's our mapping: container ID zero maps to host ID 1000, and it's of length one. Actually, when you create the user namespace, you get one chance to write user ID mappings into this file, and that's done for us because we set up the mapping here. It's done by the underlying Go code.

Right, so /proc holds all sorts of interesting information about the running process. Let's have a look at some of the other interesting things in there, just for fun. We can see, for example, that the executable is sleep, and we can see the current working directory for that process. It's from this /proc directory that ps gets information about running processes. If I want to run ps inside my container and have it look only at processes inside the container, it's gonna need a /proc for itself that just has information about the containerized processes. At the moment, it's looking at /proc on the host, and that's why it's seeing everything.

So, in order to give it its own /proc, we're gonna give it its own root file system, and we're gonna change the root of the container. This is also gonna allow us to make this look more like a container by making it run some different code. Right now, it has access to everything on my Ubuntu virtual machine. I happen to have on this machine a copy of an Alpine file system.
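As an aside on the uid_map file mentioned above, a small sketch like the following could be used to inspect the mapping from Go instead of with cat. The type idMap and the function parseIDMap are hypothetical names made up for this illustration. The program reads the mapping of its own process, so run outside any container it would typically show the host's identity mapping rather than the 0 to 1000 mapping from the demo.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// idMap is one line of /proc/<pid>/uid_map: a range of IDs inside the
// namespace, the host ID the range starts at, and the length of the range.
type idMap struct {
	ContainerID, HostID, Size int64
}

// parseIDMap (a hypothetical helper) parses the whitespace-separated
// contents of a uid_map or gid_map file, skipping blank lines.
func parseIDMap(content string) ([]idMap, error) {
	var maps []idMap
	for _, line := range strings.Split(content, "\n") {
		if strings.TrimSpace(line) == "" {
			continue
		}
		var m idMap
		if _, err := fmt.Sscan(line, &m.ContainerID, &m.HostID, &m.Size); err != nil {
			return nil, fmt.Errorf("parsing %q: %w", line, err)
		}
		maps = append(maps, m)
	}
	return maps, nil
}

func main() {
	data, err := os.ReadFile("/proc/self/uid_map")
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not read uid_map:", err)
		return
	}
	maps, err := parseIDMap(string(data))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, m := range maps {
		fmt.Printf("container ID %d -> host ID %d (length %d)\n",
			m.ContainerID, m.HostID, m.Size)
	}
}
```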
I'm gonna change the root inside my container to point to this Alpine file system, so that it can only see files from inside this set of directories. Now, in order to do this, I want to call chroot. You might think, well, I could call it before I call this Run method here. But that would change the root for the running process, and we want it to only take effect for the child process. There are a few different ways that we could go about this. I'm going to do it with a little trick of having this run function create a child process, which is then gonna go on and create the process to run our executable. This will make more sense as I do it, I promise.

So I'm gonna have a child function. This child function doesn't need to create any namespaces or do any ID mappings or anything like that. It's going to execute our arbitrary command, but before it does that, I'm going to change the root to the Alpine file system directory under /home/vagrant, so that it's looking at an Alpine file system. When I drive my editor correctly, there we go. When we change the root directory, it actually leaves the current directory in an undefined state, so it's a good idea to explicitly change directory. We'll change to the root. Okay, so that's my child function taken care of.

Now, I'm gonna change run so that instead of executing the arbitrary command, it calls back into this same program again, which it can do by using /proc/self/exe, which represents the currently running executable. So I'm gonna have this executable create a child process to call itself again, but instead of having run as the keyword, it's gonna pass in child, so that it knows it's the second time around. I just need to make this into a list of strings, to which I append whatever the arbitrary command is and any arguments. And finally, up here, I need to say that if we get invoked with the child keyword, we call child.
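The re-exec trick is easier to see in isolation. The sketch below deliberately leaves out the namespaces and the chroot call so that it runs on any Linux machine without special setup; it only shows the program calling itself through /proc/self/exe with a different keyword. The helper name selfCommand and the log messages are assumptions for this illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func must(err error) {
	if err != nil {
		panic(err)
	}
}

// selfCommand (a hypothetical helper) builds a command that re-runs the
// currently running executable with the given keyword followed by args.
func selfCommand(keyword string, args []string) *exec.Cmd {
	cmd := exec.Command("/proc/self/exe", append([]string{keyword}, args...)...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	return cmd
}

// run is the first pass: in the talk this is where the namespaces and ID
// mappings are set up on the command before it is started.
func run(args []string) {
	fmt.Printf("run:   PID %d re-executing itself as child\n", os.Getpid())
	must(selfCommand("child", args).Run())
}

// child is the second pass: in the talk this is where the root directory
// is changed before the arbitrary command is executed.
func child(args []string) {
	fmt.Printf("child: PID %d running %v\n", os.Getpid(), args)
	cmd := exec.Command(args[0], args[1:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	must(cmd.Run())
}

func main() {
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: main run <command> [args...]")
		return
	}
	switch os.Args[1] {
	case "run":
		run(os.Args[2:])
	case "child":
		child(os.Args[2:])
	default:
		panic("expected 'run' or 'child'")
	}
}
```

The point of the two passes is that anything done in child, such as changing the root, affects only the already-namespaced child process and not the process the user originally started.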
So when we first invoke this command, we invoke it with run, and it sets up all the namespaces and creates a child process in which it calls itself, but with child as the keyword. So the child process, instead of running run, runs child, and that's going to change the root directory and then execute our arbitrary command. Let's see how that goes. All right, so far so good. We are running as the root user, in process ID one. You'll notice this is starting from one: we're in our own process ID namespace, so process ID numbering has started again from scratch.

Let's have a look at where we're running commands from. For example, ls is coming from /bin. Let's have a look at that, and it's actually a symbolic link, or a link rather, to BusyBox. If you're familiar with Alpine, you'll know that many of the commands in /bin are actually links to BusyBox, which contains them all inside one executable. Okay, so if I look at the root directory, we'll find that it matches what's inside the Alpine file system directory under /home/vagrant. We have successfully changed the root for this container, so it can only see what's inside this subdirectory. This is effectively like the image file system: an image unpacked onto the host file system, and then the container can continue to look at that subdirectory as if it's the whole file system.

All right, there's one last thing we'd like to see working, and that's the ps command inside this container. It's not quite working yet. The one thing we need to do to get it working is to tell the Linux kernel to treat /proc inside this container as the special kind of proc file system into which it writes all that process information. We do that by mounting it as a proc type file system. So, mount proc, proc, proc, there we go. And we will tidy up at the end by unmounting this proc file system. Okay, right. We're running as root inside a container that looks like the Alpine file system.
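Assembled from the steps described so far, the whole program might look something like the sketch below (Linux only). Treat it as an approximation rather than the code from the talk: the rootfs path is a hypothetical location for an unpacked Alpine file system and has to exist with a proc directory inside it, the helper names containerAttr and command are made up for this example, and mapping os.Getuid and os.Getgid instead of a fixed 1000 is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// rootfs is a HYPOTHETICAL path to an unpacked Alpine file system.
const rootfs = "/home/vagrant/alpinefs"

func must(err error) {
	if err != nil {
		panic(err)
	}
}

// containerAttr requests user, UTS, mount and PID namespaces and maps one
// host user and group to root (ID 0) inside the user namespace.
func containerAttr(uid, gid int) *syscall.SysProcAttr {
	return &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUSER | syscall.CLONE_NEWUTS |
			syscall.CLONE_NEWNS | syscall.CLONE_NEWPID,
		UidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: uid, Size: 1}},
		GidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: gid, Size: 1}},
	}
}

// command builds an exec.Cmd connected to our stdin, stdout and stderr.
func command(name string, args ...string) *exec.Cmd {
	cmd := exec.Command(name, args...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	return cmd
}

// run re-executes this program as "child" inside the new namespaces.
func run(args []string) {
	cmd := command("/proc/self/exe", append([]string{"child"}, args...)...)
	cmd.SysProcAttr = containerAttr(os.Getuid(), os.Getgid())
	must(cmd.Run())
}

// child changes root, mounts a fresh proc file system, runs the command,
// and unmounts proc again when the command has finished.
func child(args []string) {
	must(syscall.Chroot(rootfs))
	must(os.Chdir("/"))
	must(syscall.Mount("proc", "proc", "proc", 0, ""))
	defer func() { must(syscall.Unmount("proc", 0)) }()
	must(command(args[0], args[1:]...).Run())
}

func main() {
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: main run <command> [args...]")
		return
	}
	switch os.Args[1] {
	case "run":
		run(os.Args[2:])
	case "child":
		child(os.Args[2:])
	default:
		panic("expected 'run' or 'child'")
	}
}
```

Started as an ordinary user with something like go run main.go run /bin/sh, the shell should see itself as root, and ps should list only the processes inside the new PID namespace.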
And if we run ps inside this container, we only see the processes that we should see, the ones running inside this container. Not only that, we've been able to run this rootless, because we've been able to do this as an unprivileged user. We're doing this as user 1000. So that's how we create a rootless container.

You might be wondering when you're going to be able to see that in a containerized system near you, and the good news is it's coming very, very soon. You already saw that in Docker, it's been available in experimental rootless mode. At the time I'm recording this, the October 2020 release of Docker Engine isn't yet released, but the release notes have been written, and they tell us that rootless mode is expected to graduate from experimental. So by the time you see this, it's probably already available. In more good news, there is an enhancement proposal that's being worked on actively at the moment to support user namespaces inside Kubernetes. So I think signs are looking good that we'll be able to use rootless containers in our own systems very soon.

If you want to know more about rootless containers, the rootless containers repository on GitHub has lots of great resources and code. Shout out to Akihiro Suda, who has done a lot of work on these rootless containers. I also mentioned the talk where I dive into some of the other details of how a container is created. You'll find that on my GitHub under containers from scratch. I hope that's explained what rootless containers are. I'm more than happy to answer any questions you have, just reach out. Thanks.