Hey, everyone. Just a quick announcement: if you are not here for the Route to Rootless talk, you can bail now — I think the .NET track is next door. There seems to be some room confusion, probably my fault; the track sign says containers. So this is the container track, and this is the Route to Rootless. Are we good to go, or should we wait a little longer? All right, then let's go.

So I've got to give this fire exit announcement first. You can read this. I actually saw a note here that says you only have to give this announcement if there are 49 people or more, and I thought, that's a win-win: either I don't have to give the announcement, or I'm in the big leagues. So, looking around, I'm going to leave this up for a second.

My name is William Martin. I work for Pivotal in London. If you looked at the schedule, you might get confused, because you might look at it and say, wow, that William Martin guy is giving four sessions. It turns out I'm not the only William Martin at Pivotal, and I'm not the only William Martin at Pivotal giving talks here — which is totally outrageous, because I joined Pivotal first. But there's another William Martin here, giving a few talks on .NET, and you should definitely go to those.

I work on the Garden project, which provides containers to the Cloud Foundry platform, to Concourse CI, and to BOSH Lite. You can get me on Twitter at WLA Martin or on GitHub as William Martin — or if you want to chat, come find me afterwards and we can talk containers and Cloud Foundry.

So, this is The Route to Rootless, and I just want to lay down one ground rule. I heard some murmuring outside, and I know American English and British English are slightly different dialects. But every good talk starts with a good title, and this one is going to be "The Root to Rootless". I don't want to hear anything else.

While I was preparing this talk, I realized something about my co-engineer, Ed King, out there — he's been on the team with me for a long time.
And sometimes he goes by Ted, and I go by Will — and it's taken me two years to realize this. So I thought I could pitch an alternate talk title: Bill and Ted's Rootless Adventure. But I didn't think that would inspire a lot of confidence in Garden, or in me as the speaker. So we're doing The Route to Rootless.

What's this talk about? Well, containers are great, and they're a really good way to secure and isolate applications. But there have historically been some concerns about the piece of software that orchestrates the container lifecycle — creation, running processes, deletion, et cetera — so the Docker daemon, or in our case, the Garden server. The reason is that typically these have had to run as the privileged superuser, root, in Linux. And that's really bad practice — it's against Linux best practices — because if we have a vulnerability and get exploited, the attacker immediately has root on the system. That's pretty bad.

So what I'd like you to take away from this talk is really three things: one, what are the pieces that make up a container in Linux; two, how are those now being provided in an unprivileged, or rootless, manner; and three, how is Garden using some of that to make the platform more secure?

Let's start with a high-level problem statement. What security problem is Garden solving for Cloud Foundry right now? Well, the problem for Cloud Foundry, security-wise, is that it's public — you don't really know anything about the users running software on the platform. It's multi-tenant, so you could have malicious actors running alongside everyone else. And it can run Docker workloads if you've enabled the Docker feature flag, which is painful, because you can put almost anything in a Docker image. All of that together leaves me with a profound sense of distrust for running things. In fact, when I was putting the slide together, I thought, oh my god, what are we even doing here?
But how do we solve that right now? How are we dealing with this security issue? Well: containers.

So what is a container? Or, more specifically, what isn't a container? This is my Keyser Söze quote: the greatest trick containers ever pulled was convincing the world they exist. Containers in Linux are not really a thing; they are an abstraction over some kernel tools, some kernel primitives. So I like to think of a container more as a contract, which I've laid out here, and a container runtime should satisfy this contract.

First, confinement. Containers should have their own view of system resources. They should fairly share those resources with other processes. And they should be unable to modify those first two constraints.

Then, from a dependency management point of view: inside the container, it should be able to have whatever dependencies it needs, regardless of the host it's running on. And then there's the host side, which is slightly controversial — if you were at Dr. Julz's talk earlier, he mentioned the difference between Linux containers and containers as they're popularly known. A few years ago, a little company called Docker came along and figured out how to ship and store dependencies in a performant and efficient manner, and we'll get on to that as well.

So let's start with the first bit: confinement. How do we achieve that? Well, we've got this thing called Linux namespaces. There is a set of global resources in Linux — things like network devices, mount points, that kind of stuff — and namespaces allow you to wrap those resources such that processes appear to have, or believe they have, their own isolated set of them. These are the seven namespaces available in Linux right now.
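For reference, each of those seven namespaces corresponds to a clone(2)/unshare(2) flag in the kernel headers. Here's a minimal Python sketch — not from the talk — of those flag values and how a runtime might combine them:

```python
# The seven Linux namespaces and the clone(2) flags used to create them.
# Flag values are from <linux/sched.h>.
NAMESPACES = {
    "mnt":    0x00020000,  # CLONE_NEWNS: mount points
    "uts":    0x04000000,  # CLONE_NEWUTS: hostname, domain name
    "ipc":    0x08000000,  # CLONE_NEWIPC: System V IPC, POSIX message queues
    "user":   0x10000000,  # CLONE_NEWUSER: UID/GID mappings, privilege
    "pid":    0x20000000,  # CLONE_NEWPID: process IDs
    "net":    0x40000000,  # CLONE_NEWNET: devices, addresses, routes
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP: cgroup root directory
}

def unshare_flags(*names):
    """OR together the clone flags for the requested namespaces."""
    flags = 0
    for name in names:
        flags |= NAMESPACES[name]
    return flags
```

Passing the combined value to unshare(2) or clone(2) is what actually creates the namespaces; this sketch only shows the flag bookkeeping.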
And I'm just going to call out the user namespace specifically here, because it's going to become really important later when we talk about doing things unprivileged. User namespaces allow you to act as an unprivileged user in one namespace and then, in a child user namespace, act as a privileged user. We'll get on to what that really means later.

We talked about fair sharing, and in Garden this is really satisfied in two ways. We've got control groups, or cgroups. If Linux namespaces deal with what a process can see, cgroups deal with what a process can actually use — things like memory limits, disk throughput, and CPU scheduling. With those, plus disk quotas — ensuring that processes can't write so much to a disk that other processes are unable to — we start to deal with fair sharing. Disk quotas I'm going to come back to quite a lot later, because they've been a bit of a pain point for us.

So if Linux namespaces and control groups deal with the first two bullet points — an isolated set of resources, and fair sharing — we can now start to talk about how we make it so that processes are unable to modify those constraints. The namespaces and cgroups overlap a bit with these next three slides because, as I said, containers aren't really a thing in Linux; all these edges overlap.

But here's an important one: dropping capabilities. Historically, a process could either run as the privileged user or not. Over time, the kernel has split that privilege up into capabilities, which can be independently toggled. To give you a flavor of what these look like — whenever you see CAP_-something, that's a Linux capability — we've got things like CAP_SETUID, to change UID, or CAP_SYS_ADMIN, which unfortunately is becoming a sort of pseudo-root in the Linux world.
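To make the capabilities idea concrete, here's a small Python sketch, not from the talk, that decodes a capability bitmask — the format the kernel reports in the CapEff field of /proc/&lt;pid&gt;/status — and drops individual capabilities from it. Only a handful of well-known capability numbers are listed:

```python
# A few well-known capability bit numbers, from linux/capability.h.
CAPS = {
    0:  "CAP_CHOWN",
    6:  "CAP_SETGID",
    7:  "CAP_SETUID",
    12: "CAP_NET_ADMIN",
    21: "CAP_SYS_ADMIN",
}

def decode_caps(mask):
    """List the known capability names set in a CapEff-style bitmask."""
    return [name for bit, name in sorted(CAPS.items()) if mask & (1 << bit)]

def drop(mask, *names):
    """Return the mask with the named capabilities cleared."""
    for bit, name in CAPS.items():
        if name in names:
            mask &= ~(1 << bit)
    return mask
```

Actually dropping capabilities in a runtime is done via syscalls like capset(2) and prctl(2); this just illustrates the independent-toggle model.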
But it's certainly a better world than one single toggle.

We've got seccomp, secure computing mode. This is a kernel technology that allows you to limit the system calls available to processes, or the flags on those system calls. For example, Garden containers can make the clone syscall, which they need to create new processes, but they cannot create new namespaces, because we limit the flags available on that system call.

We've got AppArmor, which is mandatory access control: once you enable it on a process, it cannot be turned off. It's defined as a text file with rules like denying read, write, or execute on /proc/sysrq-trigger under the proc filesystem. So AppArmor provides yet another way to limit privileges.

So you might think, wow, there's a bunch of things stacked on top of each other here, all dealing with those confinement bullet points. And that's true — we defend things in multiple ways. That's because we have this aspiration: to be the most secure container provider out of the box. Now, I'm not saying that's the truth today — I would never claim that — but it is an aspiration, a goal for Garden. We like to think of our security onion: you peel away one layer, and you've still got another mechanism of defense. And we believe operators should have to explicitly take action to reduce their security, rather than having to toggle levers to get the right security. In practice, this has been demonstrated a few times — I pulled out some CVEs in the container world from the last year and a half or so.
For example, with the one in the middle, user namespaces and AppArmor prevented the SCSI "mic drop" vulnerability — which is great, because some people running Docker containers, which didn't have user namespaces on by default, and who didn't have AppArmor turned on, found that with this vulnerability you could just disconnect the hard disk. What we find in practice is that for most of the CVEs that come out, we've got multiple layers of the onion protecting us. That's really important.

So moving on — we did confinement; let's talk dependency management. Inside the container, we manage this via the system call pivot_root; on the host, we can talk about layered filesystems.

How does pivot_root work? It's a system call that allows you to change what a process sees at slash, its root directory. And it works very simply, something like this. We've got a process called run-shoe — that's what we name things — and it asks, what's in slash? The files that come back are boring host Ubuntu files. Then it can run pivot_root, passing in a new directory — and that's all it needs to be, really: a directory at the top of a filesystem tree. With that done, when it asks what's in slash, it gets this cool container BusyBox filesystem. That's how, in container land — think of Docker images, which often start with a directive like FROM ubuntu or FROM alpine — you can run these on any host, but at the end of the day you get that distribution's dependencies in your root filesystem.

Finally, we've got layered filesystems. A quick segue here into how they came about. You've got your process, run-shoe, and it's got some operating-system-specific dependencies. One way to deal with packaging those dependencies is to wrap it all up as a virtual machine image — an AMI or a VMDK, something like that. And that works. But it's got some drawbacks.
For example, virtual machine startup has historically been a little slow. But a bigger problem is the duplication that can happen. If you've got a web server that relies on Ubuntu and a database that relies on Ubuntu, and you package each as an image, you end up with that entire Ubuntu distribution twice — and these can end up being multi-gigabyte files.

We can be smarter than that, and this is the thing Docker really popularized with layered filesystems. They'd been around for a while, don't get me wrong, but Docker really popularized them. The way it works is that you have a single directory containing your Ubuntu distribution, treated as read-only, and then you use a copy-on-write filesystem to layer other directories on top. So you can have your single directory containing Ubuntu, then a directory containing the dependencies for your database, and one containing those for your web server. Through some special mounting, you just apply deltas on top of Ubuntu, so you only need to transfer that Ubuntu layer once and store it once.

Quick segue again: enter runc. I just wanted to talk about this for a bit. We don't do a lot of this work in Garden; it really comes from runc, a CLI tool that came out of the Open Container Initiative, and it does a lot of the heavy container lifting. Really the only thing from that list it doesn't do is the layered filesystem stuff, which we manage in a tool called GrootFS. runc is used by Docker and Kubernetes, and we wrap it up and make use of it. That's really important, because we build on a lot of really good work from a lot of really good people.

So everything is secure inside a Cloud Foundry container, right? Well, yes — hopefully. It seems to be playing out that way right now. But if you were at Dr.
Julz's talk earlier, you might have seen this analogy — I think it's great. You've got this door, which looks pretty secure to me, and you treat it as if it's the container: you're trying to exploit the system, you barge against the door, and you don't get anywhere. So you start thinking, well, what else can I do to get through this door? Then you look at the wall — and if the Kool-Aid Man can smash through that wall, well, you need to work on your security.

So again, we have this aspiration: to be the most secure container provider out of the box. We've done a lot of work on securing the containers themselves, but now we need to start thinking about the piece that orchestrates them: the Garden server. That brings me to the real purpose of this talk, which is the route to rootless.

When I was putting these slides together last night, I was looking at this image and thinking, wow, it's got some twists and turns, but it looks pretty manageable — and really, the route to rootless has felt a bit more like this at times. It's had its stops and starts, it's had its troubles, but hopefully I'm going to explain how we got over some of those things in the next fifteen minutes or so.

Before we get into that, some really important shout-outs. These are people in the container community who have really spearheaded the rootless work, the move to unprivileged. We've got Jessie Frazelle — I think she works at Microsoft now. Aleksa Sarai, who did a huge amount of the work to let runc run unprivileged; that's been super important. And — hopefully I say his name right — Akihiro Suda, who has been doing a lot of this work in containerd. I call that out because, if you went to the Garden update talk, you saw we're making moves to use containerd in Garden. And there are many more people, of course — kernel maintainers, et cetera.
But I think it's really important to call them out — you can look up their work — and also because, if we didn't have them, our journey would have ended pretty quickly. It would feel a lot more like this.

So let's look back at our contract. What is a container? We've got confinement and dependency management. Let's deal with the first bit: unprivileged user namespaces. I said I'd come back to user namespaces, and they're really key to a lot of this work. Since Linux 3.8, you've been able to create new user namespaces as an unprivileged user. The key thing to know is that every other namespace — the isolation mechanisms for processes — has an owning user namespace. All of the initial namespaces are owned by the initial user namespace. But as an unprivileged user, you can create a new user namespace in which you have privilege, and once you have that privilege, you can create the rest of the namespaces. If you then perform a privileged operation on one of those namespaces — say, a mount call — the kernel will check against that mount namespace, find the owning user namespace, and see whether you have privilege there. So once you've got privilege in your user namespace, you can do all the other namespaces, you can do seccomp, and you can do AppArmor.

So how do user namespaces really work? Well, users are living a sort of double life. On the host, you are boring, lame Edward Norton from Fight Club, but in the container, you can be cool, sexy Brad Pitt from Fight Club. The way this works is that you specify UID mappings from the outer user namespace to the inner, child user namespace. These exist as files per process — /proc/self/uid_map and /proc/self/gid_map — and they contain triples: an inner ID, an outer ID, and a range saying how many IDs you want to map in that user namespace. I pulled one of these mappings off a rootless deployment of Garden.
And to read this: you can see that UID 0 — root inside the container — is actually mapped to the maximum possible UID in the initial user namespace. So as that unprivileged max-UID user on the host, inside the user namespace you are privileged. And there we only map that one ID. But in Garden, it's important that we can have many user IDs in the namespace, so we also map everything above 65,000 to the max UID minus 65,000. Why don't we map the bottom 65,000? It's just another layer of security: if there were ever a container breakout — fingers crossed there isn't — you might end up as a user that actually has some privilege on the host, something with sudo, something like that. So we try not to map those users in at all.

So that was user namespaces — but what about unprivileged user namespaces? By default, as an unprivileged user you can only map your own ID into a user namespace, which sort of makes sense, because you shouldn't be able to become any other user anyway. But there are these two binaries, newuidmap and newgidmap, which live in the shadow package, and they are setuid. What does setuid mean? It means that if you execute the binary, you change user to whoever owns the file. So if these binaries are owned by root and you execute them as an unprivileged user, you run them as root and gain all the privileges that come with that. But importantly — and this is really the key bit — the mappings are validated by these tools against an /etc/subuid file, which is only writable by root. That allows root to decide what unprivileged users should be able to map. We PR'd the initial version of support for this to runc; that's now in there, and we're using it fine.

More confinement stuff, but now thinking about fair sharing: there's no way to do cgroups entirely unprivileged yet. And that's because they're really just a filesystem.
And whenever you mount that filesystem, all the files are owned by root — but we need to be able to change those files as unprivileged users. So during a setup phase in the Garden server, we chown them, because at that point we are still privileged, before spawning our daemon as an unprivileged user. I'll get on to why that's not great — but okay — later. And again, we PR'd support for unprivileged cgroups to runc.

Dependency management in the container is real simple: the user namespace gives us the CAP_SYS_ADMIN capability, and that allows us to do the mount operations required by pivot_root.

The host-side dependency management piece has been a little harder. Historically, Garden had its root filesystem management built in, in a library called garden-shed, and the overlay mounting was done using AUFS — which is just not possible unprivileged. There's no chance; it doesn't work. So we decided that was a good seam, separated it out behind a plug-in interface, and that became GrootFS. The initial version of GrootFS used BtrFS snapshots to do these layered filesystems: we could do an initial setup as privileged, and then at runtime everything could be done unprivileged — snapshots are fine in BtrFS unprivileged. But unfortunately, we found that the disk quotas didn't scale very well, and everything eventually blew up. So we went back to the drawing board and landed on OverlayFS.

There's a key thing here, an unfortunate restriction. Ubuntu, in conversation with the OverlayFS maintainer, decided that it was acceptable — not a security risk — to allow OverlayFS operations as an unprivileged user. So in the kernel Ubuntu compiles, there is a flag that says: this is fine, you can do operations on OverlayFS as an unprivileged user. Unfortunately, that ties us to Ubuntu right now if you want to run rootless.
But it seems to be working, so that's good.

Now, how do we take advantage of this unprivileged mounting? I pulled out a snippet here from a configuration file. runc expects a JSON file that conforms to the OCI runtime specification, and in there is a section called mounts, which runc will perform once it's moved into the right namespaces. So instead of doing all the overlay mounting ourselves in our initial user and mount namespaces and then telling runc "use this directory", we can ask runc to do it all. You can see here we've got a source and a type of overlay, a whole bunch of options which tell the overlay which directories to mount and where, and then a destination, which says: mount it all at slash, which will be the root filesystem for the container.

So what are the roadblocks right now to being fully unprivileged? Well, disk quotas, for one. BtrFS handled these, but unfortunately it blew up on us, so we had to use XFS, another filesystem — and that requires privilege; there's no way around it right now. So again, we pulled out a small, focused setuid binary that just does disk quotas.

Another big part of containers that still requires a lot of privilege is networking. This also used to be integrated into Garden, and for reasons not related to security — container-to-container networking — we pulled out a seam here too, and another team created this thing called the Garden External Networker. In the end, that's been really useful for us, because we can just make that piece setuid, or setcap — they may have done setcap, which, instead of changing UID, lets you gain certain capabilities when you execute a file. And there is some really good work going into this from Akihiro Suda, who I've seen has proofs of concept of networking running unprivileged. So there's hope here.
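Going back to that runc mounts snippet for a moment: here's a Python sketch, not from the talk, of what such an entry could look like. The shape follows the mounts section of the OCI runtime specification as described above; the GrootFS-style paths are made up for illustration:

```python
import json

# Build an OCI-spec "mounts" entry asking runc to perform the overlay
# mount itself, once it has entered the container's namespaces.
def overlay_mount(lower_dirs, upper_dir, work_dir, destination="/"):
    options = [
        "lowerdir=" + ":".join(lower_dirs),  # read-only image layers
        "upperdir=" + upper_dir,             # container's writable layer
        "workdir=" + work_dir,               # scratch dir overlayfs requires
    ]
    return {
        "destination": destination,  # mount it all at slash: the rootfs
        "type": "overlay",
        "source": "overlay",
        "options": options,
    }

# Hypothetical paths, for illustration only.
spec_fragment = {"mounts": [overlay_mount(
    ["/var/lib/grootfs/layers/ubuntu"],
    "/var/lib/grootfs/containers/abc/upper",
    "/var/lib/grootfs/containers/abc/work",
)]}
print(json.dumps(spec_fragment, indent=2))
```

Because runc performs this mount after moving into the container's namespaces, the mount happens with the namespaced privilege rather than with real root.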
And then there's the Garden setup phase. I said we have to chown the cgroup filesystem, and we do, but it's sort of okay, because it happens before any user input is possible and before any workload runs in Cloud Foundry. I did see at least one attempt on the kernel mailing list by Aleksa to fix this, but no luck from the maintainers yet.

So there are a few things we'd like to fix; we're not there yet. But don't worry, be happy. This morning I thought, oh, this would be a good place for an image, so I googled "don't worry, be happy" — and one of the first hits was this, and it's terrifying. It does not make me feel happy at all. But it is pretty cool.

Don't worry, be happy, because really we're playing the long game here. We recognize that we can't just big-bang this whole thing. We want to reduce privilege where we can, when we can, because some things take time — and that's actually okay, because proving them out from a functionality and a security point of view is important. One way we've found to do that is by breaking apart monoliths: in both of these cases, filesystem management and networking, we had libraries embedded in Garden, and we pulled them out into plugins.

It's really important that we share these technologies with the community. Like I said, I cannot overstate how important runc has been to achieving this goal. Historically, we didn't use runc — it wasn't a thing before the Open Container Initiative came about — and all our containerization code was written in-house, in this thing called Garden Linux. I cannot imagine doing this piece of work on that code base now. The community has given a lot to us, and it's important that we give back; hopefully, as we move to containerd, we'll be able to continue doing that. So take heart, because things are getting better.

So, does it work? I wish I could say yes, it definitely works.
But all I can say right now is: hopefully. It does pass the Cloud Foundry acceptance tests in the default deployment, and that's great. And hopefully it's going out on Pivotal Web Services soon, so it'll get some real production traffic.

How can you try it? Well, BOSH is the way we deploy Cloud Foundry and Garden, and there is a flag you can set: experimental rootless mode. That's it — that's the only thing you need to change. We also have a standalone binary, just for playing about with, that you can get on GitHub. We grabbed that, stood it up on an Ubuntu virtual machine, ran Garden as the ubuntu user, and it worked fine. So I'm optimistic.

So that's actually the end — thank you for listening. Although, one more thing: if you found this interesting and you like containers, you should come to the next talk, which is in here — or maybe next door, I'm not actually sure now — by Dr. Julz, who's sitting over there, where he talks about why you absolutely shouldn't care about containers at all. That'll be a really good talk; I've seen the preview.

I do have some time for questions, I think, and I definitely have colleagues in the room I can shunt them to if I don't know the answer. So, any questions? Yes, okay, sure. The question is: what's the relationship between GrootFS and OverlayFS? I'll tell you what — GrootFS is terribly, horribly named, and I'm entirely responsible for that. I think I recommended it on Slack, and Julz just ran with it, because he loves terrible names. GrootFS does a lot more than OverlayFS, in that it's sort of built on top. Whenever Diego or Concourse requests a container from Garden, it requests the container to be created with a particular root filesystem, which could be a Docker image or it could just be a tar file.
And let's say it's a Docker image: GrootFS will go off to whatever registry, pull the manifest for that image, then look at the manifest and download all the filesystem layers referenced within it. Once it's got those, the behavior changes depending on whether you're running rootful or rootless. If you're running with root, GrootFS will take those layers and do the overlay mount itself — a mount syscall with the options you saw, passing the lower directories and an upper dir — and at the end you'll have a directory, mounted via overlay, which acts as the root filesystem. If you're running rootless, then GrootFS will instead produce the JSON you saw, pass that back to Garden, and Garden will embed it within runc's configuration file. So GrootFS really does a whole bunch of image management stuff; the overlay is just the bit that takes the directories and layers them on top of each other. Does that answer your question? Yeah? Okay. Anything else? All right, that's enough time. So thank you very much.