My name is William, and I'm a software engineer at Facebook. Today, in this short talk, I'm going to give a presentation on container security at Facebook, and on a particular system we built to handle a particular class of container security problems. Quickly, the agenda: we'll start by talking about containers at Facebook and what distinguishes them from containers you may have encountered elsewhere, then container security in general, then Capmond and the problems it solves in our container environments, and finally how we're leveraging Capmond to build tight sandboxes.

All right, so what exactly is a container at Facebook? To start, each container gets its own view of the filesystem, and it runs in its own cgroup. To do that, we leverage standard technology: standard namespaces like cgroup, mount, and PID. We use cgroup v2. And we actually run a full init system within the container, which allows us to do things like running sshd, our syslog, and cron all inside the container. That makes our containers very powerful, capable, and self-contained, but it also requires that they have a certain amount of privilege. In contrast, some things we don't use: user namespaces or network namespaces, although we do have our own homegrown ways of doing network isolation. Keep this in mind, because it comes into play for some of the security issues we'll talk about later.

All right, so let's quickly get some context on how container security issues can arise. What could go wrong if we have a container running on a host with root privilege? In this particular case, we have /dev read-only bind-mounted from the host into the container. What could the container do to subvert this? Well, one thing it could do is just remount /dev read-write, change something on the device, and now we've basically owned the host.
So let's do one more example, something much more nefarious in this case. Again, we have a container running as root on the host. What if it installs a kernel module? What can it do now? Well, all kinds of interesting things. Maybe it breaks out of the container, or does something like nsenter into the root namespace, and installs some malicious probe. Later the module gets removed, some other container is running, and that probe is now basically monitoring this other container.

So I think I've belabored the point. The issue is that you don't want containers to run as root in general; you want something a little more finely grained. There are two broad security models to handle this: one is basically don't run as root, and the other is using Linux capabilities. Don't run as root is simple enough: you just don't let any processes run as root. It's simple, and in theory it's safe. But of course, at Facebook, it's completely untenable, for a number of reasons. For one, there are just too many privileged things we might need to do inside our containers, like setuid, binding to a reserved port, or changing file permissions and ownership. All of these things require root privilege. So what you'll end up having is containers all running as root, and that's bad.

This brings us to the next security model, which is Linux capabilities. As you may know, Linux capabilities are basically a way to break root up into a bunch of privileged operations, which you can then assign or remove on a per-process basis. And this is great; it's much better than the alternative, of course. But it also has its own issues. For one, it requires a lot more work to use effectively. Luckily, we have ways of leveraging Linux capabilities in all of the modern container engines: systemd-nspawn, rkt, systemd itself; they all have ways of setting capabilities.
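As a rough illustration of what per-service capability assignment looks like in systemd (a minimal sketch; the unit name, binary path, and chosen capabilities are all hypothetical, not a Facebook configuration):

```ini
# myservice.service -- hypothetical unit; names are illustrative only.
[Unit]
Description=Example service with a restricted capability set

[Service]
ExecStart=/usr/local/bin/myservice
# Everything outside this set is removed from the bounding set, so the
# process cannot acquire any other capability, even if it runs as root.
CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_CHOWN
# Grant a capability to the process even when it is not running as root.
AmbientCapabilities=CAP_NET_BIND_SERVICE
```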
And that's great. But there are still issues even with this, because you don't really know which capabilities to assign. So how do you solve that problem? You might say: well, let me just introspect the code and see what's going on. Unfortunately, again, at Facebook that's completely untenable. For one, we run a bunch of third-party libraries, we've got internal tools, we even have custom executable formats. So we can't do that. Another option might be to just look at the most nefarious offenders, like CAP_SYS_ADMIN, which is basically root privilege, or CAP_NET_ADMIN, which basically says you can do whatever you want to the network. But even these actually have legitimate use cases in containers, so we can't just set one hard and fast rule and remove them.

So basically, what I'm getting at here is that we really need some way of knowing exactly which capabilities a container needs, so that we can grant those and then remove everything else. And to do this, we actually built our own system, known as Capmond. Capmond is a logging daemon: it monitors Linux capability usage and resolves that usage to the container associated with it. The technologies it leverages are cgroup v2, BPF, and extended attributes. And we're currently working on open sourcing it.

Before we discuss how Capmond does what it does, let's first quickly get on the same page about the technologies it leverages. cgroups are basically a way to bundle processes into groupings, and those groupings are hierarchical; for each grouping, you can then set resource minimums, maximums, and so on. Notably for our discussion, there's a corresponding file hierarchy for the cgroup hierarchy, and you configure a cgroup by writing to that filesystem underneath the hood.
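Capabilities themselves are just bit positions, and the kernel exposes a process's capability sets as hex bitmasks in /proc/&lt;pid&gt;/status (the CapEff line). As a minimal sketch of what "breaking root up into privileged operations" looks like, here is a decoder over a small hand-picked subset of the capability table (the authoritative numbering lives in linux/capability.h):

```python
# Decode a CapEff-style hex bitmask into capability names.
# Only a small subset of the capability table is listed here; the
# authoritative bit numbering is defined in linux/capability.h.
CAPS = {
    0: "CAP_CHOWN",
    7: "CAP_SETUID",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    21: "CAP_SYS_ADMIN",
}

def decode_capmask(hex_mask: str) -> list[str]:
    """Return the names of the known capabilities set in the mask."""
    mask = int(hex_mask, 16)
    return [name for bit, name in sorted(CAPS.items()) if mask & (1 << bit)]

# A process that only needs to bind a reserved port:
print(decode_capmask("0000000000000400"))  # ['CAP_NET_BIND_SERVICE']
```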
BPF is basically a way to decorate a kernel function with a small program that you tell the kernel to run before or after it. It actually runs in kernel space, in a sandboxed environment, which is really great for security. And it also has ways to export data from kernel space to user space. Finally, extended attributes are basically just a way to attach metadata to a file. The reason this is useful for us is that it persists longer than, for example, the lifetime of a process: if you're monitoring capabilities for a process that comes alive, uses a capability, and dies really quickly, you don't want to rely on something as ephemeral as inspecting the process itself to do the resolution.

All right, now that we've set the stage, let's go through a toy example of how Capmond does what it does. We'll illustrate it with a systemd service file. So we have a service here, and we're actually going to start it using systemd-nspawn. What's most notable in the service file is that we run it in a slice, which is basically the way we tell systemd which cgroup we want the container to run inside of. At some point, you instantiate your service, and there it is, running on the host. And systemd takes that slice and, luckily for us, tags an extended attribute on the cgroup you specified for the container. Then, at some point later, your container uses some capability, in this case CAP_MKNOD. Capmond comes alive in the kernel and collects a bunch of metadata: the capability that was used and the cgroup inode number. These may seem completely mysterious, but we're actually going to use them to do the container resolution later. Then Capmond exports that metadata to user space using BPF, and we take it, notably the cgroup inode number.
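The toy example's service file might look roughly like this (a sketch; the unit, machine, and slice names are hypothetical, not the actual configuration from the talk):

```ini
# container-foo.service -- hypothetical; names are illustrative only.
[Unit]
Description=Example container started via systemd-nspawn

[Service]
ExecStart=/usr/bin/systemd-nspawn --machine=foo --directory=/var/lib/machines/foo
# The slice tells systemd which cgroup the container should run in.
# Per the talk, systemd tags that cgroup with an extended attribute
# that Capmond later reads back to identify the container.
Slice=containers-foo.slice
```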
And we basically crawl the cgroup file hierarchy looking for a match on that inode. Once we've found it, we can just read the extended attribute that was set by systemd. Then there's a final step where we resolve that UUID to the container it's associated with. In the case of nspawn, we can use a D-Bus API and basically ask systemd to give us the unit name.

So lastly, what can we do with this metadata? Well, one thing we're doing at Facebook is leveraging it to build automated capability management systems. We take that data, cache it locally for performance reasons, and send it to remote storage. Then, on some subsequent invocation of the container, we pull from remote storage, put those capabilities inside the container spec, and drop all other capabilities.

All right, so that's basically how it works. Let's just quickly talk about the performance characteristics. We've got several layers of caching that make Capmond very performant. First, what gives rise to these really good caching characteristics? For one, processes tend to behave similarly across different invocations, which basically means they're going to use the same capabilities pretty predictably. And with that, we can cache a number of things. In kernel space, we can cache per process, which allows us to export to user space only when we really need to; most of the time, Capmond just runs in the kernel and pretty much goes back to sleep. We have another layer of caching in user space, where we cache the mapping from the capability usage to the container name, so we don't have to keep crawling the cgroup filesystem; that helps us keep our I/O down. And then we actually have one more layer of caching in user space, specifically because we have asynchronous workloads, for example, that do a bunch of forking.
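The crawl step can be sketched as a simple walk of the cgroup file hierarchy, comparing each directory's inode number against the one reported from kernel space (a minimal, hypothetical sketch; the real resolution then also reads the systemd-set extended attribute off the matching cgroup, and the attribute name shown is an assumption):

```python
import os
from typing import Optional

def find_dir_by_inode(root: str, inode: int) -> Optional[str]:
    """Walk the directory tree under `root` (e.g. /sys/fs/cgroup) and
    return the path of the first directory whose inode number matches."""
    for dirpath, dirnames, _filenames in os.walk(root):
        for d in dirnames:
            path = os.path.join(dirpath, d)
            if os.stat(path).st_ino == inode:
                return path
    return None

# Usage on a cgroup2 host might look like:
#   path = find_dir_by_inode("/sys/fs/cgroup", reported_inode)
#   uuid = os.getxattr(path, "trusted.uuid")  # attribute name is hypothetical
```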
Those produce a bunch of unique PIDs, so caching by PID doesn't really work for those workloads. In that case, we want to cache by the other metadata, like the cgroup inode number. And that keeps our network usage pretty low. All of this gives rise to some really good performance characteristics. We get all of this for about 50 megabytes of RSS memory, mainly because there aren't that many capabilities, only thirty-some of them, so even though it's running on every single host, there aren't that many permutations to cache. And in terms of CPU, network, and I/O, usage is all very low. So it's just really great.

So let's quickly talk about how we're leveraging Capmond to build things. I already alluded in the previous slides to the automated capability management systems, which work in the way I illustrated earlier. We're also doing something very similar for our CI/CD workers. And then we're also building detection infrastructure. The main issue there is that we don't want containers using CAP_SYS_ADMIN, for example, just willy-nilly; we really want to know why they're using it. So basically, at some point, when we see a container using CAP_SYS_ADMIN, we'll have an audit process it has to go through, where we ensure it isn't doing anything it shouldn't be doing with that capability. And lastly, we're actually figuring out ways to leverage this to optimize our exec-call monitoring, which already exists at Facebook, but which we found we could probably do much better, in terms of performance, using Capmond.

All right, and that's the end of my very short talk. Thank you. I'll be up here. We have a question, or is that a stretch? "A couple of things. One, what happens when the container actually asks for a capability that you blocked? Are you at least logging the fact that your CI/CD system didn't find all the capabilities it could potentially use?"
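The layered caching described above can be sketched abstractly as a resolver that memoizes on (cgroup inode, capability), so the expensive crawl-and-resolve path only runs on a miss (a toy illustration of the idea, not Capmond's actual code; all names are hypothetical):

```python
from typing import Callable, Dict, Tuple

class CapUsageCache:
    """Memoize container resolution by (cgroup inode, capability), so PID
    churn from fork-heavy workloads doesn't defeat the cache."""

    def __init__(self, resolve: Callable[[int], str]):
        self._resolve = resolve          # expensive: crawl cgroupfs, D-Bus, etc.
        self._cache: Dict[Tuple[int, int], str] = {}
        self.misses = 0

    def lookup(self, cgroup_inode: int, capability: int) -> str:
        key = (cgroup_inode, capability)
        if key not in self._cache:
            self.misses += 1             # only now do the expensive resolution
            self._cache[key] = self._resolve(cgroup_inode)
        return self._cache[key]

# Toy usage: pretend resolution maps an inode to a container name.
cache = CapUsageCache(resolve=lambda ino: f"container-{ino}")
cache.lookup(42, 21)   # miss: resolves via the expensive path
cache.lookup(42, 21)   # hit: served from the cache
print(cache.misses)    # 1
```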
Right, and so we're basically mitigating that by ensuring that we've monitored for a sufficiently large window. Specifically, something like 30 days should be sufficient; you'd imagine that after 30 days, you've hit all of the edge cases that that process or container might hit. And for the cases where we actually don't catch something, our rollout systems do this in a two-phase process. In the first phase, we won't necessarily let the process break: rather than assigning the specific capabilities and stripping everything else, we can do something more like checking whether we got it right. Also, a lot of our containers have multiple instances; we might have a job that has thousands of tasks, for example. We don't have to do this capability stripping for every single task. We can do it for some subset, see if they break, and if it's safe, then we can roll it out further.

"OK, second question. Sometimes an application will attempt to do something, get permission denied, and then go on and do a safer thing afterwards. So if your monitor records the first capability it asks for and you allow it, do you sometimes just try to run it without the capability and see if it would continue to work? Drop all capabilities and actually see if it works?" So if it does some privileged operation during the monitoring phase, then we've said, OK, you need this capability, and it basically wouldn't get dropped from the set. But you bring up a good point: it could be the case that they can do the privileged thing, or they could get denied and then do something safer. That's not something we've really looked into so far. Right now we're just saying: OK, if you need this capability, then we'll give you this capability. Does that answer your question? Yeah. OK. And I'm out of time, unfortunately, but thank you very much.