Hey, everyone. I'm Greg, and this is Vinayak. We're talking to you today about least privilege containers and how we did it on GKE. This is going to be a little bit weird because it's pre-recorded, but we're going to go with it anyway. So hands up if you've heard this security advice: don't run containers as root. And keep your hand up if you've actually done it, if you've actually implemented that advice and your containers aren't running as root. I'm going to guess that most people put their hands down for the second part. We've been saying this sort of stuff for a long time, and I mean "we" as in the collective Kubernetes and container security community. You can actually watch Liz Rice give a keynote in 2018 at this exact conference about how you shouldn't run containers as root. Is that advice working? I think the short answer, unfortunately, is no. Sysdig wrote a report that said 76% of the containers they see run as root, and they've got pretty good visibility of what's actually happening on a lot of machines. That pretty much matches my experience, too. So we're going to talk about what we did with GKE and Kubernetes itself. We migrated the majority of the GKE platform's system containers, the containers that actually run GKE, to non-root. Today we're going to talk about why you would even bother doing that: what's the point of running as non-root? Then, how you do it, what our strategy was, how we went about moving all these containers, some design choices we had to make, and all of the little bumps in the road that we hit. And then we're also going to talk about a future feature that's coming that should make this a little bit easier. So why do we even care about this? What does it matter whether containers run as non-root or not?
If you think about how a container looks at a high level, there's a container, a container runtime, and a host kernel. Optionally, there might be some stuff mapped into that container, like a host path or the host network, that kind of thing. So if you're a bad guy who's compromised that container and you're looking for a way to get at the host and at the other containers running on it, you've got a few different ways you can attack. You can use vulnerabilities in the runtime or the kernel, or you can escalate through some of those additional host bits and pieces that are mapped into the container. Not running as root just makes all of that a whole lot harder. Some of those vulnerabilities flat out don't work: you might have heard of the Azurescape vulnerability; that one wouldn't work if you weren't running as root. And even when the attacks do work, for the most part the attacker ends up in a box that is just an unprivileged user account on a Linux system. They still have to escalate to root to get at all of the other containers and everything else. So it really does provide quite a lot of protection, but sadly, hardly anyone's using it. And just a little bit more on kernel container breakouts. I'm making a fairly bold statement here: pretty much every Linux kernel has a breakout right at this moment. The reason we know that is that we have a well-maintained Linux kernel that we pay a lot of security attention to for our Container-Optimized OS, and we actually pay a bug bounty through kCTF, the Kubernetes capture-the-flag program. We put flags around on a well-maintained GKE system, and if you can break out of a container and get to one of those flags, we'll pay you a bunch of money. The whole scope of GKE, Kubernetes, the container runtime, and all of that is in scope.
We've been running this bug bounty program for a couple of years now and we've paid out $1.3 million, and literally every one of those dollars has gone to a Linux kernel container escape. That's how many of these things there are. Just last year we found and fixed 17 exploitable breakouts. This is all generating some cool research: collecting all of these breakout exploits is really helping us drive some upstream kernel work that hopefully will make this harder for attackers in the future. But the point here is that this attack path is definitely very viable, and we really need that non-root protection. So how do you run as non-root? First, let's start with what non-root even means. Looking at the top line here, you can run privileged containers, and we're not really going to cover those much in this talk. They run as root, but they also run with every single possible root capability, more than is normal, and they disable some security controls on the host. Eric wrote a blog post, linked here, about how it's not even really containing much anymore if you're running as privileged. So we're mostly going to ignore that use case; as a best practice, don't run as privileged. You really don't need that much privilege in the vast majority of cases. What most people are doing is the middle one here, which is just running containers with their defaults and rolling with it. We're going to talk about how you can move from that situation, where it's root and you probably didn't even really know, to a situation where it's much less privileged. One more thing we're not covering in this talk is running the container runtime itself as non-root. That's actually a pretty cool idea and there's some good research and some implementations; you can read more at the website here. So there are two real parts: you have to make the container work as non-root,
and then once you've done that, you can tell Kubernetes or Docker or whatever to run it as non-root. We're going to illustrate that through a quick demo. We start out just by running the regular nginx container. On the bottom part of the screen, I've got a process listing running that shows us which user those processes are running as on the host. So nginx is running as the root user here, and it's actually running some processes as an unprivileged user as well. What's going on is that they've split out all of the privileged stuff they need to do, like binding the low port and reading configuration, from all of the more dangerous, internet-facing stuff that's going to get compromised, which runs as an unprivileged user. That's actually a pretty good setup, but I think we can do better: we should be able to run this thing as non-root completely. Let's just try telling Docker to run it as this unprivileged nginx user. And we can see it doesn't work. The user inside the container is trying to write to a directory that's owned by root, so we get permission errors. So we're going to have to fix it. That involves making a new Dockerfile, effectively making a new image: starting from nginx, then fixing the permissions on all of the files it tries to modify in the course of its normal business, so they're owned by this unprivileged user. Then we tell Docker to run the container as that unprivileged user by default, so we don't even have to pass in a username anymore. Let's build it and run it. As you can see, no username passed in here, and on the bottom you can see we're now running completely as non-root; none of the processes are privileged anymore. That's good, but it was a whole ton of work to get there. We had to understand how nginx works, understand which files it needs to modify as root, and go fix all of those with the right user.
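As a rough sketch of what that kind of Dockerfile could look like (the paths, the port change, and the sed patterns here are our illustration, not the exact demo; the official nginxinc/nginx-unprivileged image does the equivalent work for you):

```dockerfile
FROM nginx
# Move nginx onto an unprivileged port and a writable pid file, then
# chown the paths it writes to at runtime, so it no longer needs root.
RUN sed -i 's/listen  *80;/listen 8080;/' /etc/nginx/conf.d/default.conf \
 && sed -i 's,/var/run/nginx.pid,/tmp/nginx.pid,' /etc/nginx/nginx.conf \
 && chown -R nginx:nginx /var/cache/nginx /var/log/nginx /etc/nginx/conf.d
# Run as the image's existing unprivileged nginx user by default.
USER nginx
```

Built with something like `docker build -t nginx-nonroot .`, the resulting image can then be run without passing `--user` at all.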
And that's just too much work. Really, the container owner should be doing most of that for us. To their credit, nginx has done exactly that. Here's a completely unprivileged container, an official one they provide, that does all of the work I just did without any of the maintenance burden falling on the user of the container. So that's great for one container. But what if you want to migrate a whole bunch of containers, a whole company's worth of containers, to run as non-root? That's what we're going to talk about next. Here's the strategy we took to move all the GKE containers that we could, and it's pretty much the same strategy for every security thing we do: you stop the bad thing from happening for all of the new stuff coming in. For new containers running on GKE, we want to shut the gate and make sure we aren't adding to the problem. So we block new root containers from coming in, and then we work through the ones we've already got as a burn-down list. And wherever we can, we fix upstream as well. We did this in a bunch of places in Kubernetes itself; there are links here to some of the changes we made to run the core containers as non-root. In terms of stopping new root containers from coming in, we use a thing called a pre-submit inside Google. Pretty much everything in Google goes into source control at some point, and the same is true for GKE: if you're a developer on GKE and you're going to add a new GKE system container, you need to commit that piece of YAML into source control, and we check it to make sure it's not running as root. We're going to come back to that check, and to the other part of the strategy, in more detail in just a moment.
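To give a feel for what such a pre-submit check does, here's a minimal sketch (this is our illustration, not Google's actual tooling; the allowed UID range is invented, and the input is a pod spec already parsed from YAML into a dict):

```python
# Sketch of a pre-submit-style check: every container in a pod spec must
# run non-root, with a UID drawn from an agreed range and not already
# claimed by another container.

ALLOWED_RANGE = range(2000, 3000)  # example range, not GKE's actual one


def check_pod(pod, seen_uids):
    """Return a list of violations; record newly claimed UIDs in seen_uids."""
    errors = []
    for c in pod.get("spec", {}).get("containers", []):
        sc = c.get("securityContext", {})
        uid = sc.get("runAsUser")
        if not sc.get("runAsNonRoot"):
            errors.append(f"{c['name']}: runAsNonRoot must be true")
        if uid is None or uid == 0:
            errors.append(f"{c['name']}: runAsUser must be set and non-zero")
        elif uid not in ALLOWED_RANGE:
            errors.append(f"{c['name']}: runAsUser {uid} outside allowed range")
        elif uid in seen_uids:
            errors.append(f"{c['name']}: runAsUser {uid} already taken")
        else:
            seen_uids.add(uid)
    return errors
```

A real pre-submit would walk every committed YAML file and fail the change if any list comes back non-empty.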
The other part of the strategy is migrating the existing containers, and we had a security-driven process here. The security team, and actually it was pretty much just Vinayak, looked at all the containers and their pod specs, changed them to run as non-root, and then checked that the tests pass. If the tests pass, great, we're pretty much done, and we can send it off to production after the container owner's taken a look at it. If they don't pass, then we figure it out with the container owner: there's something we need to change inside that container, or in its configuration, and then we get it running as non-root eventually. To make this work, we had to make a bunch of design choices up front about how we were going to manage these non-root containers, and these are the three main ones. Just to remind you, these are GKE system containers, so they're really infrastructure heavy. These aren't your typical run-a-web-app containers with hardly any host interaction; these ones do things like stand up proxies, read the host's log files, and various other interactions with the host. So in terms of complexity, this is probably a lot harder than migrating a bunch of application containers. The very first thing we had to figure out was: are we going to make it run as non-root in the container image, or at runtime? If you own the container, doing it in the image is great, because then anyone who runs that container just gets better security without having to do anything; we saw that with the nginx example. But for the most part, our approach was to use configuration rather than changing the container itself. That way we didn't have to rebuild the container, and it was really easy to audit.
So we could just look at the configuration of the container and tell, yes, this container is running as non-root. You can template it if you need to. And it enabled us to do all of this in a central team without having to contact every single container owner and modify every container's build. You need all of the bits and pieces in this YAML here to make a container run unprivileged; we're going to go through them in more detail later. On enforcement and auditability: coming back to that pre-submit check I mentioned, it's basically checking all of that piece of YAML I showed before. All of those fields need to have the right settings, and it checks that the user and group are unique and non-zero, so not root. You don't have to do it that way, though; not everyone has a pre-submit system or a repository that looks like the way Google manages code. You can definitely do this at runtime. You can do it with Pod Security Admission in restricted mode, which does all of that checking for you; all of these things are already in the restricted profile. Or you could do it with pretty much any of the admission controllers around; there are links on the slide to the Gatekeeper versions of all the policies you would use to do this at runtime. So what do you do about user IDs and group IDs? The goal here is no collisions. We want to make sure a container that needs particular privileges never shares a user ID with other containers that don't need those privileges, because if a bad guy compromises one, they shouldn't get access to the union of all of those permissions. So it's better if they're all kept separate, with no overlap in user IDs.
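For reference, a sketch of the kind of pod-spec fields involved (the name, image, and the UID/GID values are illustrative; these are the standard Kubernetes securityContext fields that Pod Security Admission's restricted profile checks):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example                  # illustrative pod
spec:
  containers:
  - name: app
    image: example/app:latest    # placeholder image
    securityContext:
      runAsNonRoot: true               # refuse to start if the image runs as UID 0
      runAsUser: 2046                  # unique unprivileged UID (example value)
      runAsGroup: 2046                 # matching unprivileged GID
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
```

The pre-submit (or an admission policy) just has to verify these fields are present, the UID/GID are non-zero and unique, and capabilities are dropped.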
So basically we decided on a range, and then we used the pre-submit to make sure those user IDs and group IDs were unique. There are other ways you can do this. You could conceivably write a mutating admission controller that keeps track of the user IDs it's allocated and sets them on the containers for you; that's a bit of work. You could coordinate off-cluster, with a list or a database somewhere that you maintain. But right at the end of the talk, we're going to cover a new Kubernetes feature that fixes this in a way that basically means you don't have to worry about it too much. So I'm going to switch over to Vinayak and have him talk about the challenges we had and how we solved them. Thanks, Greg. I'm going to talk about some of the challenges you might encounter when you migrate your containers to non-root. There are two major ones: access to things on the host, and managing capabilities. Let's first look at accessing things on the host. As Greg mentioned, our containers are infrastructure heavy, so sometimes they need to set up the host in a certain way, or interact with the host, so that GKE can function. One such interaction is access to directories and files on the host. Here I have an example of the Konnectivity server container, which is running as non-root and mounts a hostPath volume, the konnectivity-server directory. At boot, it tries to create a Unix domain socket in that directory to interact with the kube-apiserver. So you might run into cases where you want to give a non-root container access to a host file. Let's look at one possible solution. If you've interacted with the Kubernetes pod security context before, you'll have seen a field called fsGroup. It sets the group ownership of mounted volumes to the group you specify in fsGroup.
It works for other volume types, but it won't work for hostPath volumes, so we can't use this solution here. Our way around it was to use init containers. Here we have an init container that runs as root and sets the ownership of that directory. One property of init containers is that they run before any of your long-running containers, and they always run to completion. One thing you might be thinking is: hey, these people have been telling me to run containers as non-root, and now they're running this one as root. And that's okay, because the risk here is minuscule: this container runs to completion and will probably last less than a second in this case. So let's look at a demo of this in action. Here I have a pod running the Konnectivity server container. It's running as user 2046, and it mounts the konnectivity-server directory from the host. I'd also like to point out that it's incorrectly relying on fsGroup to set up the directory permissions for it. Let's see what happens when we apply this to our cluster. Now let's check the status of the pod: we can see that it's failing. So let's look at the logs. I've modified the container a little to ls the directory permissions, but the first thing to notice is that the socket creation is failing with permission denied. And if we look at the permissions on the directory, it's still owned by root. So let's try to fix that with an init container. We'll just look at the init container for now. As you can see, this init container runs as root, and it changes the ownership of the konnectivity-server directory to user 2046, the user the Konnectivity server runs as, and it runs before the Konnectivity server container. Here I've modified the YAML to include this init container, so let's apply the modified YAML and look at the status of the pod.
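The init container pattern sketched as YAML (the host path, image, and UID here are our illustration, not GKE's actual manifests):

```yaml
spec:
  volumes:
  - name: server-dir
    hostPath:
      path: /var/run/konnectivity        # illustrative host directory
      type: DirectoryOrCreate
  initContainers:
  - name: fix-ownership
    image: busybox
    # Runs as root, but only briefly and to completion, just to hand the
    # hostPath directory over to the unprivileged user.
    command: ["chown", "-R", "2046:2046", "/var/run/konnectivity"]
    securityContext:
      runAsUser: 0
    volumeMounts:
    - name: server-dir
      mountPath: /var/run/konnectivity
  containers:
  - name: konnectivity-server
    image: example/konnectivity-server   # placeholder image
    securityContext:
      runAsUser: 2046
      runAsGroup: 2046
    volumeMounts:
    - name: server-dir
      mountPath: /var/run/konnectivity
```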
It's running now, which is good, so let's look at the logs. We can see that the init container ran before the long-running container and set the directory ownership to 2046, and the socket is now successfully created. Great. So this init container approach works if your container is accessing one file, but there were cases where multiple containers needed to access the same directory or file, and the init container approach won't work there, because you'd effectively have multiple init containers fighting to set the permissions, and you don't want that. Our solution for those cases was supplemental groups. supplementalGroups is a list of groups you can provide in the pod security context, and your user is treated as part of those groups. So what we did for high-traffic directories and files, by which I mean ones accessed by multiple containers, was to give each one a unique group ID, and the containers would just add that ID to their supplemental groups. The advantage of this approach is that we never had to manage adding the UIDs of all our containers to the groups; you just set the supplemental group and you get access. Hopefully those tips and tricks will help you on your journey to migrate your containers to non-root when you need to access things on the host. But the other challenge you might face is interactions with capabilities, so let's look at that now. Capabilities are a way of assigning some part of root's privileges to an unprivileged process. We recommend that you drop all capabilities and add back only the ones you need, and we recommend doing this for your root containers as well. The other important field here is allowPrivilegeEscalation, which you should always set to false.
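The supplemental-groups version looks roughly like this (the group ID 3001 is an invented example, standing in for a shared host directory that's group-owned by GID 3001):

```yaml
spec:
  securityContext:
    # Every container in the pod is treated as a member of group 3001,
    # so each keeps its own unique UID but can still access files on the
    # host that are group-owned by 3001.
    supplementalGroups: [3001]
  containers:
  - name: first-consumer
    image: example/consumer-a    # placeholder image
    securityContext:
      runAsUser: 2046
  - name: second-consumer
    image: example/consumer-b    # placeholder image
    securityContext:
      runAsUser: 2047
```

Note this keeps the UID-per-container uniqueness property: sharing happens through the group, not by reusing a user ID.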
What that does is prevent a child process from having more privileges than its parent. There can be some weird interactions between allowPrivilegeEscalation and capabilities, and I've documented some of those at the link on this slide, because they're not really needed very often. Now, here's a surprise for you: dropping and adding capabilities works for a root container, but if you do it for a non-root container, it won't work without modifying the build of that container. That's because when you transition from root to non-root, all of the effective and permitted capabilities are dropped, and that prevents this from working. For the particular capability I called out here, NET_BIND_SERVICE, which is required to bind to low ports, there's a workaround: just bind to a higher port and you don't have to deal with this. As for a more general solution for allowing non-root containers to have capabilities, initially we thought that would be ambient capabilities, but this is also very elegantly solved by the Linux user namespaces feature, and so we've changed our minds and just rely on that feature now. So let's look at how we overcame this challenge: we relied on file capabilities. The way those work is that if a non-root process runs a binary that has file capabilities applied to it, those capabilities show up in its effective and permitted sets. Let's look at a demo of how we modified the container with file capabilities. Here I have a CoreDNS container running as user 2004. It's correctly dropping all its capabilities and only adding the NET_BIND_SERVICE capability, but the build of this container is not modified in any way. Let's see what happens when we apply this to our cluster. You can see that it's erroring; let's look at the logs to see what's going on.
So we can see that it's failing to bind to port 53, the default port CoreDNS binds to, and that is a privileged port because its number is less than 1024. So now let's see how we modify the build. Here we've added a few more stages. In the first stage, we install the libcap2 package, which gives us some utilities to manipulate file capabilities, and then we use the setcap utility to add the NET_BIND_SERVICE capability to the effective and permitted sets of the CoreDNS binary. Now let's build this image and push it with a file-caps tag, then update our pod with the new tag, leaving everything else the same. Now let's look at the status of the pod, and we can see that it's running. So using file capabilities, we were able to fix that pod and run it as non-root while binding to a low port. If through all these demos you've had the thought, hey, this seems harder than it should be, you're not wrong. But hopefully some of the tips and tricks we showed make your journey to migrate all your containers to non-root easier. The future here might seem kind of dark after those demos, but it's actually pretty bright, and the thing we've referenced many times in this presentation is Linux user namespaces. User namespaces support in Kubernetes basically means that running as root inside the container does not mean running as root on the host, which is the exact property we want. So let's look at a demo of this. Here I've set up Docker to run in a user namespace, and I'm going to run a container with the name userns. It's going to run the CoreDNS image we looked at earlier, the version that does not have file capabilities, and on startup it runs the CoreDNS binary. So let's run this container and look at the state from within it.
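A sketch of that kind of build (the base images and the CoreDNS tag are our illustrative choices, not the talk's exact Dockerfile; `setcap` comes from Debian's `libcap2-bin` package, and we run it in the final stage so the extended attribute carrying the capability is set on the shipped binary):

```dockerfile
FROM debian:bookworm-slim
# Install the libcap utilities so setcap is available.
RUN apt-get update && apt-get install -y --no-install-recommends libcap2-bin \
 && rm -rf /var/lib/apt/lists/*
# Pull the binary out of the official image (illustrative tag).
COPY --from=coredns/coredns:1.10.0 /coredns /coredns
# Add NET_BIND_SERVICE to the binary's file capabilities, so a non-root
# process running it can still bind port 53.
RUN setcap cap_net_bind_service=+ep /coredns
ENTRYPOINT ["/coredns"]
```

One trade-off of this sketch: the result is a Debian-based image rather than the original minimal one, which keeps the example simple at the cost of image size.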
So let's do a process listing within the container to see what's going on. As you can see, the CoreDNS binary, the first process started in the container, is running as root, or at least it thinks it's running as root within the container, which is the critical point here. Now let's do a process listing on the host to see the state of the process running this CoreDNS command. Here I'm listing the processes running the CoreDNS command along with their usernames and UIDs, and as you can see, it's running as a non-root UID on the host. This is exactly the property we want: running as root in the container should not mean running as root on the host. Support for this Linux user namespaces feature has been added to Kubernetes. It was first discussed in KEP-127 and became alpha in Kubernetes 1.25, so it's available in all versions from 1.25 on. Because it's alpha, you need to enable the user namespaces support feature gate. You also need containerd version 1.7.0; note that this support is experimental as of 1.7.0. And you'll also need runc 1.4. This feature today only supports stateless pods, which are pods that only mount emptyDirs, secrets, and config maps; if your pod mounts any other volume type, you won't be able to use this feature. We tried really hard to come up with a demo of this feature, but since containerd support was experimental and some of these things are alpha and brand new, we couldn't get it to work. In general, to get this feature to work, you also have to add hostUsers to your spec. Hopefully by the time you watch this, we've gotten it to work. So now let's compare this feature against the design decisions and challenges we covered. First, you don't need to modify your container.
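A sketch of a pod opting into a user namespace (the pod name and image tag are illustrative; this assumes the alpha feature gate and a new-enough containerd and runc, as just described):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo
spec:
  # false = do NOT use the host's user namespace; the pod gets its own,
  # so UID 0 inside the container maps to an unprivileged UID on the host.
  hostUsers: false
  containers:
  - name: coredns
    image: coredns/coredns:1.10.0   # illustrative tag
```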
You can just leave it running as root, because running as root inside now no longer means running as root on the host. The second decision we had to make was: should we do this in-container or at runtime? You just do it at runtime: you set the hostUsers field to false. In terms of enforcement and auditability, for us this won't change much, because we'll just check at pre-submit that hostUsers is false, and you can audit similarly. UID and GID management is really great with this feature, because the kubelet handles allocation and uniqueness of the UIDs as seen on the host, so that problem completely goes away. Support for host file systems and other host things is a big maybe with this feature, so hopefully some of the solutions we showed for running as non-root while still accessing things on the host help you there. Root capability management is again great, because capabilities inside the user namespace are only valid in that namespace; they're void everywhere else. So this feature fixes that challenge too. With that, I'm going to hand it back to Greg. Cool, thanks for that. All right, where did we get to? Hopefully we convinced you that there's quite a lot of security value in running as non-root, and you're motivated and excited about doing it. If you're a container owner watching this talk, it would be great if you migrated your own containers to run as non-root. You'd make the whole internet safer, and in most cases probably no one will even really notice; things will just be much safer. If you're watching this and thinking, well, I have a whole ton of containers, a whole company's worth of containers, that I want to do this with, then hopefully you can use the strategies we showed for setting up a program to migrate them. The stateless ones are fairly easy.
So if that's most of your containers, hopefully it's not too much work. As we explained, the volume-mounting ones are where the difficulty is, and the hostUsers feature should make that a bit easier in the future. So thanks for coming and watching the talk. There's a QR code here that sends you to the schedule, where you can leave feedback; the slides are also there. The demo code is linked here, and we've also got a big pile of links to all the things we referenced on the slides. Thanks a lot.