Hi, I'm Rodrigo. I'm a software engineer at Microsoft. Hi, I'm Marga. I'm director of engineering at Isovalent. We are here today to talk about a new feature that has just been added to Kubernetes that will help us run our workloads more securely, called user namespaces. Rodrigo, can you share with us a bit of background on what motivated this new feature? Yeah, you know how in Linux distributions, most applications run with their own user and group? Sure, yes. For example, Nginx, MariaDB, or BIND all run with their own users and groups. This has been the case for many years because it limits the permissions of each application running on the machine. Right, exactly. But when we run our applications inside containers, they usually run as root, and root in the container is also root on the host. Yeah, that doesn't sound like a good idea, does it? Indeed, this can be a serious problem. For example, if there is a vulnerability that allows a container breakout, like CVE-2019-5736: this was a vulnerability rated high severity that allowed someone controlling the image of a newly created container to overwrite the runc binary on the host. That sounds scary. The attacker would basically have root privileges on the host and could then move on to compromising the whole cluster. Yes, and what is interesting is that the whole problem could be completely avoided if Kubernetes had support for user namespaces, as that mitigates this CVE and many others as well. So cool. Can you tell us about another one? Sure. Very recently, while we were already doing the work of adding user namespaces to Kubernetes, another CVE rated high was discovered: CVE-2022-0492. In this case, a malicious container could trick the kernel into executing a certain binary file as root after a process finished, allowing it to get root access on the host. With user namespaces, this CVE is also completely mitigated.
In short, running our containers as root is a problem because whenever there is a vulnerability that allows escaping the container, the malicious process can compromise the host, other pods, and maybe even use it as a stepping stone for a wider cluster attack. Right. In the end, the technologies we use to isolate our containers are also software, and like all software, they can have bugs. So the less privilege we use to run our software, the better, and user namespaces are a very important step in this direction. But before jumping into the details of user namespaces, let's clarify that by namespaces we don't mean the namespace parameter we use in kubectl; that is just a logical grouping of pods. During this talk, we refer to the Linux kernel feature for process isolation. There are actually quite a few Linux namespaces that we use when we start a pod in Kubernetes. Let's take a look at them. For example, we use a UTS namespace to give a unique hostname to each pod. This way, the pod doesn't have the same hostname as the host it is running on. We also have a network namespace. In this case, all containers in the pod share the same network namespace, which allows them to communicate over the localhost interface, but they need to use the virtual network to communicate with containers running in other pods. That way, pods only see the network traffic they are supposed to see. There's also a PID namespace: the processes visible in each container are limited to the processes inside that container, so each container only sees processes from its own container. And similarly with other namespaces, like the mount namespace, the IPC namespace, or the cgroup namespace. But of course, the focus of this talk is user namespaces, so let's talk about how those work. As we said at the beginning, before Kubernetes 1.25, the user running inside the container was always the same user as the one outside.
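As a quick illustration of the shared network namespace within a pod, here is a minimal pod sketch (the pod name, images, and port are illustrative, not from the talk): because both containers share one network namespace, the second can reach the first over localhost.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-netns-demo
spec:
  containers:
  - name: web
    image: nginx            # serves on port 80 inside the shared network namespace
  - name: probe
    image: busybox
    # reaches the sibling container over localhost, no pod IP needed,
    # because all containers in a pod share one network namespace
    command: ["sh", "-c", "sleep 5 && wget -qO- http://localhost:80 && sleep infinity"]
```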
So whenever we ran a process in our containers as root, it was running as root on the host as well. On top of this, if we were using an unprivileged user and granting a specific capability to it instead of using the root user, the capabilities granted to the container were valid on the host as well. Now, when we enable user namespaces, the user in the container is no longer the user on the host. The processes in the container can still run as root, but all interactions with the system will be mapped to a different user. A user namespace isolates security-related identifiers and attributes of a process: in particular, user and group IDs, keys, and capabilities. The way to use them in Linux is to split all of the available IDs and assign subsets to different user namespaces. The whole range of available UIDs is a 32-bit number. In the Kubernetes implementation, we reserve the first 16-bit range for the host, from 0 to 65535. This means processes and files on the host should use this range for UIDs and GIDs. The 16-bit range of UIDs after that can be assigned to the first pod on the node using user namespaces. We map UID 0 and GID 0 inside the pod to UID 65536 on the host, so root in the container is an unprivileged user on the host. Inside a pod, the range of UIDs goes from 0 to 65535, or 16 bits, and all the settings in the pod spec that refer to a user or group ID, like runAsUser, runAsGroup, fsGroup, and so on, refer to the ID inside the container. So the whole space of user IDs is split into 65,536 chunks of 65,536 UIDs each. The IDs assigned to a pod are chosen automatically by the kubelet; no two pods have the same range of IDs, so there is no overlap. You can think of this a little bit like a NodePort on a Service: you leave it empty and some free port will be assigned. Usually you don't really care which one, as long as it's a free port. The same applies here: some range will be assigned.
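The mapping arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the chunking scheme the talk describes; the function names are ours, not kubelet code:

```python
UID_RANGE = 65536  # 16-bit chunk size; chunk 0 (UIDs 0-65535) is reserved for the host

def pod_range_start(pod_index: int) -> int:
    """Host UID where this pod's chunk begins (pod 0 gets the chunk right after the host's)."""
    return (pod_index + 1) * UID_RANGE

def container_to_host_uid(pod_index: int, container_uid: int) -> int:
    """Map a UID seen inside the container to the UID the host actually sees."""
    assert 0 <= container_uid < UID_RANGE  # only 16 bits of UIDs exist inside the pod
    return pod_range_start(pod_index) + container_uid

# root (UID 0) in the first pod becomes an unprivileged host UID:
print(container_to_host_uid(0, 0))   # 65536
# the 32-bit UID space holds exactly this many 16-bit chunks:
print(2**32 // UID_RANGE)            # 65536
```

This also makes the non-overlap property obvious: each pod index selects a disjoint chunk, so no two pods can share a host UID.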
We don't really care which one, as long as it doesn't overlap with the ranges assigned to other pods. Because of this mapping, the amount of damage a rogue pod can do to the host and to other pods is very limited. And obviously, a pod will never be mapped to host UID zero; this means a pod with user namespaces will never run as privileged on the host. Also, the capabilities granted to the pod are only valid inside the pod. For example, if we grant CAP_SYS_ADMIN, often considered the new root, the capability is only valid inside the pod, and very useful there, but if the process breaks out of the container to the host, the capability is not valid. Additionally, there are some capabilities that are just not usable inside a user namespace. One of these is CAP_SYS_MODULE, needed to load a kernel module. If you grant CAP_SYS_MODULE to a pod using user namespaces, you still won't be able to load a kernel module. So, Rodrigo, how do we enable this super cool feature? To use this feature in Kubernetes, we just need to add the hostUsers field to the pod spec. We set it to false to indicate that we don't want to use the host's users. This will create a user namespace, so the users are isolated. This follows the same syntax as other namespace settings like hostNetwork or hostIPC. And what is super interesting about this is that the vast majority of workloads can enable it without any changes to the app, and just benefit from the extra security. So cool. So what are the current requirements and limitations for enabling user namespaces? That's a great question, because we have a bunch of them for now, but we'll work on removing some of them in the coming versions. First, as we already mentioned a couple of times, we need at least Kubernetes 1.25. Then we need support in the container runtime for this feature too. Is that support ready? It depends on the runtime. If you're using CRI-O, version 1.25 already includes it.
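Enabling the feature is a one-line change to the pod spec; a minimal sketch (the pod name and image are illustrative, and since this is alpha in 1.25, the corresponding feature gate must also be enabled on the cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo
spec:
  hostUsers: false          # don't use the host's users: create a user namespace for the pod
  containers:
  - name: sleeper
    image: busybox
    command: ["sleep", "infinity"]
```

Note the parallel with `hostNetwork` and `hostIPC`: the field says whether to share the host's namespace, so `false` is what opts the pod into isolation.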
If you're using containerd, the patches have not been merged yet, but we're aiming to merge them for the 1.7 release, which is expected this year. And if you're still using Docker, you can't use this feature today, and there are no plans from them to implement it, at least for now. I see. And how do user namespaces interact with other namespaces that we can also enable or disable in Kubernetes? When we enable user namespaces, we can't use any other host namespace. This means we cannot mix hostUsers: false with hostNetwork: true, hostIPC: true, or hostPID: true. Makes sense; the behavior otherwise would be quite confusing. Now, we mentioned that all interactions with the system are mapped back to a different UID on the host. On Linux, UIDs are used by the OS to enforce file access permissions. So how does this affect accessing data on shared volumes? If my pod creates a file in a shared volume, will other pods be able to read it? Right, that's the tricky part. If we just allowed volumes with no further thought, it wouldn't work, because, as we said, pods running in different user namespaces are mapped to different UIDs on the host. The files created by one pod would be owned by one UID, but other pods would run as completely different ones, so it would be impossible to share a volume, for example. So for now, we only support stateless pods. In the future, we plan to support stateful pods, probably using a new Linux kernel feature created by my coworker Christian Brauner, called idmapped mounts. This feature was created to solve the very specific problems that containers face with volumes and user namespaces. Wait, so what do you mean exactly by stateless pods? I mean pods either without volumes, or pods that use simple volumes tied to the life cycle of the pod, so there's no real persistence. The current implementation supports these volume types: Secret, ConfigMap, DownwardAPI, Projected, and EmptyDir.
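A stateless pod in this sense can still use all the ephemeral volume types just listed; a sketch under assumed names (the pod name is illustrative, and the `app-config` ConfigMap is assumed to exist):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-stateless
spec:
  hostUsers: false
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
    - name: config
      mountPath: /etc/app
  volumes:
  - name: scratch
    emptyDir: {}            # tied to the pod's lifecycle, so it is supported
  - name: config
    configMap:
      name: app-config      # ConfigMap volumes are also in the supported list
```

Persistent volume types are what require the idmapped-mounts work, since their files outlive any single pod's UID range.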
And finally, another limitation to consider is that the files on mounted volumes will always have permissions for the group. So if we want to store an SSH key in a Secret and mount it, we won't be able to use it, as applications typically enforce that an SSH key doesn't have permissions for the group, right? Right, yes, this is a limitation of the current implementation. Solving this and adding support for stateful pods is on our roadmap for future releases. And another limitation that we're looking to solve in future releases too is that some files are mounted by the kubelet, like /etc/resolv.conf, and those files are owned by root on the host, so they cannot be written by root inside the container today. Right, there's always more work to be done. So how about we see how this looks in action in a demo? Sure, let's do a demo. Just a comment: this demo is on my laptop, but I want to highlight that we will make this available in Azure as soon as it's feasible. Now, without further ado, let's see the demo. Here's a Kubernetes cluster running Kubernetes 1.25. It's also using containerd with our patches that are not yet merged. So what do you think if we start by creating a pod? What will the pod do? The pod will not do anything interesting; it will just sleep, but it will have the CAP_SYS_ADMIN capability. Yeah, so it can do whatever it wants. It can do whatever, yeah. And we'll exec into the pod to execute our exploit. All right. This pod is not using user namespaces; it doesn't have the hostUsers: false field. So let's first apply this pod YAML. Let's wait until the pod is running. Let's see. Yeah, it's running. Let's check that this command is indeed running as root on the host. So, the sleep infinity... yeah, it's running as root. Yeah, the sleep infinity is running as root on the host. So let's see the script I created to exploit this CVE. It's based on the CVE and some things I found on the Internet. The CVE works in the following way.
With cgroups v1, you have the possibility to register a release agent: a script that will be executed when there are no processes left in a cgroup, and it will be executed as root. The vulnerability was that this file could be written without the proper permission checks. So here we're first creating an empty file as the release agent, and then we're adding these lines to the file: one to print the capabilities, and another to print which user this is running as; and we leave a sleep process lingering there. We're writing all of this to this file on the host. And when will this be executed? The script triggers, in its last line, the condition that there are no processes left in the cgroup. So as soon as we execute the script, everything should happen and this file should exist, if we are vulnerable to the exploit, as we expect. Yeah, let's see. Let's copy the script to the pod, and let's execute the script now. No output; this is probably a bad sign. Let's see if the file was created. The file is created, and it's owned by root, both the user ID and the group ID. Let's see the content of the file. Will it have the capabilities? Yes, it has all the capabilities, and it says that it's running as root. Everything is running as root. And let's see if the sleep process is also lingering there. And yes, it is indeed lingering there, running as root. End of the world, disaster. So let's try again, now with user namespaces. Yes. So here is the same pod YAML; we just added this field, hostUsers: false. With this field, it should enable user namespaces, and therefore the pod should not be running as root on the host. That's what should happen. Let's see if the pod is ready... running. The pod is running already. And is it running as root, or as an unprivileged user? Let's check. So it's also running sleep infinity.
And it's running as this unprivileged user, so this is a good sign. Now, there is a thing to take into account when executing this. This is the very same script we just executed, but as you can see, it starts with a mount line. The thing is, we cannot mount inside this pod, but we can create a nested user namespace where we have all the capabilities needed, and we can mount there. So let's first copy the script to the pod, and let's verify that this line will not work, just to be sure. Yeah, so this line doesn't work: permission denied. So let's execute this. This is a technicality that doesn't really matter, but if we use this nested namespace, we can execute all the lines there. So here we can see that the mount was executed correctly. However, basically everything that tries to create files or directories gets permission denied. The most interesting one to check is this one, because we couldn't create the release agent file; this is the binary that would be executed as root. So we couldn't create the file. Seems like this didn't work. But let's see if the file exists. The file doesn't exist. And the content of the file: of course, if it doesn't exist, there is no content. Yes. And there's no sleep process lingering there either. Right. So we couldn't break out of the container with this exploit. Just adding this simple boolean to the pod spec made all the difference, because the container doesn't run as root on the host now, and these files can only be written by root on the host. And this is a very good thing, because a lot of images that need to run as root can, in most cases, continue to run as root unmodified when user namespaces are enabled, and just get this extra protection, as they don't really need to run as root on the host. Amazing. So to recap: since we are running several applications on the same host, we need a way to isolate the users.
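One quick way to confirm the mapping from inside a pod is to read /proc/self/uid_map, whose lines have the form `inside-UID outside-UID length`. A small hedged sketch of a parser; the sample line matches the first-pod mapping described earlier in the talk, not output captured from the demo:

```python
def parse_uid_map(text: str) -> list[tuple[int, ...]]:
    """Parse /proc/<pid>/uid_map text into (inside_uid, outside_uid, length) triples."""
    return [tuple(int(field) for field in line.split())
            for line in text.splitlines() if line.strip()]

# With hostUsers: false, the first pod's uid_map would look roughly like this:
sample = "0 65536 65536\n"
inside, outside, length = parse_uid_map(sample)[0]
print(inside, outside, length)  # 0 65536 65536: container root is host UID 65536
```

Without a user namespace, the file instead shows the identity mapping over the whole 32-bit range, which is exactly why root inside is root outside.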
We're already isolating the network, the view of processes, the file systems, etc. But until now, we were not isolating the users. This ignored a lesson learned by Linux distributions many years ago: run processes as different users, with the least privilege possible. With user namespaces, we bring some of those best practices back. Even if our applications run as root in the container, they become more secure, because the process is actually running as an unprivileged user on the host. If a vulnerability allows an attacker to escape the container, the damage is limited, as the user is unprivileged and doesn't have any additional capabilities. And enabling user namespaces doesn't require changes in the app for the vast majority of workloads. This means that several vulnerabilities are no longer exploitable with user namespaces enabled. For example, the first three CVEs in this list are completely mitigated; they can't be exploited if the pod is using a user namespace. The other two are partially mitigated. Those CVEs allow attackers to read arbitrary files from the host, and when the user on the host is unprivileged, they can only read files that have read permissions for others; without user namespaces, they could read any file on the host. And these are just a few examples. There have been more vulnerabilities related to accessing files on the host, and most of them are mitigated, as they are based on the same underlying issue: exposing the host file system to the container. With user namespaces, we never run as root on the host, so we can only read files if they have permissions for others. If we don't want an application to run as privileged, we shouldn't run it as root on the host. Today, we run most of them as root and try to limit what root can do with seccomp and AppArmor. Instead, we can now run them as "root, but not root": users that aren't privileged outside the pod.
And we can still apply seccomp and other Linux security modules for additional isolation. Exactly. Before we go, I'd like to thank Giuseppe Scrivano from Red Hat, who is co-author of the KEP with me. We both worked on the Kubernetes implementation; I'm working on the containerd support too, with some members of the community, and he also worked on the CRI-O implementation. I would also like to thank people from the sig-node and sig-storage communities that helped move this KEP forward, which was first started seven years ago by other engineers. A lot of thanks to Derek, Manuel, Dawn, Sergei, Hem, and Tim. And thanks to my coworkers at Microsoft, Mauricio and Alban, who helped me with some early revisions of this KEP. Indeed, many thanks to everyone that helped make this a reality. And thank you for watching.