Hello everybody, welcome to this presentation about supporting user namespaces in Kubernetes. I'm Mauricio, I work as a software engineer at Kinvolk, and these are my contact details in case you want to reach out.

In this presentation I will explain the motivation for implementing support for user namespaces in Kubernetes: the risk of running containers as root. I will explain what user namespaces are and why they are important for the security of a Kubernetes cluster. Then I will show you a bit of history, as well as the work we have been doing with the community to implement support for user namespaces in Kubernetes. I will also present a demo of the proof of concept we have implemented. Finally, I will explain the steps we still have to take to get this support into Kubernetes.

So let's get started with the problem. The main motivation for implementing user namespaces in Kubernetes is that running containers as root is very dangerous. When we say that a container is running as root, we mean that a process inside the container is running as root. Such a process is able to perform privileged operations inside the container, and the host is usually protected by the isolation that the Linux namespaces provide. Unfortunately, this isolation is not perfect, and in some cases such a process could be able to escape the container. If a privileged process escapes the container, that is very bad for the host, because the process will be running as root on the host and will be able to do a lot of damage there.

These are some examples of vulnerabilities found in the last years that could have been mitigated by user namespaces. The last one is especially critical: in that case an attacker is able to overwrite the runc binary on the host just by using a specially crafted container image.
After that happens, the attacker has full control over the host machine. As I told you, these are just some examples of vulnerabilities that have been found in the past; we think it is very probable that more vulnerabilities will be found in the future, and it is also possible that some existing vulnerabilities are already being exploited by attackers.

Let's look at some of the mitigations we could use to lower the risk of running root containers. Actually, the first question we have to ask is: do we really need to run the containers as root? In many cases the answer is no; many of the containers that are running as root today could run without root without any issue. If we don't need to run as root, we can use the runAsUser and runAsGroup fields in Kubernetes to change the user and the group used to run the pod. That is from a user point of view. From an operator point of view, we could use pod security policies, and also a mechanism like the Open Policy Agent, to control who is able to run containers as root.

Another mitigation is to limit the set of capabilities we give to the pod: grant only the capabilities the pod needs and no extra ones. Another possibility is to use the orthogonal security features that are supported by Linux and also by Kubernetes, and there are other things we could do to mitigate the risk, one of them being to use an immutable operating system.

Okay, so let's go into some details of user namespaces and why they are important for increasing security. User namespaces are just another kind of namespace in the Linux kernel. This time the resources that are isolated are the user and group IDs, and also the capabilities. Regarding user and group IDs, a process can have different IDs inside and outside a user namespace.
In particular, we could have a process that is running as root inside a user namespace while running as non-root outside the user namespace. The relationship between the IDs inside and outside the namespace is controlled by an ID mapping.

What we usually have is this: in the initial user namespace on the host, all the IDs are available, and we assign a portion of that range to a pod. In this example we are saying that UID 0 in the pod is going to be 100,000 on the host, and we are going to map 64K IDs. We can have multiple pods; for instance, here we have a second pod, and the only difference is that it maps to ID 200,000 on the host. It is interesting to notice in this example that both pods have different UIDs on the host. This is what we call non-overlapping ranges on the host, and it is important from a security point of view because it lets us isolate the pods from each other: we could mitigate some of the attacks that one pod could perform against other pods.

Another important feature of user namespaces is that they also isolate capabilities. When the kernel performs a capability check, it takes into consideration which user namespace the process is running in, in order to decide whether the process is allowed to perform the privileged operation or not. There is a relationship between non-user namespaces and user namespaces: non-user namespaces, like the network namespace, the PID namespace, and so on, are always owned by a user namespace, and a process that has a capability in a user namespace can only use that capability on resources that are owned by that specific user namespace. I know this is not easy to understand from the definition alone, so let's go through an example to make it clearer.

In this case we have a host, the initial user namespace on the host, and also a network namespace on the host.
Here the network namespace on the host is owned by the initial user namespace. In this example we also have a pod; this pod is running with a different user namespace, that user namespace is owned by the user namespace of the host, and the network namespace of the pod is owned by the user namespace of the pod.

Let's suppose we have a process running inside the pod, which means the process is running inside that user namespace and inside that network namespace, and the process wants to perform a privileged operation on a resource controlled by the network namespace of the pod. For instance, let's suppose the process wants to bind to a privileged port, which requires a capability. In this case the kernel does the capability check against the user namespace that owns the resource, which here is the user namespace of the pod. The process has that capability in the pod's user namespace, so it is allowed to perform the operation.

On the other hand, let's suppose the process wants to open the same port, but this time in the host network namespace. The host network namespace is owned by the initial user namespace, so the capability check this time is performed against the initial user namespace. The process doesn't have the capability in that user namespace, hence it is not allowed to perform the operation.

What is important in this example is that by using user namespaces we are able to give some capabilities to a process running in a pod, so that the process can perform privileged operations on the resources owned by that pod, while at the same time those capabilities are meaningless for resources not owned by that pod, for instance resources of other pods or resources on the host.

Okay, let me talk about user namespace support in Kubernetes; this is the KEP we have been working on.
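To make the two mechanisms I just described more concrete, here is a small Python sketch of the inside/outside UID translation done by an ID mapping, and of a capability check evaluated against the user namespace that owns a resource. This is a toy model, not real kernel or Kubernetes code; all the names, ranges, and numbers here are illustrative, and the capability check is simplified (the real kernel rule also considers namespace ancestry).

```python
def to_host_uid(container_uid, host_base=100000, length=65536):
    """Translate a UID inside a user namespace to its host UID,
    the way a /proc/<pid>/uid_map entry like '0 100000 65536' would."""
    if not 0 <= container_uid < length:
        raise ValueError("UID not covered by the mapping")
    return host_base + container_uid

# Root (UID 0) inside the first pod is an unprivileged UID on the host:
assert to_host_uid(0) == 100000
# A second pod with a non-overlapping range maps its root elsewhere:
assert to_host_uid(0, host_base=200000) == 200000

def may_use_capability(proc_userns, cap, resource_owner_userns, caps_by_userns):
    """Simplified model: a capability only counts when the check is done
    against the user namespace in which the process holds it."""
    return (proc_userns == resource_owner_userns
            and cap in caps_by_userns.get(proc_userns, set()))

caps = {"pod-userns": {"CAP_NET_BIND_SERVICE"}}
# Binding a privileged port in the pod's own network namespace succeeds,
# because that network namespace is owned by the pod's user namespace:
assert may_use_capability("pod-userns", "CAP_NET_BIND_SERVICE",
                          "pod-userns", caps)
# The same capability is meaningless for a host-owned resource:
assert not may_use_capability("pod-userns", "CAP_NET_BIND_SERVICE",
                              "host-userns", caps)
```

The point of the sketch is exactly what the slides show: the mapping makes in-pod root harmless on the host, and the ownership rule confines the pod's capabilities to the pod's own resources.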
Before going into the details, let me show you a bit of history about user namespaces, Kubernetes, and so on. Namespaces were introduced in Linux almost 20 years ago, and 10 years later, in 2013, user namespaces were implemented in Linux. In the same year Docker was launched, one year later Kubernetes was launched, and in 2015 support for user namespaces was introduced in the OCI specification. Then, in 2016, the first issue was opened about supporting user namespaces in Kubernetes. In 2018 there was a proposal for providing some support for user namespaces in Kubernetes, and the same year there was also a PR with an implementation for Kubernetes. Unfortunately that PR was closed, so the support was not merged into Kubernetes. Then last year, in 2020, we opened a new proposal for implementing this support in Kubernetes. And in 2021, this year, support for ID-mapped mounts was merged into Linux; I will talk a little more later about why this is important for the work we are doing.

You might be wondering: if user namespaces are that old, introduced into Linux around 8 years ago, and many of the tools already support them, for instance the different container runtimes, the OCI specification, and so on, why is Kubernetes not supporting user namespaces? It turns out this is not that easy to do, and there are some challenges we have to overcome in order to provide that support.

One of the challenges is duplicated snapshots: when we use user namespaces, in some cases we have to create a copy of the snapshot and chown it to fix the ownership in order to create the container. This means that when we create a container that uses user namespaces, we need more time to copy the snapshot and to change its ownership. Another challenge we have is the support for volumes.
The problem is that if we have multiple pods using different ID mappings, those pods are not able to share files using volumes, because the host IDs of those pods are different, so there could be issues while accessing those files on shared storage.

Another challenge is supporting high UIDs. What we usually do is assign to each pod a portion of the whole host UID range; in other words, we assign a range of the host IDs to each container, so a process running inside that container is not able to use all the IDs that would be available on the host.

There are some solutions, I would say in some cases more workarounds than solutions, but this is the best we can do right now. The main workaround is to run multiple containers with the same ID mapping. By doing that we solve the problem of the volumes, and we also solve the problem of the duplication of the snapshots, and so on. The key point here is the granularity: how many pods do we want to share the same ID mapping? For that we have defined different ID mapping modes, each of them with a different granularity.
The first idea we had was to support a cluster mode: in this case all the pods in the cluster share the same ID mapping. Another idea we got from the community was to support a namespace mode, where all the pods in the same Kubernetes namespace share the same ID mapping. There was also an idea about service accounts; this is the same as before, but in this case the ID mappings are assigned per service account. And finally, the other idea is to implement a pod mode, where each pod gets a different, non-overlapping mapping. The idea is not necessarily to implement only one of them; the idea is to see which of them make sense and which of them we will implement. So let's compare the different modes.

On the left we have the different modes I just presented, and on the top we have the different challenges, plus the pod-to-pod isolation offered by each mode.

For the cluster mode, the duplicated-snapshots issue has a good solution, because all the pods in the cluster share the same ID mapping, so we are basically able to reuse the same snapshot for all the containers in the cluster. The namespace and service-account modes offer a more or less good solution, let's say a medium one, because we are only able to reuse the snapshot within the same namespace: if a container runs in two different namespaces or under two different service accounts, we have to create different copies of the snapshot for those containers. For the pod mode this is really bad: all pods use different ID mappings, so we have to create a different snapshot for each pod.

Regarding the support for volumes, the cluster mode is pretty good, because again we have the same ID mapping for all the pods, so we are able to share a volume with all the pods in the cluster. For the namespace mode the support is more or less medium, because in this case we are
able to share the volume across the pods inside the same namespace (the same holds for the service-account mode), but we are not able to share that volume between pods running in different namespaces. Finally, for the pod mode this is again very bad, because we could only share a volume with a single pod.

Regarding the support for high UIDs, the cluster mode offers a very good solution, because we can assign a pretty big range of UIDs to the cluster, and then each pod within the cluster is able to use all those IDs. For the namespace and service-account modes this is bad, because we have to coordinate the allocation across the different nodes of the cluster and we also have to guarantee that the probability of a collision is very low, which means we can only assign a small number of UIDs to each namespace or each service account. For the pod mode the solution is good, because we don't have to coordinate the allocation between the different nodes; the only thing we have to guarantee is that the pods on each node don't get overlapping mappings, so we are able to give each pod a pretty big range.

Regarding pod-to-pod isolation, the cluster mode is very bad: of course, since we are sharing the same ID mapping, there is basically no isolation between the pods. The namespace and service-account modes are more or less medium: we have some isolation between pods running in different namespaces or with different service accounts, but no isolation between pods in the same namespace or with the same service account. The pod mode offers the best isolation, because each pod has a different ID mapping, so the pods are fully isolated.

Okay, let me show you a demo of the proof of concept of this support that we implemented in Kubernetes. In this demonstration I will show how we can protect the host against one of the vulnerabilities I
showed you before by using user namespaces. I will use the runc vulnerability. What I have here is a runc binary that has that vulnerability; this is a very old version that was released before the vulnerability was fixed. And here I have a local Kubernetes cluster running on my machine. What the exploit does in this case is write a string into the runc binary, so before running it, let me show you that we don't have that string in the runc binary.

Let me show you the exploit I'm going to use. This is just a standard pod template; the only important part is that I'm using a special image that contains the exploit for that vulnerability. Let me create that pod and wait a little bit until it is running. It is already running, so let's look at the logs of that pod. We can see that the exploit says the runc binary was updated. Let's check whether that actually happened: yes, in this case we can see that the exploit in the container was able to overwrite the runc binary, and in this case it just adds a string to the binary. Let me reinstall the original runc binary; the string is not present anymore.

Now let me show you how we can mitigate that vulnerability by using user namespaces. This is another pod template, almost the same as before; the only modification is that in this case I'm enabling the usage of user namespaces by using the cluster mode I described before. Let's create this pod and wait a little bit until it runs. Okay, the pod has finished; let's get its logs. In this case we see that there was an error, which means the exploit was not able to modify the runc binary, because the UID the exploit was using on the host was not the owner of the runc binary. In fact, if we check the runc binary, we can see that the string is not present this time. This is just a quick demonstration of some of the vulnerabilities that we could prevent by using user
namespaces.

Okay, so what are the next steps we have to take to implement this support in Kubernetes? Probably the first thing we should start doing is to consider the ID-mapped mounts feature that is now merged into the Linux kernel. This feature will be available in Linux 5.12, and it provides a real solution to the duplicated-snapshots and volume-sharing issues; in other words, with this feature we are able to implement the pod mode without all the complications we were having. If we look at the modes comparison again, taking the ID-mapped mounts feature into consideration, we can see that for the duplicated snapshots and the support for volumes, all the modes now have excellent support, because we have support for that in the kernel, so this is something we no longer have to care about. The support for high UIDs and the pod-to-pod isolation stay the same, so we will have to make the decision about which modes to implement based only on these two criteria, and we no longer have to consider the duplicated-snapshots and volume-sharing issues. As we can see, in this case the pod mode looks like a very good candidate to implement, because it has good support for high UIDs and it also offers excellent pod-to-pod isolation.

Okay, just before finishing, here is some reference material in case you want to know more details: some documentation about user namespaces, capabilities, and so on in the Linux kernel; two blog posts we wrote about the work we have been doing with the community; the pull request for the KEP; and some more information about ID-mapped mounts. I think this is all. Thank you very much for your attention, and I'm happy to take any questions you may have.