Hello everybody, welcome to this presentation, "Making Kubernetes Safer with User Namespaces". I'm Mauricio, and I work as a software engineer at Kinvolk. In this presentation I will explain what user namespaces and capabilities are in Linux, what the problem is with running containers as root, how user namespaces mitigate that problem, and what the current status of user namespace support in Kubernetes is. And finally, I will give you a demo of using user namespaces in Kubernetes. Before going into the details of the problem and the solution we are looking for, let me introduce the user namespace and capability concepts. In Linux, users and groups are identified by an integer, usually called the UID and GID, and each running process has a set of associated UIDs and GIDs. In this context, there are two categories of processes. First there are the privileged processes, the ones that run with UID 0; they are also called root processes. These processes bypass the kernel permission checks, so they are able to do almost anything on the system. The other kind are the unprivileged processes, the ones that run with a UID different from zero. In this case, the kernel performs permission checks based on the process credentials, deciding whether the given user ID or group ID is allowed to execute an operation or not. Another important concept here is capabilities. Capabilities are the privileged actions that a process can execute, divided into distinct units. For instance, if a process needs to change the system clock, it has to have the CAP_SYS_TIME capability, and so on for other operations. Basically, they define the set of privileged operations a process can execute.
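Both concepts can be inspected directly from any Linux shell; a minimal sketch:

```shell
# Print the current UID: 0 means a privileged (root) process,
# anything else means the kernel will perform credential checks.
id -u

# The Cap* lines in /proc/self/status show the process's capability sets;
# CapEff is the effective set the kernel actually checks against.
grep '^Cap' /proc/self/status
```

For a root shell, `CapEff` is typically a full bitmask; for an unprivileged shell it is usually all zeros.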
A process has a set of these capabilities, and based on that, the kernel decides whether the process can execute a given operation. So what is the problem we are trying to solve? Basically, the problem is that many people run their containers as root. Running a container as root means that the process inside the container runs as root: it runs with UID 0, as a privileged process, so it can perform privileged operations in the container. In this case, the host is only protected by the isolation provided by namespaces. We have the process running inside the container, and hopefully this process is not able to perform any operations on the host, because it is isolated by namespaces. The problem is that namespaces are not perfect; in some cases there are vulnerabilities, and a process running in a container is able to escape those namespaces and affect the host. If the process is running as root in the container and it is able to escape, it can affect the host in a very bad way, because that process will be running as root on the host too. These are some examples of vulnerabilities that were exploitable because processes run as root inside the container; I will show you the last one in more detail in the demonstration part. So, how do we mitigate this problem? The first question you have to ask yourself is whether you really need to run your containers as root. Probably the answer is no; probably you don't need to be running them as root. What happens is that this is the default option in most of the tools: each time you build a Docker container image, the default user is root, and people just don't care about that and keep using the default. If you don't need to run the container as root, in Kubernetes you can use the runAsUser and runAsGroup options in the pod specification to change the user.
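As a sketch, a pod spec that drops root via the runAsUser/runAsGroup fields mentioned above (they live under securityContext; the pod name and image here are illustrative) might look like this:

```shell
# Write a minimal pod manifest that runs the container as UID/GID 1000
# instead of the image default (often root).
cat > nonroot-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nonroot-demo
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
EOF
# Deploy it with: kubectl apply -f nonroot-pod.yaml
```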
If you are a cluster administrator, you can use Pod Security Policies or Open Policy Agent to deny running containers as root or, better, to control who is able to run containers as root. If you really do need to run a container as root, then you can limit the set of capabilities you give to the process, setting only the capabilities that the process needs to work, and you can also use other orthogonal security features like seccomp, AppArmor, and so on. But these are just mitigations. Let's look at how to actually solve the problem and add an additional security layer that protects us against it: user namespaces. User namespaces are just another kind of namespace. They isolate user and group IDs, and also capabilities. Specifically, a process can have different user and group IDs depending on whether it is viewed from inside or outside the container. What is most important for us is that a process running as root inside a user namespace can actually be running as non-root in the host user namespace. These ID mappings are controlled by two files, the uid_map and gid_map files. What happens is that the initial (host) user namespace has about 4 billion UIDs, and we give a range of those UIDs to the user namespace of each container. For instance, here we have container one, and this configuration says that the root UID in container one is going to be mapped to UID 1,000 on the host, and the third number is the length of the range of UIDs (or IDs in general) that are mapped. So a process running as root in container one will be seen as UID 1,000 on the host. If we have another container, we can have a similar configuration, but in this case we give it the 200,000 range on the host. So both containers run as root inside the container, but the IDs used on the host are different for each of them.
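The mapping files just described live under /proc; each line reads `<ID inside the namespace> <ID outside> <length>`. A quick look from any Linux shell:

```shell
# In the initial (host) user namespace the mapping is the identity over the
# full 32-bit range, i.e. "0 0 4294967295".
cat /proc/self/uid_map
cat /proc/self/gid_map

# A container-one-style mapping like the example above would instead read:
#   0  1000  65536
# meaning root (0) inside maps to UID 1000 on the host, for 65536 IDs.
```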
Another important point about user namespaces is that they isolate capabilities. When the kernel checks the permissions of a process, it uses the user namespace the process is running in. There is a relationship between user namespaces and the other kinds of namespaces, like network, PID and so on: each of them is owned by a user namespace. And a process that has a capability in a user namespace can only use that capability on resources that are owned by that user namespace. Let's look at this in more detail with a standard setup. We have the initial user namespace on the host, and the host network namespace, which is owned by the initial user namespace. Then we have a container that is running in a different user namespace, and the network namespace of that container is owned by the user namespace of the container. Let's suppose we have a process running in that container, which means the process runs inside the container's user namespace. If this process wants to open, for instance, a privileged port in the container's network namespace, the kernel will check for the CAP_NET_BIND_SERVICE capability in the user namespace that owns the network namespace, which is the user namespace of the container. The process is running in that namespace, so it has the capability and can perform the operation. On the other hand, if the process tries to open a privileged port in the host network namespace, this is not going to work, because the owner of the host network namespace is the initial user namespace; the process is not running in that user namespace, hence it doesn't have the capability there, and it is not able to perform the operation. What is important to notice here is that with user namespaces we are able to give some capabilities to a process, but those capabilities can only be used in the container context; they are not valid on the host.
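This ownership rule can be seen with `unshare` (util-linux): inside a new user namespace that owns a new network namespace, an unprivileged user effectively holds CAP_NET_BIND_SERVICE for that network namespace and can bind a privileged port. A sketch, assuming `unshare` and `python3` are available and unprivileged user namespaces are enabled on the host:

```shell
# --map-root-user maps our UID to 0 in the new user namespace; --net gives
# us a network namespace owned by it, so binding port 80 there is allowed
# even though we are unprivileged on the host.
unshare --user --map-root-user --net python3 -c '
import socket
s = socket.socket()
s.bind(("", 80))          # privileged port, allowed inside the namespace
print("bound :80 inside the new namespaces")
' || echo "user namespaces not available in this environment"
```

The same bind attempted directly in the host network namespace, as the same unprivileged user, fails with a permission error, because the capability check happens in the initial user namespace.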
Okay, so I already explained what user namespaces are; what are we using them for here? We are trying to reduce the risk of running root containers in Kubernetes by using user namespaces. This is an open, pending issue, something that hasn't been done yet. The current status of this support is that we are iterating on a Kubernetes Enhancement Proposal: we are communicating with the community, trying to find a solution that works well for most of the cases, and we implemented a proof of concept in Kubernetes and containerd. This is what I will demonstrate. So maybe you are asking: if user namespaces are so nice, and there is a problem with root containers, why is there no support for user namespaces in Kubernetes yet? Well, there are some challenges that are difficult to solve, and that's the reason. The first challenge is about volumes. If we have two containers that are running with different ID mappings, they are not able to share files using a volume, because the UIDs on the host are different. When one container tries to access a file that was written by the other container, the UIDs don't match and there can be permission problems. The solution for this would be to have shiftfs or a similar mechanism in the kernel, a mechanism that performs an ID translation when a process inside a user namespace accesses a file. Unfortunately, it is not supported in Linux, maybe ever, we don't know; there is some ongoing work. There are id-mapped mounts, which are a similar idea that Kubernetes could benefit from, but the point here is that we don't have this support in the kernel today, so we have to look for a different solution right now. That solution is to use the same ID mapping for containers that share volumes. In this case, the UIDs on the host are the same, and the containers can share files through volumes.
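To make the mismatch concrete, here is an illustrative host-side view of a shared volume (the UID numbers 100000 and 200000 are the hypothetical per-container mappings from the discussion above; the chown requires host root, so it is shown as a comment):

```shell
# Files on a volume carry the *host* UID the kernel recorded at write time.
mkdir -p /tmp/shared-vol
touch /tmp/shared-vol/data
# As host root you could reproduce container A's ownership (root in A,
# mapped to host UID 100000) with:
#   chown 100000:100000 /tmp/shared-vol/data
# Container B (root mapped to host UID 200000) would then fail the kernel's
# permission check on that file, since 200000 != 100000.
ls -n /tmp/shared-vol   # -n prints numeric UID/GID, the IDs the kernel compares
```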
This, of course, reduces pod-to-pod isolation a bit, because the UIDs on the host are the same, but the host is still protected, because root in the container is non-root on the host. Another challenge we have is high UIDs. A process running in a container cannot use all the UIDs; they are not all available, because we split the host range of UIDs among the containers, so each container gets only a small set of UIDs. The solution we have for that right now is to give each container as big a range as possible, something like 64K, and in some cases to avoid using user namespaces for those containers that require even higher UIDs. In the future, we could have solutions like a smarter analysis of the IDs required by a container, that is, inspecting the container image to see which IDs it needs; maybe extending the OCI image specification to add information about which UIDs and GIDs a container needs; or having a different kind of support in the kernel, so that each user namespace could use the full range of UIDs. But that is just an idea, and probably not something that is going to happen in the near future. In this demonstration, I will show you how to deploy a pod using user namespaces in Kubernetes. We are using the proof of concept we implemented in Kubernetes and containerd, which implements some of the ideas we have discussed in the Kubernetes Enhancement Proposal. I have a cluster running on my machine, so let's get started. The first thing I want to show you is how to deploy a standard pod, the standard way without user namespaces; please notice that we don't have any user-namespace-related configuration here. In order to deploy a pod with user namespaces, we have this new userNamespaceMode field in the pod specification, and in this case we are setting it to cluster.
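For reference, a pod using the proof-of-concept field just mentioned might look like the sketch below. This field exists only in our PoC branch, not in upstream Kubernetes, and both the field name and its values may change as the KEP discussion evolves:

```shell
cat > userns-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo
spec:
  userNamespaceMode: cluster   # PoC-only field; not in upstream Kubernetes
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
EOF
# Deploy against a cluster built from the PoC: kubectl apply -f userns-pod.yaml
```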
Cluster means that all the pods running in this mode are going to share the same ID mapping; this allows them to share files using volumes. I don't want to get into the details of this userNamespaceMode field and the different modes that are supported, because this is still something we are discussing with the community, so we don't know if it is going to be merged in this way or if there are going to be changes; I only want to show you the general concepts here. Note that the command in the two pods is different; I'll show you why in just a second. So let me deploy those pods. Okay, and let me show you the UID mapping in those pods. For the pod running in the host user namespace, there is no mapping: we can see that the root UID in the container is mapped to the root UID on the host, and the full length of the range is mapped, which effectively means no mapping is going on. On the other hand, for the pod running in a user namespace, the root UID in the container is mapped to UID 500,000 on the host, with a 64K-long range. Let's look at the processes that are running and how they appear on my host. We have this process, the one running in the pod without user namespaces: we can see that it is running as root on the host. But this process, running in the pod with user namespaces, is running as UID 500,000 on the host, as it should be. So this shows that a root process running in the container runs as non-root on the host when we enable user namespaces. Now I want to show you a vulnerability that can be mitigated by using user namespaces. Some time ago a vulnerability was found in runc that allows a malicious container to overwrite the runc binary on the host.
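The checks shown in the demo can be reproduced from the host shell for any process; here CONTAINER_PID is a placeholder for the PID of the containerized process (it falls back to the current shell so the commands run anywhere):

```shell
CONTAINER_PID="${CONTAINER_PID:-$$}"

# Which host user is the process actually running as?
ps -o user=,pid=,comm= -p "$CONTAINER_PID"

# Its UID mapping: a line like "0 500000 65536" would mean container-root
# is mapped to host UID 500000 with a 64K-long range; "0 0 4294967295"
# means it is in the host user namespace (no effective mapping).
cat "/proc/$CONTAINER_PID/uid_map"
```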
I'm running an old version of runc that is affected by that vulnerability, just to show you how it can be mitigated with user namespaces. What the exploit does in this case is append a string to the runc binary on the host. We can check here that the runc binary doesn't contain this string yet. Let me show you the spec of the exploit pod. This is a standard pod; note that we are not using user namespaces, but we are running this special image that contains the exploit. Let me deploy that image and wait a little bit until the container is done. It's done; let's look at its logs. We can see here that it says the runc binary was updated. Let's check that: effectively, we have this new string in the runc binary on the host. Of course, this is just an example; appending a string to a binary is not too bad, but in a real attack this would be very serious, because the attacker is able to introduce whatever code they want into the runc binary, and that code will be executed as root on the host whenever a container is created. Let me reinstall the runc binary. We can see here that the string is not present anymore. Now let me show you how user namespaces can mitigate that vulnerability. In this case, we have the same image; the only difference is that we are running the pod in a user namespace. Okay, the pod is completed; let's look at its logs. In this case, we can see that there was an error opening a file descriptor, because of a permission denied error. This is because the exploit, which is running as root in the pod, is not running as root on the host; hence, it doesn't have the permission to overwrite the runc binary on the host. If we look at the runc binary, we can see that it doesn't include the string.
So that means the runc binary wasn't modified by the exploit. This is an example of one of the vulnerabilities that can be mitigated by using user namespaces. Okay, that's all. I think there are a couple of minutes for questions. Thank you very much.