Welcome to the talk, Running Non-Root Made Easy. My name is Luboslav Pivarc, and I will be your guide throughout the presentation. So who am I? I am a software engineer at Red Hat, where I have worked on KubeVirt for a couple of years, specifically three. It is a very interesting project, and we had interesting problems to tackle. That's why I'm here: to share the problems and the solutions. I'm also a tutor at the university where I graduated, where I teach students to tackle problems and give them tools and approaches to solve them. I hope I will be as helpful to you as I am to them, and that you will take something from this presentation today.

So, a little bit about KubeVirt, where I come from. It is a Kubernetes extension for running virtual machines inside Kubernetes, side by side with containers. The benefit is that you can reuse a lot of the stack, for example networking: if you want to communicate from containers to the VMs and the other way around, it's entirely possible. We reuse many of the features you already use for your container workloads, which is a great benefit, and there are a lot of use cases. I could talk about KubeVirt for a long time, but I'm going to narrow the focus a little. There is no magic behind KubeVirt: we run the virtual machine inside a pod, just as you would run any arbitrary workload. That brings the benefits I mentioned, but also the same problems you have with containers. We recently transitioned to running non-root. The pod we used relied on a lot of capabilities and privileges, and we wanted to do better and be more secure. Looking at these problems in retrospect, I have seen that they are not specific to KubeVirt; they are actually common, and you will probably hit them too.
If you are not yet convinced that you should be running non-root inside your clusters, let me make a short case for it. It goes without saying that you should follow the principle of least privilege, which reduces the surface somebody can exploit, and if the day comes when somebody exploits your application, it will be much harder for them to then compromise your system. You are containing the blast radius of the pod. It is also well known that if you are not running as root and you do get exploited, the attacker needs to somehow escalate privileges before they can proceed with malicious actions on the host.

Many people talk about security, and specifically about not running as root in production. There are also tools to help, such as Pod Security Standards. This is the feature that replaced PodSecurityPolicy (PSP), as you may know it. There are three levels of policy you can enforce, and one of them, the restricted one, is probably the goal for you. I will paraphrase what the restricted policy says about itself: it follows current pod hardening best practices, at the expense of compatibility. You could add that it is not only at the expense of compatibility, but also of user experience, of running your workloads easily. And of course, today or in the future, you may be required to meet some compliance standard, and not running as root is likely going to be part of it.

Finally, why are all these people and tools telling you this? Well, today, root in the container is effectively the same as root on the host. If you escape the container, you are most likely going to have root on the host. This is true because we don't really have user namespaces yet; well, we have them, but only for a couple of configurations. I'm not going to speak about user namespaces today, but there is a talk about them tomorrow.
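Pod Security Standards are enforced per namespace via labels; a minimal sketch, with an illustrative namespace name:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: hardened-apps        # illustrative name
  labels:
    # Reject pods that violate the restricted profile...
    pod-security.kubernetes.io/enforce: restricted
    # ...and also warn on violations so you can see what would break.
    pod-security.kubernetes.io/warn: restricted
```

Any pod created in this namespace that runs as root, or asks for extra capabilities, is rejected by the admission controller.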
So check it out. Before we move to the problems, what's the goal? For this talk I'm not going to use KubeVirt, because that's more complex. I'm going to use Nginx, a web server you probably know, and maybe already run if you use Nginx Ingress, for example. Today, your workload may look like this: it runs as root, user 0. By the end of the talk, we want to run as an arbitrary user, for example 1000.

How do we get there? Let me remind you how access control works in Unix. Every file, directory, socket, and binary is assigned a UID and a GID. Those are just numbers representing the user and the group; we then associate names with these numbers, much as we do for IPs with DNS. For each of these categories we assign a permission list, and the most important bits are read, write, and execute. Based on this, the kernel checks whether a user actually has access to the file, binary, socket, or anything else. Why am I talking about this? Because access control is the source of most of your troubles, and that is a good thing: we are trying to harden security and limit what the workload can do inside the pod.

Images. What is the first thing you do when you try to run an application inside Kubernetes? You choose or build an image for your application. This is also the first thing that goes wrong. Why is that? The image represents the file system that will be available to your application, which means the permission bits are hardcoded inside the image. If you then switch to a user that does not have access to what it needs, you get into trouble. As I mentioned, you either build your images or consume them; it's up to you. Consuming an image may be convenient, but it also gives you less flexibility.
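The UID/GID and permission-bit model described above can be seen with a couple of standard commands; a small illustration:

```shell
# Every file is assigned an owner (UID), a group (GID), and permission bits.
touch demo.txt
chmod 640 demo.txt       # rw- for the owner, r-- for the group, nothing for others
ls -l demo.txt           # shows the bits, owner, and group, e.g. -rw-r-----
stat -c '%a' demo.txt    # prints the octal mode: 640
```

The kernel consults exactly these bits when a non-root process tries to open or create a file.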
You try to find an image that is already able to run as non-root, which, fortunately, is becoming good practice; image maintainers are increasingly publishing images built for an unprivileged user. Still, depending on where it comes from, some images may not be suitable for unprivileged workloads. But in that case we have the beauty of images: we can take one and modify it as we wish and as we need in order to run it in our cluster. There are many tools for building container images, and depending on your tool you can choose one of several approaches. One approach is to use package managers, but package managers are quite opinionated; sometimes packages are built for root only, so keep this in mind. You can always build things from source, but that is probably not the most convenient thing to do.

So what actually happens when we try to run Nginx as non-root? When I look at the logs, I get this message; the important part is highlighted in red. We fail to create a directory at a certain path. OK, that's arguably a good sign, because we see the security is working. But how do we tackle the problem? It would be really tedious to modify the image for one path, run it again, and repeat. To do a better job, we can use existing tools. One of them is strace, which traces system calls and signals; it can be used to determine the paths the application touches or creates. We can simply modify the image to include strace, then execute strace first and give it the actual application as the argument. After that we get a lot of logs; it can be verbose, so please refer to the documentation on how to filter the calls you want to see. What matters on the slide is that we see the mkdir call, and we see which path we need to handle.
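Wrapping the entrypoint with strace can be sketched like this; this assumes a Debian-based nginx image, and the package name and entrypoint arguments are illustrative:

```dockerfile
FROM nginx:1.25
# Add strace so we can watch which paths the server tries to touch.
RUN apt-get update && apt-get install -y strace && rm -rf /var/lib/apt/lists/*
USER 1000
# Trace only file-related syscalls (mkdir, openat, ...) to keep the output readable.
ENTRYPOINT ["strace", "-f", "-e", "trace=%file", "nginx", "-g", "daemon off;"]
```

The failing mkdir or openat calls then show up directly in the container logs, with the full path and the EACCES error.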
We will see this for all the paths we need to handle. But keep in mind that it is not enough to just start the application; you actually need to exercise some functionality, because the application can create or open files on the fly.

Modifying the image is one way, but can we do better? Yes, we can, because we have emptyDir, one of the volume types bound only to the pod lifecycle. We can mount it at a path that needs to be writable by the application. Keep in mind that when we do this, the content of the directory is hidden by the mount, so you can only use this approach if nothing at that path is required by your application. But how do I know if there is something on the path? Again, we can use existing tools, such as dive, which I used before, which lets you inspect the layers of your image. This again reduces the turnaround time. On the slide, on the left you can see the layers, and on the right side the file system with the permissions and the owners of the files, directories, and more.

I also want to highlight that it is very important to build efficient images. I'm referring to COPY with the change-ownership argument. If we compare the two approaches to building the image, we can see that the final product is very different. The sample file holds one gigabyte of data, and if we execute the chown inside a RUN command, we actually create a new layer which copies the whole file again. So please refer to the documentation of the tools you are using in order to make such changes more efficiently.

Now, when we have solved all the permissions on the files and directories and everything we need to touch, we can run the image again. But it's not going to work, because, as you can see, we fail to bind on port 80. Port 80 is a so-called well-known port; there is a range of ports that are significant to the system.
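Mounting an emptyDir over the paths the application needs to write might look like this; the exact paths are illustrative and depend on what strace revealed for your image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-nonroot       # illustrative name
spec:
  securityContext:
    runAsUser: 1000
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    # Writable scratch space for the unprivileged user; anything baked
    # into the image at this path is hidden by the mount.
    - name: cache
      mountPath: /var/cache/nginx
  volumes:
  - name: cache
    emptyDir: {}
```

Because the volume is owned by the pod lifecycle, no image rebuild is needed when the writable paths change.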
And therefore the system requires a capability called NET_BIND_SERVICE. So let's have a look at how we can approach this problem. We can look at the API reference and add the capability quite simply, as you can see on the slide. But it doesn't work.

What is a capability? By the dictionary definition, it is the ability to execute a specified course of action or achieve certain outcomes. Capabilities are the reason we distinguish two types of processes, privileged and unprivileged. It is not root itself that is special; it is the capabilities that make root special. To explain what happens and why it stops working when we change the user to non-root, I need to tell you that a process has several sets of capabilities. The most important ones are the permitted and effective sets. The permitted set is the upper bound for the effective one, and the effective set is where you need to have the capability in order for the kernel to allow the action or give you the outcome.

I forgot to mention: what actually happens when we switch the user to non-root? The permitted and effective sets are zeroed out, meaning there are no capabilities for you, and you need to make some changes. There are two kinds of changes you can make today. One is to modify your application. This comes at a cost: the application needs to be aware of capabilities, and you need to program it. There are libraries for this, but it is work you probably don't want to do. The Kubernetes API is normally beautiful in that what you specify is what you get, but that promise doesn't hold here. So what are the steps in this case? You need to find out which capabilities you actually need, so refer to the documentation.
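The straightforward attempt of adding the capability through the API looks roughly like this, and as just described, it is not enough on its own once the container runs as non-root:

```yaml
containers:
- name: nginx
  image: nginx
  securityContext:
    runAsUser: 1000
    capabilities:
      # Granted to the container, but for a non-root user the permitted
      # and effective sets end up empty, so binding to port 80 still fails.
      add: ["NET_BIND_SERVICE"]
```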
Then you need to programmatically look up your effective set and ask the kernel to add the capabilities to it in order to be able to perform the actions. After that, you can continue with your normal application logic. This is probably not the best solution for everyone, because you may not want to modify the application, or you may not even have its source code.

To tackle this differently, we have something called file capabilities. File capabilities are capabilities attached to a file; when the file is executable and we execute it, the resulting process inherits those capabilities. This, of course, requires modifying the image, which is also not very convenient. How do we do it? We set the capabilities on the binary we are going to execute, for example with the setcap program, which is available on most Linux distributions.

This is still not a perfect state. Can we do better? There is already a KEP for ambient capabilities. Ambient capabilities are just another set, which lets us use capabilities the same way we use them for root: when we specify them in the YAML, we actually get them on the process. There is an issue tracking this, and some of the work has already been done; the CRI has been updated, and now we are probably only waiting for the Kubernetes API. If you want this feature, I would ask you to weigh in with your opinion, either on the KEP or on the issue.

Finally, we are able to run Nginx, but now suppose we need to accelerate the workload. Let's imagine we need to pass a GPU to the container. This is a problem, because the device is going to have the same owner and group as it has on the host, which only by accident will match the user you are using inside the container. In any other case, you are not able to use the device.
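Setting the file capability during the image build can be sketched like this; the binary path and the package providing setcap are assumptions for a Debian-based image:

```dockerfile
FROM nginx:1.25
# setcap ships in libcap2-bin on Debian-based images.
RUN apt-get update && apt-get install -y libcap2-bin && rm -rf /var/lib/apt/lists/*
# Attach the capability to the binary: any process executing it gets
# CAP_NET_BIND_SERVICE in its permitted (p) and effective (e) sets.
RUN setcap 'cap_net_bind_service=+ep' /usr/sbin/nginx
USER 1000
```

Note that the capability must also remain in the container's bounding set, so keep the `add: ["NET_BIND_SERVICE"]` entry in the pod spec.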
There is a great Kubernetes blog post by Mikko Ylinen that goes into a bit more depth, but the most important point is that you only need to adjust the container runtime you are using, for example CRI-O or containerd, and here is an example of what you need to do to achieve this with CRI-O. After that, the container is able to access the device.

Now suppose we have a different use case and need to share a volume between two pods. The provisioned file system is not going to have standard permissions, because there is no standard for that, meaning your user most likely will not be able to use the persistent volume. And even once we get it working, keep in mind that the two pods need to use the same user. The solution to this problem is really simple and you probably already know it: you need to set fsGroup. What does it do? It recursively changes the ownership to the specified group and sets a special permission bit, setgid, which makes sure that all new directories and files will belong to the group you specify. And therefore you can access the volume.

But what about block volumes? As the name suggests, fsGroup is not going to work here, because we actually have a device. To solve this issue, we need to go back a little and modify the container runtime so the device is accessible to the pod's user. So, the same solution as for devices. After we apply it, it finally works.

One last special case. How many of you actually run non-root in production? Can you raise your hands? OK, thank you very much. And how many of you are running with SELinux or AppArmor? Not that many, or maybe I can't see. So, simply put, what is SELinux? SELinux gives a label to everything in the system. The labels here on the slide have a user, a role, and a type, and the type is the important part.
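The runtime adjustment mentioned for CRI-O is a single drop-in option (the file path is illustrative); containerd has an equivalent setting:

```toml
# /etc/crio/crio.conf.d/10-device-ownership.conf
[crio.runtime]
# Chown devices injected into the container to the pod's
# runAsUser/runAsGroup instead of keeping the host's owner and group.
device_ownership_from_security_context = true
```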
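Setting fsGroup is one line in the pod-level security context; a sketch with illustrative names and IDs:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-pod     # illustrative name
spec:
  securityContext:
    runAsUser: 1000
    # Volumes are recursively chowned to GID 2000 and the setgid bit is
    # set, so files created later also belong to this group.
    fsGroup: 2000
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: shared-pvc   # illustrative claim name
```

Any pod that should share the volume sets the same fsGroup, so the pods no longer need to agree on a single UID.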
The type makes sure that the workload inside the container can only touch the file system and everything else inside the container; it cannot touch anything on the host, because the host is of a different type. The second important part is the categories, those numbers. They represent a unique identifier, let's say: if I have c208 on one pod and c202 on the second, the first pod cannot touch the things of the other. This gives us better security. Please do not turn off SELinux when you see issues; do not be like somebody I knew in the past.

So what are the quirks with SELinux? This one is very specific; we actually stumbled upon it, and most of you will not see it, because it depends on the way you build your images. I talked about setting capabilities on a file, right? Doing that also sets a label on the binary, and if you are, for example, using tar to produce the layers of the image, you may pack this label into the image as well. So when you run the workload and the image is extracted on the node, what happens when we execute the binary? The label on the underlying layer is the effective one, the one the kernel checks, and it will see a different type, which we set by accident when running setcap. Therefore it's going to fail, and these kinds of issues are very, very hard to debug. I hope that if you ever see this issue, you will now know how to solve it.

A second problem we stumbled upon is sharing a file system between two pods. One thing is permissions, but the second thing is the SELinux labeling. Why? When we spin up the pod, the kubelet will see which categories we used, and it will apply the categories to each file and directory on the file system. But when we spin up the second pod, it will do the same, meaning the first one will lose access.
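Managing the categories explicitly on both pods, the workaround described next, can be sketched like this; the level value is illustrative, and keeping the pair unique across the cluster becomes your responsibility:

```yaml
spec:
  securityContext:
    seLinuxOptions:
      # Both pods sharing the volume specify the same MCS categories,
      # so relabeling by the second pod no longer locks out the first.
      level: "s0:c123,c456"
```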
This is probably not the state we want to be in, right? The solution is simple, yet not very elegant in my opinion: you need to manage the categories yourself and specify them in the pod YAML. This means you can actually get a conflict of categories and, by accident, have two pods with the same category, which is not the best thing for security. There is a new feature, mounting with context, which allows the CSI driver to mount the file system with a given context, specifically an SELinux label. In this case the content of the file system is not touched, and therefore both the first and the second pod will see the right categories and will be allowed to interact.

Let me say a few last words. Companies, contributors, individuals, and communities are investing a lot in security: a lot of resources, a lot of effort. But if we don't make the security features usable, easily usable and consumable by the end user, we are not going to run secure environments. Therefore we need to do a little retrospective, and when we design a new security feature, we need to make sure it is easily usable by end users, so they can adopt it at massive scale. Thank you very much for your attention and for your participation.