I'm Mario Batket, Senior Software Engineer at Red Hat, and today we are going to talk about continuous security: an introduction to capabilities and seccomp profiles. Let's get started. First things first: what are Linux capabilities? Before jumping into what capabilities are, we need to understand that in Linux we have two types of processes. Privileged processes, whose effective user ID is zero, also referred to as superuser or root processes, and unprivileged processes, whose effective UID is non-zero. Privileged processes bypass all kernel permission checks. Unprivileged processes, on the other hand, are subject to full permission checking based on the process's credentials, usually the effective UID, the effective GID, and the supplementary group list. Starting with kernel 2.2, Linux divides the privileges traditionally associated with the superuser into distinct units, known as capabilities, which can be independently enabled and disabled. Capabilities are a per-thread attribute, meaning that every thread has its own capabilities assigned. If you look at the image on this slide, you can see a box labeled root. This is before Linux kernel 2.2: at that point, root processes were assigned full privileges, and you could not remove privileges from those processes. Then, after Linux kernel 2.2, we have the same root process, but now with different capabilities, which are the blue boxes here. In this case, we can assign different privileges to the process. This image represents a process that has the whole set of root privileges, just like before kernel 2.2. And then we have a third image with the same root process, but now we only assign three capabilities, which means this process can only use those three capabilities and not the ones that were removed. So it doesn't have the full root privileges enabled. Some examples of capabilities:
Well, we have, for example, CAP_NET_RAW, which allows the process to use raw and packet sockets. Then we have CAP_CHOWN, which allows the process to make arbitrary changes to file UIDs and GIDs. And there are a lot more capabilities, around 40 in total, but at the end of the slides there's a link where you can go and check the whole list of capabilities present in the Linux kernel as of today. Why do we talk about capabilities? Because sometimes we will need to use them with containers. And in fact, container runtimes already use capabilities. For example, you can check the default capabilities enabled by Podman and the ones enabled by default by containerd on the links in the slides. At the end of the day, as you know, containers are processes running on our system, so we can assign those capabilities to the processes which are containers. The default capabilities enabled by the runtimes are assigned to every container: if the user does not add or drop capabilities, all your containers will get the capabilities that are defined as defaults in the runtime. Then one question you may have during the presentation is: okay, how can I know which capabilities are required by my application? There is no magic tool that will tell you which capabilities are actually required; the developer needs to understand what capabilities their application will need. For example, if we have an application that listens on port 80, and we want to run that application as non-root, which means as an unprivileged process, we will need the NET_BIND_SERVICE capability, which allows a process to bind to a port below 1024 without being privileged. That's something the developer needs to know; there is no tool that will give you that information. Okay, so as you can see here on the screen, there's a bunch of text. There are five capability sets.
I don't want to get into too many details; we will see how they work during the demo. You just need to know that those capability sets are used by the kernel to run permission checks, and depending on the configuration of the different sets, your process will be able to acquire certain capabilities or not. Then, on top of thread capabilities, we have file capabilities, which are capabilities assigned to an executable file that, upon execution, will be granted to the thread. File capabilities are stored in an extended attribute on the file system, but they act as different file capability sets. They are not real sets, but they work as if they were. For file capabilities we have only three sets: the permitted set, the inheritable set, and the effective set. You have seen those on the previous slide. As I mentioned before, we will see this in action, so let's go with demo one: capabilities on containers. In this first demo, we are going to see how we can run a container locally and get its thread capabilities. To do that, I will be using Podman. As you can see here, we are running this image with Podman. This image has an application that listens on a given port, but that's not important for now; we will be using this image throughout the demos. Now that we have the container running, we can get the container PID using podman inspect. And we can always get capabilities from the proc file system: if we grep for Cap in /proc/<container PID>/status, we will get the different thread capability sets. If you remember, they are inheritable, permitted, effective, bounding, and ambient. As you can see, this returns the capability sets in hex format, and we can use a tool named capsh to decode them. For example, let's decode the effective set. Okay, so now you can see that this thread has the following capabilities: chown, dac_override, fowner, et cetera.
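What capsh does under the hood is easy to sketch: each bit in the hex mask corresponds to one capability index. Here is a minimal Python version; note that the index-to-name table below is a small illustrative subset, not the kernel's full list (that lives in include/uapi/linux/capability.h):

```python
# Decode a /proc/<pid>/status capability mask the way `capsh --decode` does.
# Only a handful of capability names are listed here for illustration.
CAP_NAMES = {
    0: "cap_chown",
    1: "cap_dac_override",
    2: "cap_dac_read_search",
    3: "cap_fowner",
    10: "cap_net_bind_service",
    13: "cap_net_raw",
    21: "cap_sys_admin",
}

def decode_caps(hex_mask: str) -> list[str]:
    """Return the capability names for every set bit in the hex mask."""
    mask = int(hex_mask, 16)
    return [
        CAP_NAMES.get(bit, f"cap_{bit}")
        for bit in range(mask.bit_length())
        if mask & (1 << bit)
    ]

# 0x400 has only bit 10 set, which is CAP_NET_BIND_SERVICE:
print(decode_caps("400"))  # ['cap_net_bind_service']
```

Feeding it the CapEff value grepped from /proc/<pid>/status gives the same names capsh prints.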
Since we are using Podman, we can also use podman inspect to get the effective caps directly; that gives the same output. And with that, we conclude this first demo. Now what we are going to do is show the differences between running a container with UID 0 versus running a container with a non-root UID. If you remember, we mentioned in the slides that there are differences between the capabilities of privileged processes and those of unprivileged processes. So this time we are going to run the same container, from the same image, using user 0, and, with bash as the entry point, we will run a grep command inside the container that shows us the thread capabilities for the container. Okay, so now we run this grep, and you can see that we have the same capabilities we saw before. The important thing here is that we have those capabilities in all the sets: as you can see, they are in the inheritable, permitted, effective, and bounding sets. The ambient set is not set. Now we can exit the container, and run the same command, but this time using a non-root UID, in this case UID 1024. If I run id, you can see we are that user. Now, if we run the same command, look at this: the inheritable, permitted, effective, and ambient sets have all been cleared, all zeros, which means that we cannot acquire any capabilities, because none are in the inheritable or permitted sets. In the bounding set, though, we still have the capabilities that were present in the privileged process. The privileged process, in this case, created this container, and that process had these capabilities, so those end up in the bounding set, as we saw in the slides. Okay, so now we can exit. What we can do next is run the same command, but this time requesting the CAP_NET_BIND_SERVICE capability.
So what we are doing here is telling the container runtime that we want to be able to use this capability. What happens then is that the runtime puts that capability into the ambient set. Now that we have requested the capability, if we run the grep command you can see the difference: we still get the bounding set with all the default capabilities, as before, but in the ambient, effective, permitted, and inheritable sets we get only the NET_BIND_SERVICE capability. This means this container can only use the NET_BIND_SERVICE capability. And now we can exit. For the next demo, we are going to show a more real-world scenario, let's say. We mentioned earlier that our application can choose which port it listens on through an environment variable. So we are going to try to run our application using an unprivileged user on the unprivileged port 8080. Okay, that works: our application is listening on port 8080. The next test is to run this application with an unprivileged user, but listening on a privileged port, port 80. And as you can see, we get a permission denied, because we don't have the privileges to bind to that port. What happened is that even though the runtime has NET_BIND_SERVICE in the default capabilities assigned to containers, as you have seen in the previous demos, the capability sets get cleared when you execute from a more privileged to a less privileged process. So what we do now is, knowing that we need NET_BIND_SERVICE, run the same container, with the same user, on the privileged port 80, and add the CAP_NET_BIND_SERVICE capability. And you can see that this time we were able to bind to port 80, because we have the privilege required to do that.
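At the syscall level, what the demo exercises is the bind(2) permission check: binding below 1024 needs CAP_NET_BIND_SERVICE (or effective UID 0), while higher ports don't. A small sketch of that check, with a hypothetical helper name:

```python
import errno
import socket

def can_bind(port: int) -> bool:
    """Try to bind a TCP socket to the given port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind(("127.0.0.1", port))
            return True
        except OSError as e:
            # EACCES is what an unprivileged process without
            # CAP_NET_BIND_SERVICE gets for ports below 1024.
            if e.errno in (errno.EACCES, errno.EPERM):
                return False
            raise

# Port 0 asks the kernel for any free ephemeral port, so anyone can do it:
print(can_bind(0))   # True
# Port 80 fails for an unprivileged process without CAP_NET_BIND_SERVICE:
print(can_bind(80))
```

Running this as the unprivileged container user reproduces the permission denied seen in the demo; running it with the capability added succeeds.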
Now let's move on to file capabilities. If you remember from the slides, we talked about something called file capabilities. So what are file capabilities? They are capabilities assigned to executable files. In this case, when we created the container image that we will be using, we ran setcap with the file capability that we want to add, in the effective and permitted sets, on the binary; in this case, the binary for my application. That means that when my binary gets executed, it will try to get this capability assigned to the application thread. Let's try it. Again, we run this image, and we are going to read the file capability. We have this other command, getcap, with which we can query the binary, and it will tell us which file capabilities it has assigned; in this case, only NET_BIND_SERVICE. As you can see, the thread capability sets at this point are cleared out, except for the bounding set. If you remember from the slides, the bounding set controls which capabilities can be added to the inheritable and permitted sets. So if we run capsh --decode, we will see that CAP_NET_BIND_SERVICE is right there, which means that a binary with a NET_BIND_SERVICE file capability will be able to request that capability. Let's see how. Let's run our application in the background. It's working, as you can see, and we got PID number 5. So now we are going to get the thread capabilities for that thread. And as you can see, the thread for PID 5 only has the NET_BIND_SERVICE capability added to its capability sets, because that's the one the binary requested via file capabilities. But at this point, maybe you think you can use this to bypass thread capability checks.
But let's see whether that's actually possible or not. We have this podman run command, same thing: we are using the same entry point, an unprivileged user, and we are asking the runtime to drop all capabilities. We are still setting the app port to port 80, and we are using the image that has the file capabilities. If we look at the capabilities now, every set is cleared, including the bounding set. If we run getcap on the reverse-words binary, you can see that it still has the file capability. So let's see what happens if we try to run this binary. The kernel blocks the execution. Why? Because the binary is trying to get that file capability into its thread capabilities, but the thread cannot acquire it, because the bounding set is cleared: the parent process cannot create processes with capabilities that it doesn't itself have in the bounding set. So, can we bypass the kernel's capability checks? The answer is no. And with that, the demos around capabilities on containers are finished; we will see later how we can leverage capabilities on Kubernetes. Now that we are done with demo one, let's continue with secure computing. So what is secure computing, or seccomp? Applications usually require only a small subset of the underlying operating system's kernel APIs. For example, an HTTP server won't require the mount syscall at all: why would a web server need to run mount? So why should it have access to that syscall, and how can we limit that? In order to limit the attack surface of a subverted process running in a container, the seccomp Linux kernel feature can be used to limit which syscalls a process has access to. In the previous example, we could say: okay, the HTTP process can only use the subset of syscalls that I know is required for the HTTP server to run, and nothing else.
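A process can ask the kernel whether seccomp is active for its own thread via prctl(PR_GET_SECCOMP). A quick sketch using ctypes, assuming Linux and glibc; note the prctl(2) man page warns that in strict mode this call itself is fatal, so it is only informative when the mode is disabled or filter:

```python
import ctypes

PR_GET_SECCOMP = 21  # from <linux/prctl.h>

# Loading the main program's symbols gives us libc's prctl on Linux.
libc = ctypes.CDLL(None, use_errno=True)

# Returns 0 (disabled), 1 (strict), or 2 (filter) for the calling thread.
mode = libc.prctl(PR_GET_SECCOMP, 0, 0, 0, 0)
names = {0: "disabled", 1: "strict", 2: "filter"}
print(f"seccomp mode: {mode} ({names.get(mode, 'unknown')})")
```

Inside a container confined by a seccomp profile this typically reports filter mode; on an unconfined host process it reports disabled.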
By default, Kubernetes runs everything as unconfined, which means that all syscalls are available to the containers' processes. In Kubernetes 1.22, a new alpha feature allows users to configure a default seccomp profile that will be applied to all workloads unless otherwise specified. This means you could have a default seccomp profile applied to the workloads on your cluster, and if you need something different, you provide a different seccomp profile or run the workload as unconfined. In OpenShift, for example, we have a default seccomp profile that reduces the number of syscalls available to containers, but that profile is not applied by default to workloads; you need to explicitly say that you want to use it. Okay, so the next question is: how can I create my own seccomp profiles? The answer is that creating seccomp profiles can be tedious, and it often requires deep knowledge of the application. For example, a developer must be aware that a framework that sets up a network server to accept connections will translate into calls to the socket, bind, and listen syscalls. As you can imagine, that's not knowledge most developers have at hand. But there are some tools that can help you get the syscalls being made by an application. For example, you have tools such as oci-seccomp-bpf-hook, which we will use in our next demo, that will give us the syscalls being used by a process. Or you can use tools like strace or BPF-based tracers; there are many tools out there for getting the syscalls used by a process. Then there is another thing to keep in mind when creating a seccomp profile for containers, for example for Kubernetes: you want to create that profile under the same container runtime that you are using in your cluster. For example, if you are using crun, you want to create the seccomp profile using crun as a base.
If you are running runc, you want to create the seccomp profile using runc as a base. Okay, and now we have an example of a seccomp profile, so we can see what it looks like. The first thing we see is the default action, which is the action that will be applied to syscalls that are not defined in our profile. In this case, we are setting it to SCMP_ACT_ERRNO, which means that syscalls not allowed in our profile will be denied. Then we have the architectures, which are the architectures our seccomp profile is targeting. And then we have the list of syscalls. We can have different groups, and the groups can be allowed or denied. In this case, having groups of denied syscalls doesn't make much sense, because syscalls are already denied by default. But we could set the default action to log, which means that if a syscall is not defined in our profile but gets used, a message will be logged saying that syscall was used. That helps us, for example, to improve our seccomp profiles. Then, as I was mentioning, we have the syscalls section, with the names of the accepted syscalls that processes can use, and the action for those syscalls. In this case we have four syscalls, and these four syscalls can be used. And as I was mentioning, there are three different actions that we can configure for our syscalls: the first one allows the use of the specified syscalls, the second one denies the use of the specified syscalls, and the third one allows the use of the syscalls but logs the ones that are not explicitly permitted. And with that, we get to demo two, where we are going to see how we can use seccomp profiles on containers. For this demo, we are going to see how we can create our own seccomp profiles using a tool from the oci-seccomp-bpf-hook project. Okay, so first things first, we are going to install the hook; you will have the instructions in the demo files.
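Before moving into the demo, here is a trimmed profile with the shape just described; the syscall list is purely illustrative and would not be sufficient for a real workload:

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["socket", "bind", "listen", "accept4"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

Anything not in the names list is denied by the SCMP_ACT_ERRNO default; switching the default action to SCMP_ACT_LOG instead would let unexpected syscalls through while recording them, which is handy while iterating on a profile.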
One thing that is required in order to run this hook is that we need to run containers as root; that's why I'm using sudo here. So we are running our container, and we are adding an annotation to it, io.containers.trace-syscall, and we are giving it an output file, in this case /tmp/ls.json. We are running this Fedora 32 image, and we are running an ls on the root folder. We don't want any output from that command, so we redirect it to /dev/null. Let's see what happens. Okay, the execution is complete. If we take a look at the ls.json file, you can see that these are the syscalls that were used in order to run that ls command, all of these. And the hook already created the seccomp profile for us. We can see that, by default, any syscall that is not defined in the profile will be blocked, and that we only allow these syscalls to be used by our container. So what we can do now is run our container with this seccomp profile. We are saying: okay, security options, seccomp, we want to use this profile, which is the one we just generated. And we run the same image with the same command. Let's see if that works. Okay, it works. So what's the next thing we want to do? Let's see what happens if we modify the command a bit and run an ls -l. If you look at this, it says: cannot access root, operation not permitted. That's probably because we are missing some syscalls. So what we are going to do now is run the hook again, and we can do something really nice: now the input file is /tmp/ls.json and the output file is /tmp/lsl.json. What will happen is that all the syscalls we got from the first execution will be appended to the ones we get now. This is really useful when you are testing your application. For example, let's say that you have an application with two endpoints.
In the first execution you just tested one endpoint, and now you need to test the second one. If the different endpoints require different syscalls, this is an easy way of merging all the syscalls into a single file. So we execute this command again, and now if we run a diff between both files, we can see that running ls -l requires these other syscalls that were not present in the previous profile. So what we do now is use this new profile, /tmp/lsl.json, to run the application again. And now we can see that the application executed properly. That finishes the demo around seccomp profiles on containers; what we will see in a future demo is how we can leverage these profiles on Kubernetes. Okay, now that we know how to use capabilities and seccomp profiles with containers on our local system, it's time to see how we can do the same thing on Kubernetes. Capabilities in Kubernetes have some limitations at the moment when not using UID 0 in your workloads or, what is the same, when you are not running your containers as the root user. That's because ambient capabilities are not supported in Kubernetes, though this should be solved by a future Kubernetes enhancement. User namespaces can help here, but at this point support for user namespaces has only just made it into CRI-O, and it should be fully supported in Kubernetes in upcoming releases. Running containers with a fixed UID such as 0 has security implications: processes sharing the same UID with access to the same storage will be able to read and write the files on that storage, and if you can break out of the container due to a vulnerability, probably in the runtime, you will be able to see processes and files on the host owned by that UID. In those cases, tools like SELinux help reduce the attack surface. Until ambient capabilities make it to Kubernetes, we can use file capabilities or capability-aware programs.
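For reference, capability handling in Kubernetes is declared in the container's securityContext in the pod spec. A hedged sketch of the pattern; the image name, pod name, and UID below are placeholders, not values from the talk:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: captest
spec:
  containers:
  - name: app
    image: quay.io/example/app:latest   # placeholder image
    securityContext:
      runAsUser: 1024                   # non-root UID
      allowPrivilegeEscalation: true    # needed for file capabilities to take effect
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]
```

Dropping ALL and adding back only what the application needs keeps the container's capability surface minimal.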
We will see that in the demo. On this slide, you can see privilege escalation. There is a lot of text here, but what you need to know is that when you are using file capabilities or capability-aware programs on Kubernetes, those containers need to be allowed to perform privilege escalation, because you will be running a non-root process, and that non-root process needs to become a more privileged one. That requires allowing privilege escalation. You have a lot more information about that on this slide; you can read it after the talk. And with that, we are going to see how we can configure capabilities on Kubernetes, and the differences between running those containers on our local machine and running them in Kubernetes. For this demo, we are going to see how we can leverage capabilities on Kubernetes. The first thing we will look at is the difference between a pod running with UID 0 and a pod running with a non-root UID on Kubernetes. So first things first, we create a namespace for our tests; we are going to call it test-capabilities. Okay, so we have our namespace created, and now we are going to create a pod. In this pod, as you can see, we set the name, reverse-words-app-captest. We are using the same image that we used previously in the local tests. We give it a name, and the reverse-words container, as you can see, will be executed as UID 0. We run this, and after some time the pod will be running and we will be able to exec into it. Let's wait a bit and see how it goes. Okay, the container is creating; let's wait for it. Okay, now it's running, so let's try again. And now you can see that the capability sets are a bit different from what we saw in our local execution. Let's decode them. In this case, we have fewer capabilities than the ones we had with Podman. Why?
Because this is an OpenShift cluster using CRI-O under the hood, and those are the default capabilities assigned by CRI-O. Okay, so now we are going to run the same image, but with a non-root UID; in this case, we are running the container as user 1024. Let's get the pods. Okay, it's running now, so we grep for the capabilities. And here you can see the difference: since this container is running with a non-root UID, the permitted and the effective sets are cleared. You can see that the ambient set is also cleared; that's expected, because ambient capabilities are not supported by Kubernetes yet, so you will always get that capability set cleared. If you remember from the slides, with the ambient set cleared, that only leaves us two options if we want to run an application with a non-root UID that requires capabilities: file capabilities or capability-aware programs. Okay, so that finishes this first demo. Next, we are going to see how we can run an application with NET_BIND_SERVICE on Kubernetes. For this first deployment, we are going to run our application with root UID and drop every runtime capability except NET_BIND_SERVICE. So let's create the deployment. I will paste it here; as you can see, we are running this container as UID 0, and we are going to drop the capabilities that are assigned to every container by the runtime automatically. In this case, we want to drop all of them, and then add NET_BIND_SERVICE. That way, our container will only have access to the NET_BIND_SERVICE capability. Okay, now let's get the logs for this application. As you can see, the application is listening on port 80, so that worked well. If we check the capability sets for this application, let's see what we get.
Okay, as you can see, we only got the NET_BIND_SERVICE capability added to the different capability sets. That's expected. So now we are going to drop all the runtime capabilities again, add the NET_BIND_SERVICE capability on top of that, and request that the application run with a non-root UID. Let's do that. Again, I will paste the configuration; as you can see, there are different things here: we are running as user 1024, we drop the runtime's default capabilities as we did before, and we add NET_BIND_SERVICE. So at this point, the only difference between the two containers is that the former was running with UID 0 and this one will run with a non-root UID. Let's see what happens. If we get the logs, you can see that we got a permission denied. At this point, you might wonder why it's failing, because we ran this test on Podman and it worked. But I will tell you something: it worked in Podman because of the ambient capabilities. If you remember, the ambient set was the one allowing us to get that capability into the other thread capability sets, like the permitted and effective sets. And this is not happening in Kubernetes because, as I mentioned earlier, ambient capabilities are not supported in Kubernetes yet. So what we are going to do is patch our application so we can access it: we are changing the app port from 80 to 8080, and that will allow us to reach the container. Now we get the capability sets, and you can see that we have the NET_BIND_SERVICE capability in the inheritable set, and the same goes for the bounding set. So what are our options? Our options now are using capability-aware programs, or using file capabilities. So let's do that. We are going to patch the deployment again, setting the port back to port 80.
And we're changing the image to the one that has our binary with the NET_BIND_SERVICE file capability configured. So let's do that. Okay, it will probably take some time for the image to be pulled, so let me run get pods here. Okay, the container is creating; let's wait a bit. The container is running now, so we can get the logs, and you can see that our container is now listening on port 80. If we check the thread capabilities for process number 1, which is the process for our application, we see that the permitted and the effective sets got NET_BIND_SERVICE assigned. That is because when our binary was executed, the file capability was there and we had that capability in the bounding set, which allowed the permitted and effective sets to acquire that capability as well. And checking the capability on the binary file, you can see that we have the NET_BIND_SERVICE capability on the reverse-words binary. And with that, we finish the demo around capabilities on Kubernetes. Now that we have seen how capabilities work on Kubernetes, we are going to see how we can use seccomp profiles in Kubernetes. By default, the kubelet will look for seccomp profiles in the /var/lib/kubelet/seccomp path, but this path can be configured in the kubelet configuration file. You can have multiple seccomp profiles in the same folder. And remember that Kubernetes runs everything as unconfined by default, which, if you remember from previous slides, means that all syscalls are available for the processes to use. In Kubernetes 1.22, as we mentioned earlier, a new alpha feature allows users to configure a default seccomp profile. And with that, let's see how we can use seccomp profiles on Kubernetes. In this demo, we are going to see how we can use seccomp profiles on Kubernetes. Before starting with this demo, I added a profile to my worker nodes, so I can run this command here.
And you will see how one of my worker nodes has this profile, centos8-ls.json, loaded in its /var/lib/kubelet/seccomp folder, which is the default one, if you remember from the slides. What this ls profile allows us to do is run the ls command that we tested locally before. So what we are going to do now is create a namespace, like we did before. We can configure the seccomp profile at the pod or container level; this time we are going to configure it at the pod level, meaning that all containers within the pod will use this profile. This pod, as you can see, is using a securityContext seccompProfile of type Localhost, because in this case we are using a local profile. We could use RuntimeDefault, and we would get the default OpenShift seccomp profile that, as we mentioned in the slides, restricts the number of syscalls that can be made. And this is the file that we are using, centos8-ls.json. We are running this image, and then we are running the ls command on the root folder. Let's see what happens. Okay, it's being created. Let's see. Okay, you can see it completed, and it was able to list the different folders. So now we are going to run another pod, but in this case running ls -l. If you remember, in Podman this failed; let's see what happens in OpenShift. Also, you can see that this time, instead of setting the seccomp profile at the pod level, we are setting it at the container level, so we can have different containers using different profiles. Okay, so we created the pod, and now if we get the logs, you can see that we get the same error we got previously when we ran this demo on Podman. And with that, the demos around seccomp are finished. Okay, so now that you have seen how you can use capabilities and seccomp profiles on Kubernetes, the next question will be: okay, how can I manage capabilities and seccomp on Kubernetes and OpenShift?
In this presentation, we have been using capabilities and seccomp without taking into account authorization for these resources and all that. In a real-world scenario, you want to decide which users have access to which capabilities or seccomp profiles. In OpenShift, we have Security Context Constraints, or SCCs, which provide out-of-the-box standard security defaults and support for complex ordering, and they are enabled by default. In Kubernetes, we have Pod Security Policies that can be used for controlling this as well, but they have a few drawbacks: there are no defaults, you need to define them; the default ordering is not that good; and they are still not enabled by default in Kubernetes, so you need to enable them on the API server. They were deprecated in Kubernetes 1.21, and the community is working towards Pod Security Admission, which will be the PSP replacement. You have the Kubernetes Enhancement Proposal linked on the slides. And with that, the talk is over. You have more resources on the links on the screen, and you have access to the live demos we have done on the last link on this slide. If you have any questions, feel free to send them over to marioagreja.com. Thank you again for joining us today; I hope you enjoy the rest of DevConf.US. Take care. Bye.