Basically, this session aims to teach the basics of the container security layers, which are cgroups, namespaces, AppArmor, seccomp, and capabilities. This won't be a super technical talk, but I do want to speak about how they work, and about those strange cases when you need to run your applications, your containers, with additional privileges. The general solution for this, the common one, is to run the container as privileged. That's really common, but it's not a secure practice, because running containers as privileged can pose a big risk. So we're going to talk about that, and about what we can do to avoid such situations while still granting the additional privileges to our container. So let's start with an introduction. My name is Aviv Sasson. As you can see here, I'm a proud dog owner. I'm a principal security researcher with Prisma Cloud and Unit 42 at Palo Alto Networks. My job is mainly to support the security efforts for the product, and in addition I'm part of our initiative to contribute to the security of the cloud ecosystem. We do this by looking for vulnerabilities and zero-days in the ecosystem; we report them and make sure they are patched, and this way customers stay happy and secure. So let's go over the agenda for today. I know this is ContainerCon, and I guess most of you have decent knowledge about containers, but I still want everyone to be aligned, so we'll do a really quick brief about containers. Then we'll continue to container limitations, what you can and cannot run inside your container. Then we'll get to one solution, running a privileged container, and we'll talk about why this is not the best solution in most cases, for example in production environments. And then we'll talk about the better solution, what we should actually do.
And the better solution is to configure strict security policies for all of the layers we're going to talk about today: capabilities, seccomp, AppArmor, cgroups, and namespaces. This way we can grant our container only the additional privileges it needs, instead of just running it as privileged and giving it every permission possible. After that we'll get to the takeaways, which will be more practical. So let's start with containers. You all know containers; they're great. Basically, they're an operating-system-level virtualization technology used to package applications and their dependencies. You can put them in images, they're super portable, you can use Kubernetes for scalability, and they're really efficient: unlike virtual machines, which each have their own kernel, containers share the kernel of the host. There's only one kernel per host for all of the containers, so there's no overhead of many kernels like with virtual machines. Containers also provide an additional layer of security. Basically, containers provide isolation. I know you see all of the words here, like namespaces and cgroups; we'll dive deep into them later in the slides, so this is just a brief overview. Isolation is done by namespaces: each container, as you know, is not aware of other containers on the host. So in case a container is compromised, the attacker inside it won't be able to expand their grip to the host or to other containers. Other than isolation, containers have limited resources, and when I say resources I mean CPU, memory, and things like that. Those limitations are enforced by cgroups. And the third point is that containers run with low privileges, because if we ran them with high privileges, a compromised container could escape and compromise all of the containers on the same host, and the host itself.
So the mechanisms that enforce those low privileges are capabilities, seccomp, and AppArmor. Now you might be asking yourself, so do containers have limited functionality? And the answer is yes, in general, under the default policy. For example, if you want to mount filesystems, or load and unload kernel modules, you'll get blocked, because containers don't have those privileges. If you want to load or unload eBPF programs, or do some iptables or syslog operations, you won't be able to do so. And you might be saying to yourself, all right, containers really aren't meant to do such heavy-lifting operations, and that's actually true. But there are some cases when your containers, your applications, will need those privileges. For example, if you are running a monitoring or security tool, that application needs a lot of privileges in order to inspect the environment, and if you try to run it, you'll get blocked by default. If you want to run advanced networking tools, another example, again you'll be blocked. Direct hardware access? Not possible. And the strange case of Docker-in-Docker, which is running a Docker daemon inside a Docker container, you also won't be able to do. So let's talk about what we can do in such cases, when we need our container to have additional privileges. The first and simplest solution is to run it in privileged mode. For those of you who don't know what privileged mode is, it's basically a mode to run the container in that disables all of the security layers. You basically get full privileges, the same privileges as root on the host machine. There's still isolation, because you are inside a container, but that isolation can be broken fairly easily. And the main risk of a privileged container is that it will break out; let me illustrate how to do so, and the fact that it is really easy.
So for one example, you can just mount the host filesystem; this is possible if you run a privileged container. After that, you can add a user to the host and then connect over SSH with that user and password. You could also add an SSH key and connect with it, or you can just create a cron job that will be executed as root on the host, so the attackers will be able to execute any code they want. In addition, if the attackers want to go ballistic on this one and would rather run in the context of the kernel instead of as root, they can just load a kernel module and execute their code there. Now, after the attackers break out of the container, they'll be able to access other containers and the host machine. And from there, depending on other policies from the orchestrator or the DevOps configuration, the attackers might be able to do other things, for example take over the whole environment, interact with other nodes, and manipulate data. So we do want to keep this boundary around containers, in order to keep the risk to a minimum. Now, the point of this talk is not to run privileged containers, but to run your containers with just the additional privileges they need. In order to do so, because sometimes we have to, the best practice is to configure strict security policies that grant the minimum required privileges. I know this is kind of vague, and that's why we're going to dive deeper into what it actually means. The first thing we need to understand is what those container security layers actually are: what they do, how they work, what the policies look like, and how we can modify them. After we gain this knowledge, we can start working, and the first step is to determine the minimum required privileges for our application.
So for example, if we have an advanced networking tool, we need to know that it needs, say, the CAP_NET_ADMIN capability. So we need to profile it in order to know what we want to grant, and then we just need to configure that and apply it to our container. This might sound easy; it's not that easy, and that's why I'm here to teach you how to do it. So let's start with the container security layers and what they are, and first with capabilities. In the past, when you ran a process, it could run in one of two modes. The first is privileged, which means it runs in the context of root, and the second is unprivileged. This posed kind of a risk, because if I'm running a process that only needs to do some raw-socket operations, there's no reason it should be able to mount, or load kernel modules, or anything else; it just needs to do raw-socket operations. In the past, I couldn't grant my process that one permission without granting everything, so if it got compromised, I was at big risk. That's why today we have capabilities, a Linux kernel feature that takes the privileged mode, and when I say privileged, again, I mean running as root, and divides all of the privileges of that mode into small bite-sized pieces. For example, we have CAP_SYS_CHROOT, and in order to allow your process to chroot, you don't need to run it as root; you can just give it CAP_SYS_CHROOT. If it gets compromised, the only extra thing that process will be able to do is chroot. So basically, it's fine-grained privilege management, which is great. And of course, Docker and container engines in general use this mechanism to enforce security. As you can see here, these are the default capabilities for Docker. There are a lot of capabilities here, but there are many other capabilities that are not, which Docker removed from the default policy.
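To make the fine-grained model a bit more concrete, here is a minimal Python sketch of my own (not part of Docker or the talk's slides) that decodes a capability bitmask, like the CapEff value you can read from /proc/self/status inside a container, into capability names. The name table is a subset of linux/capability.h, and the hex value below is the mask that corresponds to Docker's default capability set.

```python
# Illustrative sketch: decode a Linux capability bitmask into names.
# The table maps capability bit numbers (a subset of linux/capability.h)
# to their names; bits not listed here are simply skipped.
CAP_NAMES = {
    0: "CAP_CHOWN", 1: "CAP_DAC_OVERRIDE", 3: "CAP_FOWNER",
    4: "CAP_FSETID", 5: "CAP_KILL", 6: "CAP_SETGID", 7: "CAP_SETUID",
    8: "CAP_SETPCAP", 10: "CAP_NET_BIND_SERVICE", 12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW", 16: "CAP_SYS_MODULE", 18: "CAP_SYS_CHROOT",
    21: "CAP_SYS_ADMIN", 27: "CAP_MKNOD", 29: "CAP_AUDIT_WRITE",
    31: "CAP_SETFCAP",
}

def decode_caps(mask: int) -> list:
    """Return the names of all known capability bits set in `mask`."""
    return [name for bit, name in sorted(CAP_NAMES.items()) if mask & (1 << bit)]

# 0x00000000a80425fb is the CapEff mask of a default Docker container;
# note that CAP_SYS_ADMIN, CAP_NET_ADMIN, and CAP_SYS_MODULE are absent.
print(decode_caps(0x00000000A80425FB))
```

Running this inside a container started with `--cap-add` or `--cap-drop` flags and comparing the output is a quick way to sanity-check which capabilities your process actually holds.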
So let's talk about what we can do if we want to add a capability to our container. This first option is kind of easy: Docker lets you just use the --cap-add flag. For example, you can run docker run --cap-add NET_ADMIN and give our container the additional capability that we require. This is great, and it works perfectly fine, but if you want to go ballistic on this one and enforce maximum security, then you should profile your application and understand what its minimum required capabilities are. By minimum required I mean a subset of the default set, so less than the default one, plus whatever you actually need, for example CAP_NET_ADMIN, and then grant exactly that to your container. This way, you maximize your security. One example of how to do so is using the tool aa-genprof, which is part of the AppArmor suite. This tool is used for profiling applications; it does it dynamically, and it yields AppArmor profiles that include capabilities. For example, here you can see that I profiled ping, and the capability that I got from aa-genprof is CAP_NET_RAW. Now, I just want to emphasize that if you're going to profile your application, you need to do it the right way: you need to trigger all of the functionality of your application, your container, so that aa-genprof will be able to collect all of the capabilities that you need. For example, if we have a web server with an endpoint that mounts something, and we don't trigger that endpoint, aa-genprof won't know that your application needs CAP_SYS_ADMIN, which is used for mounting, and when you deploy your application with that policy, everything will break the moment someone accesses that endpoint. So this is something I wanted to emphasize. If you want to use this methodology, what you should do is run docker run --cap-drop ALL, which drops all of the capabilities.
And then you use --cap-add, which adds back the capabilities that you need, and this way you run with the minimum required capabilities. All right, second layer: let's go over to seccomp. Seccomp is secure computing mode; it's also a Linux kernel feature, and basically what it does is block the execution of syscalls. When you execute a syscall, before the execution itself, it goes through the seccomp mechanism in the kernel, and seccomp decides whether the process is allowed to execute that syscall or not. This is a great security layer, because other than blocking some syscalls that you should not be able to use, it also helps minimize the attack surface in other ways. For example, if there are kernel vulnerabilities in some syscalls that you're not aware of, it's possible that seccomp blocks you from using those syscalls by default, simply because you don't need them, and then attackers won't be able to exploit them. Another point I want to emphasize is that it disables by default the unshare and setns syscalls. We're not going to dive deep into them, but they allow you to change your process's namespaces, and if you move into another user namespace, you can gain all of the capabilities you want. In this way you could bypass the capabilities layer; but if seccomp disables unshare and setns, then you won't be able to do so. So let's go over to Docker's default seccomp profile. As you can see, it's a whitelist of syscalls; here you can see a part of the profile. Docker says they disabled around 44 syscalls out of more than 300. If you want to go deeper on this one, you can check the link down below and see the default profile. So that's how it looks. And you might be saying to yourself: all right, but I ran some containers, I used --cap-add, for example with SYS_CHROOT, and seccomp didn't block me.
I was able to chroot, and the reason for that is that Docker's default seccomp profile is more sophisticated than this. It has sections saying, for example, in this one: if the container starts with CAP_SYS_CHROOT, then allow the chroot syscall. So basically, if you run your container in Docker and use --cap-add, the seccomp profile will also permit the syscalls that are relevant to that capability, so the capability layer and the seccomp layer stay in sync. The default configuration is great, it's really good, but there are some cases where it can be a little too permissive. For example, if we start our container with CAP_SYS_ADMIN, we'll be able to execute all of the syscalls shown here, and there are more that are not in this presentation because I didn't have enough space on the slide. But if I only want to be able to run sethostname, and I grant CAP_SYS_ADMIN for that, then other than sethostname I'll also be able to do anything else in that list, for example setns, and we already talked about why that's not a good one. So this is kind of permissive and can lead to some security risks. Let's talk about what we can do, and what the recommendation is. The recommendation is that if you have an application that needs additional privileges, you should give it the minimum required access. The easy way is to take Docker's default seccomp profile and remove the excessive syscalls. For example, if we want to use sethostname, we should take the default profile and remove all of those other syscalls; then when we run with CAP_SYS_ADMIN, we'll have the capability, but we will only be able to run sethostname and not stuff like setns. And again, if you want to go ballistic on this one for maximum security, you should profile your application and create a custom profile.
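As a sketch of that trimming step, here's a small Python snippet of my own. The profile below is a toy fragment merely shaped like Docker's default seccomp JSON (a capability-gated allow rule), not the real file; the idea is to drop every syscall name that isn't on our allowlist, so that granting CAP_SYS_ADMIN only unlocks sethostname.

```python
import json

# Toy fragment shaped like a rule from Docker's default seccomp profile:
# a batch of syscalls allowed only when the container has CAP_SYS_ADMIN.
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            "names": ["sethostname", "setns", "mount", "umount2"],
            "action": "SCMP_ACT_ALLOW",
            "includes": {"caps": ["CAP_SYS_ADMIN"]},
        }
    ],
}

def trim(profile: dict, keep: set) -> dict:
    """Keep only allowlisted syscall names; drop rules left empty."""
    out = json.loads(json.dumps(profile))  # cheap deep copy
    for rule in out["syscalls"]:
        rule["names"] = [n for n in rule["names"] if n in keep]
    out["syscalls"] = [r for r in out["syscalls"] if r["names"]]
    return out

trimmed = trim(profile, keep={"sethostname"})
print(json.dumps(trimmed, indent=2))
```

You would then save the trimmed JSON and pass it to the container at startup with Docker's --security-opt seccomp flag.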
One example of profiling is just running strace -c. In this example, you can see that I ran strace -c ls, and I got all of the syscalls that my application needed, in this case ls. Again, I want to emphasize: if you profile your application, you need to make sure you trigger all of its functionality, so that the policy includes everything you need and not just part of it; then when you put it in production, it will keep operating and not break. If you want to use your custom seccomp profile, you can do it by executing docker run --security-opt seccomp= and then the path to your profile. All right, let's go to the third layer, which is AppArmor. AppArmor is a Linux kernel security module. It's a great module, and it allows you to restrict operations in a lot of ways: for example, it can restrict capabilities, file access, networking, and much, much more. In essence, it is a whitelist profile, but if you take a look at the default Docker AppArmor profile, you'll see a blacklist and not a whitelist. The reason is that at the beginning of that profile, Docker allows a lot of things, and after that it excludes the more dangerous stuff. So you'll be able to execute a lot of functionality, but not the dangerous parts. If you want to go check out the profile, you can go to the Moby project, in the link down below, and see the full AppArmor profile that comes by default. Now let's talk about the recommendation, which is to configure an AppArmor profile for each container that needs additional privileges, with only the required access. I know I've said it many times, the minimum required access, and that's because there has been so much talk about privileged containers, and still, to this day, people run their containers as privileged.
And this poses so many risks, as we talked about before; that's why I keep emphasizing the point. So the easy way, the recommendation for how to do this, is to take the default Docker AppArmor profile and just add the access your application requires; that's the easier route. And if you want to go full ballistic on this one, you can profile your application and run it with the policy that you generated. This can be done with aa-genprof, which we talked about in the capabilities section; this is actually the purpose of aa-genprof. It's a great tool: it just yields an AppArmor policy for your application. And again, you need to trigger everything in order to get the full profile of your application. Once you have it, you can use it by running docker run --security-opt apparmor= and then your profile name. All right, fourth layer: let's talk about cgroups. Cgroups are also a Linux kernel feature. They allow you to allocate, limit, and prioritize resources, such as CPU usage, memory, networking, GPU, and so on. Basically, they help you ensure that each container gets its fair share of resources, and security-wise, they ensure that one container cannot exhaust the whole system. For example, if an attacker compromised one of my containers and deployed a crypto miner there that takes all of the CPU and memory, then other containers wouldn't be able to operate, and neither would other critical processes on the host machine. That's what cgroups are for, security-wise. In Docker, there are no CPU or memory cgroup constraints by default. I know some of you may have run your containers and gotten out-of-memory errors, and I just want to emphasize that this is not from cgroups; it actually comes from the kernel saying it cannot run some vital processes, so it kills your container to free some memory. OK, let's go to the recommendation for cgroups, and this one is really straightforward.
The recommendation is to set limits according to the required usage; it depends on your application, and you know what your application does. You can do it using flags provided by Docker; if you want more information, check the link down below. Here are two examples: if you want to limit your container's memory to 300 megabytes, you can do it with --memory=300m, and if you want to limit it to only two CPUs, you can use the --cpus flag, for example --cpus=2. OK, last one, namespaces, and my favorite. Namespaces are also a Linux kernel feature, one that provides processes with their own view of the system. Basically, namespaces are the essence, the feature that enables the isolation in containers. As you can see, there are a lot of namespaces, so let me explain what a namespace actually is. Let's take the net namespace, which is a namespace for network interfaces. It means that if I have a container in net namespace A, and another container in net namespace B, then one can have one network interface and the other can have another, but when each looks inside its own environment, neither container can see the other's network interface. That's great isolation. The PID namespace is for processes: if I have one container here and one container in another PID namespace, then when they run things like ps, they won't see the processes of other containers. That's the basis for the isolation. Other than those two, we have the IPC namespace, which is for inter-process communication, things like semaphores; the MNT namespace, which is for mounts; the UTS namespace, which is for the hostname and domain name; and the user namespace, which is for users. Now let's talk about Docker's default namespaces.
So when you start a container in Docker, the container engine will create five namespaces, PID, NET, IPC, MNT, and UTS, for your container, and then move the container into those namespaces. So each container actually runs in its own five namespaces by default, and that's why containers can't see each other. Now, you might notice that one namespace is missing here, which is the user namespace. The reason the user namespace is not used by default in Docker is that this one is kind of tricky; it's a more advanced feature, and it can cause complications when running containers. When you run inside another user namespace, you see yourself as root, but you aren't actually root. For example, if I have a mount in my container with some files that need root permission, and I try to read them, then although I am root in my namespace, I won't be able to access those files, because I'm not actually root. That's one of the issues there. Now let's talk about the recommendation. Docker did a great job here, and the recommendation is simply to run your containers with the default setup; it works great. If you think you need to run your container sharing some namespaces with the host, I can't think of a good reason to do it, and I would recommend you just avoid it; don't do it, use the default setup. All right, so let's go to the even better practice, which, as you might imagine, is to run with the user namespace. If you want to know how, check the link down below. And before you do, make sure everything works before putting it in production, because while it is a great feature, it can cause issues, and environments can go down over this one. So let's get to the takeaways, the things I want you to take from this talk.
So the first one is that some containers need more privileges; it happens, they need it, and that's OK. But if you run them as privileged, there is a big risk, because if they get compromised, attackers will be able to escape and compromise other parts of your environment. So in order to grant your container the relevant privileges without running it as a privileged container, you should configure strict security policies for all of the layers we talked about, policies that grant the minimum required privileges. This way, you reduce the risk in case the container is compromised. Now, some more practical advice for developers who publish their applications as containers: the advice is to profile your application and add seccomp and AppArmor profiles to your repo, so that your customers, your clients, will be able to use those profiles instead of running your container as privileged. If your application runs on Docker, you should provide a fully detailed command line that uses all of the policies we talked about, including the seccomp and AppArmor profiles. And if you're using Kubernetes, you should modify your YAML to enforce all of those policies and add it to your repo, so that customers and clients can use those configurations in a simple and easy way. And that's about it. I want to thank you all for coming to this session. You can find me at this email and on LinkedIn, and if you have any questions, I'll be happy to reply. Thanks. Thank you, everyone.