My name is Urvashi Mohnani and I am a Senior Software Engineer on the OpenShift node team. Hey, I'm Sally. I'm a Software Engineer on the OpenShift workloads team. For the past few years, Urvashi and I have been getting together and giving container security talks. Every few months we'll get together with Dan Walsh and talk about what's new. Dan is always presenting on security, and when there are enough cool features to talk about, we put them together and share them with as many people as we can. Okay, so before we dive into container security, let's talk about Goldilocks and the three bears. Just to recap the story: it's about a girl, Goldilocks, who ventures into the forest and finds an empty house. Adventurous as she is, she enters the house and sees that there are three of everything: three chairs, three bowls of porridge, three beds, and so on. She sits on the first chair and realizes it's not comfortable. It's too hard. That was Papa Bear's chair. She moves to the second chair and realizes it's the opposite: way too soft, still not comfortable. That was Mama Bear's chair. She then sits on the third chair and finds that it's just right, and that was Baby Bear's chair. As we progress through the story, we see that Goldilocks always leans towards the just-right option for the things in the house: the bowl of porridge that's neither too hot nor too cold, the bed that's neither too hard nor too soft, and so on. When it comes to container security, we can look at it with a similar lens. The first level is where we have all our security features enabled, but we run into challenges when trying to run our applications. This is the Papa Bear model of container security: too hard. And oftentimes when we hit such snags, we overcompensate and disable all our security features completely.
And this time we're too soft, running in a very insecure fashion. This is the Mama Bear model of container security: too soft. And then there is that sweet spot where our containers run in a fairly secure fashion without compromising the usability of our applications. This is the Baby Bear model, or the Goldilocks model, of container security: just right. The truth is, no one ever really turns up security. I'm sure a lot of you here have done things like adding capabilities when you're not sure exactly which ones you need, or running your containers with the --privileged flag enabled. And the most famous one is running setenforce 0, where we end up disabling SELinux completely. So in today's talk, we're going to show you technologies that we are working on to help end users move from a Mama Bear or Baby Bear security model towards a Papa Bear security model without compromising the usability of our applications. This way, we don't instinctively end up doing things like running setenforce 0 and making Dan Walsh weep. He is the Papa Bear of container security. Please don't make Dan cry. That's what our talk is about. But first, let's talk about container images. When you run a container, there are three inputs to the system. First, the OCI image format. When a developer crafts an image, they include things like the entry point, the user, volumes, whatever goes in a Dockerfile. These get translated into a JSON file, the OCI image spec. Next, there's the container engine. Much of the security of a container comes from hard-coded values in the container engine, like which namespaces get spun up, the cgroups assigned, seccomp rules, things like that. And the third input is from the user. Users can override and set aspects of running containers by passing flags to the run command, like volumes, capabilities, or port forwarding. Those are the three inputs.
And then the container engine takes its own hard-coded defaults, the user inputs, and the information from the image, and creates an OCI runtime spec. That's a JSON file that the runtime then uses to launch the container. The runtime is either runc or crun or Kata Containers. So that's how it all works. Now, the middle part, the container engine, like I mentioned, has mostly been hard-coded. But at Red Hat, we've worked to split the monolithic engine that was Docker, with the Docker daemon, into four different functions with distinct tools. CRI-O is for running locked-down containers in production; CRI-O is meant to run containers in Kubernetes. Podman is for running containers locally while developing and experimenting; the Podman CLI is pretty much one-to-one with the Docker CLI. Skopeo is a tool just for moving images between registries. You can move an image from one remote registry to another without ever pulling that image down to your system. And then Buildah. Buildah is for building container images. The idea is that by setting security separately for each of these tools, you end up with the highest level of security rather than the least common denominator that you had with Docker. And also, without the Docker daemon, these tools can run rootless. Red Hat is constantly experimenting with ways to run containers most securely. For example, in OpenShift, we now run image builds inside of a container with Buildah, and with Buildah there's no leaking of information from a Docker socket. This has made OpenShift image builds more secure. We also lead the way in using user namespaces for running containers. We'll talk more about that later. But speaking of rootless, the most secure way to run any container is by setting the user inside the container to be non-root. This is the default in OpenShift. Regular users can't run as root inside their application containers.
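As a quick sketch of the Skopeo workflow mentioned above (the registry hostnames here are hypothetical examples, not from the talk):

```shell
# Copy an image directly between two registries without
# ever pulling it into local container storage.
skopeo copy \
    docker://registry.example.com/myapp:v1 \
    docker://mirror.example.org/myapp:v1

# Skopeo can also inspect a remote image's metadata
# without downloading its layers:
skopeo inspect docker://registry.fedoraproject.org/fedora:latest
```

Because Skopeo talks registry-to-registry, a CI system can promote images between environments without local storage or a daemon socket.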
And in almost every case, if you think you need to be root inside your container, you're probably doing something wrong. For example, say you're running a web service and you need to bind to port 80. You can instead use port forwarding from the host and run as non-root in the container. Another common reason for running as root is installing packages in your container. You should never have a package manager inside your container. You should install packages at image build time, like within a multi-stage Dockerfile or with Buildah. These encourage minimal images that don't require root. However, there are containers that do need privilege: system containers, special containers whose purpose is to manage things on your host. There are such containers in OpenShift and in Kubernetes. For these, there are many ways to secure them. You can use Linux capabilities, seccomp filters, SELinux, and user namespaces. All of these can work together to lock down even your most vulnerable containers, and this is what we're here to talk about today. We're going to try to make it easy for you to run securely, because we all know that when it comes to security, if it's easier to disable a feature than to configure it, chances are it's going to get disabled. All right. So one of the ways we currently enforce security is by limiting the power of root using capabilities. Capabilities are chunks of root power: each capability grants just the privilege needed to carry out certain actions. For example, if I run a container and drop all the capabilities in it, the root user inside the container won't be able to carry out privileged actions such as mounting a file system or changing the ownership of files. Currently, there are 37 different Linux capabilities, of which 14 are enabled by default when we run our container workloads.
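A sketch of dropping capabilities as just described (the image and the command run inside it are examples):

```shell
# Run a container with every Linux capability dropped.
# Even though the process is UID 0 inside the container,
# privileged actions like chown(2) or mount(2) will fail.
podman run --rm --cap-drop=ALL fedora \
    sh -c 'chown nobody /etc/hostname || echo "chown blocked"'

# Inspect the capability bounding set of a container process:
podman run --rm fedora sh -c 'grep Cap /proc/self/status'
```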
These 14 were originally defined by the upstream Docker project back around 2013. But do we really know what these 14 capabilities are? The answer is no, but here is the list of 14. After further examining what exactly these capabilities do, we found that a few of them are not entirely critical to running your container workloads. The first one is AUDIT_WRITE, which gives you the ability to write certain information into the auditing subsystem. Back in the day when containers were starting off, people thought that the only real way of running jobs in your container was to SSH into it, and for that to be possible, you needed the SSH daemon running inside the container. The SSH daemon won't run unless it has the AUDIT_WRITE capability enabled. Now we know that's not really necessary: we have tools such as podman exec or docker exec that let us do exactly that, and there are really no other applications or tools that need the SSH daemon running inside the container. So why have this enabled by default for all our container workloads? The second one is MKNOD, which gives you the ability to create device nodes on the system. This can be pretty dangerous, as it can be used to attack the kernel. The main reason it is enabled by default is that certain Ubuntu packages need to make device nodes when they're being installed. But if you have a different tool to build your container images, such as Buildah, and a different tool to run your containers in a production cluster, such as CRI-O, you can run your containers more securely in production by disabling this capability by default while keeping it enabled in your build tool, without compromising functionality. The next one is SYS_CHROOT. This just gives you the ability to chroot inside a container. No real application uses this, so we're not sure why it needs to be enabled by default for all the containers you run. The fourth one is NET_RAW.
NET_RAW gives you the ability to create any type of IP packet. This can also be dangerous, because it can be used to craft IP packets on the wire in a way that tricks the VPN into exposing traffic to the external network, so it can be used to break out of VPN stacks. And the main reason NET_RAW is enabled by default is so that users have the ability to ping inside a container. So we see that out of the 14, there are at least four that are not needed for every container we run. This ties back to what I said earlier: no one really turns up security willingly, because no one has gone back to see why these capabilities were enabled in the first place about seven years ago. All right, we have some demos for you. Let's look at a demo where I drop the NET_RAW capability without compromising the ability to ping in my container. Here I am running a basic image that can ping, because my NET_RAW capability is still enabled, and ping works as expected. Now, using the --cap-drop flag, I'm going to drop the NET_RAW capability. And as expected, ping no longer works in there. If we want to drop this capability but still have ping, there is a way around it: enabling a sysctl. What the sysctl says is that if your group ID falls in the range of zero to one thousand, you get the ability to ping inside your container. So let's try that out. I'm running a container here, I have dropped my NET_RAW capability, and I have enabled that sysctl. And as you can see, ping works as expected. So as you saw, we can further reduce the defaults to a list of about 10 capabilities for all our container workloads. But as an end user, it can be confusing to know exactly which capabilities are needed to run certain containers. The image developers are the people who best know what capabilities are needed to run the containers they are building.
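The ping demo just described can be sketched roughly like this (the image and target address are examples):

```shell
# With NET_RAW dropped, ping normally fails with
# a permission error:
podman run --rm --cap-drop=NET_RAW fedora ping -c1 8.8.8.8

# Allow unprivileged ICMP echo for group IDs 0-1000
# via a sysctl, so ping works even without NET_RAW:
podman run --rm --cap-drop=NET_RAW \
    --sysctl net.ipv4.ping_group_range="0 1000" \
    fedora ping -c1 8.8.8.8
```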
So if I am an image developer and I know that my container only needs the SETUID and SETGID capabilities to function as expected, I can set this as a label or an annotation in the image when I'm building it. And when my container engine, such as Podman, launches this container, it will know to start it with only the SETUID and SETGID capabilities, and not the default 14. We have a demo for this as well. Here is a Dockerfile, and I've built it. As you can see, I set the label saying that my container only needs the SETUID and SETGID capabilities. So when I run this container and use the podman top command to see what capabilities I have, you can see that it is only SETGID and SETUID. Now let's run a container image that doesn't have such a specification, and we see that it runs with all 14 defaults. Now, what happens when an image developer says that the image needs capabilities that fall outside of the list of 14? Here I have a Dockerfile where I'm saying that I want my image to run with the NET_ADMIN and SYS_ADMIN capabilities. Podman will run your container, but it will log an error saying that you're not allowed these capabilities, since they don't fall in the default 14, and it will run your container with just the default 14. So you will not have NET_ADMIN and SYS_ADMIN enabled, but you will have the default 14. However, if you run the same image and specify these capabilities using the --cap-add flag, you will see that Podman actually launches the container with the two capabilities that are outside the default 14. The reason we do it this way is that we don't want users to end up pulling random images off the internet and running them without realizing what capabilities they have enabled. If a capability falls outside the 14, it's usually an enhanced capability that container workloads don't normally need, so why are you running it like that?
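To recap the image-label demo above, a sketch of what such a Dockerfile might look like; the `io.containers.capabilities` label is the one Podman honors, and the base image and entrypoint are just examples:

```dockerfile
FROM registry.fedoraproject.org/fedora:latest

# Tell the container engine this image only needs
# SETUID and SETGID, not the full default capability set.
LABEL io.containers.capabilities=SETUID,SETGID

COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```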
But if, as an end user, you really believe that you need those capabilities, then Podman will not stop you. So this was just a way of showing how we can further lock down our containers and move more towards the Papa Bear model by letting image developers restrict which capabilities are needed to run their containers. Great. So we showed how easy it is to set Linux capabilities. How about limiting syscalls? Processes communicate with the kernel through syscalls, so one way to attack a host is to gain access to the kernel through syscalls. Just by turning on seccomp in a kernel, you go from having eight or nine hundred syscalls down to about 450. Seccomp is a kernel feature; it was added in 2005. So just by turning that on, you're already better off. When running containers, just about everybody runs with a default seccomp filter. This was developed upstream by Docker and by Jessie Frazelle, and it whitelists about 300 syscalls. You can find it on your system under /usr/share/containers. That's better, but can we do even better? Recently, Aqua Security did a study where they looked at the containers out there and found that most only require 40 to 70 syscalls, and at Red Hat we've found the same to be true. The problem is that it's really difficult to figure out which syscalls a container needs. They're not the same 40 to 70; each container is different. So that was the problem we looked at last summer. We had an intern on the runtimes team, Divyansh, through Google Summer of Code, and he worked with Valentin Rothberg and Dan Walsh to come up with a tool to do just that: it generates a seccomp profile based on a container that you give it. The way it works is as an OCI hook, the OCI seccomp BPF hook. An OCI hook is a helper program that gets launched by the runtime just after a container is created but before it starts. It hooks into the kernel through BPF and watches all the syscalls on your system.
It records the syscalls that happen in a given container's PID namespace, and when that container exits, a seccomp profile is generated with the whitelist. A seccomp profile is just a JSON file: a whitelist of syscalls, all the recorded syscalls that the container used. So we'll show this running in a demo; it's pretty cool. But the idea is that an application developer can run this hook through their entire CI/CD pipeline. It can test every code path, use case, and edge case. You can run it in a test or a production environment for a few months and just continuously watch the seccomp profile until it stabilizes and no new syscalls are being added. At that point, the developer can be pretty confident that this is the profile that should be used with this container. So let's see how it works in this demo. Here's just a look at the hook itself. You can see its configuration in the oci/hooks.d directory; that's where it points at the binary. For any container that's launched with the io.containers.trace-syscall annotation, a seccomp profile will be generated, and the user passes a path to where they want that file written. So let's try it. We'll run a simple Fedora image, pass that annotation to tell the hook to start, and just run ls. Now we can look at the profile that was required just to run ls in a Fedora container. You see there are around 30 or 40 syscalls required just to run ls. So now we can turn around and run that container again using the generated seccomp profile, and you can see it works just fine. But behind the scenes, that container is completely locked down: only that short list of 30 or 40 syscalls is allowed, out of the 300. Great. But what happens if we need to run ls -l? Let's check it out. We'll use the same profile to run ls -l, and it errors out, because apparently -l requires more syscalls. You can see that in the audit log, hopefully: some syscalls it was trying to use when it stopped. There are a few listed, but there are probably more, because that's just what the audit log caught. So let's go back and run the container again with the annotation to catch any new syscalls that are required by ls -l. Here we have a new file generated, and we can run with that new file, run ls -l, and hopefully it'll work. Now let's look at the difference between the two seccomp profiles, one for ls and one for ls -l. I found this interesting. The plus signs are the added syscalls that you need to run -l. What I found interesting is connect. I was surprised that you needed connect just to run ls -l. What socket is being connected to here? Well, when you run ls -l, if you look at the output, your UID is mapped to your username. So for file ownership, instead of saying zero, it says root, or instead of saying 1000, it says somalley. That lookup uses nsswitch in the background, nsswitch and the sssd daemon, and that's where the connect syscall comes into play. Just an interesting tidbit. So that's how the hook works. You can download it and check it out for yourself. But to implement this, the runtimes team has been working on a plan, and what they've come up with is that an image developer should ship the seccomp profile within the image and include a label on the image that tells the container engine: this image has a profile, use it. The reason is, again, that application developers know best how their container should run, so it makes sense for the developer to include this with their image. And just like with capabilities, say the profile has a syscall that's outside the default: you might be met with a decision. The container engine might error out and tell you that it won't run this image because this syscall is outside the default, or you can tell the engine, hey, I trust this image, run it anyway.
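The hook workflow from the demo can be sketched like this; the output path is an example, and the annotation name and its `of:` (output file) syntax are the ones the oci-seccomp-bpf-hook project uses:

```shell
# 1. Trace a container run and write the generated seccomp
#    profile to /tmp/ls.json (requires the oci-seccomp-bpf-hook
#    package, and root, since the hook loads a BPF program):
sudo podman run --rm \
    --annotation io.containers.trace-syscall="of:/tmp/ls.json" \
    fedora ls /

# 2. Re-run the same workload confined to only the
#    recorded syscalls:
sudo podman run --rm \
    --security-opt seccomp=/tmp/ls.json \
    fedora ls /
```

Repeating step 1 across a full test suite, and diffing the generated profiles until they stabilize, is the CI/CD pattern described above.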
So those are some things still being worked out. But the hook itself is on GitHub. Please download it and try it out; it works great. And that's it for the hook. Yep. So another way we enforce security today is with SELinux, which is a tool we all know and love. The way SELinux works is that it's a security model based on type enforcement, where files and processes have different types, and access is restricted based on which types you are allowed to access. In the past seven years, almost every container CVE that has occurred has been a file system breakout, and guess what? SELinux has blocked each and every one of them. So SELinux is the best tool to protect your file system from container escapes. It is sort of like the Baby Bear, or Goldilocks, model where it's in the middle: it gives you access to everything you need within your container, so you get access to all your capabilities and your network and so on, but only inside the container. When you do try to break out of the container, you are blocked completely on the host. The problem with SELinux occurs when we use volumes. When we mount volumes into a container, we're essentially taking a part of the OS and exposing it to the container. Since container processes are only allowed to read files that have the container_file_t label, a lot of the system directories on the host are not accessible, because they don't have that label. A way to make this work is to use the lowercase :z or uppercase :Z mount option when you mount into your container. What this does is relabel the content on your host so your container processes can access it. Now, this all works fine if the directory that you're mounting into your container is going to be solely owned and used by your container. But if you have other applications running on your host that need to access the same directory, they will end up breaking, because they won't recognize the new label that this content has been relabeled with.
So the only way to make this work would then be to disable SELinux confinement for your container, causing you to run in a very insecure fashion and pushing you all the way towards the Mama Bear model of security. The good news is that we have a tool called Udica that helps us move away from Mama Bear when using volumes. Udica is a tool that creates custom SELinux policies based on the configuration of your container. The way this works is that you run a container with the volumes that you want mounted in and let Udica inspect it. Udica will look through your container configuration, analyze it, see what exactly it's trying to do, and then create a custom SELinux policy for that container. We have a quick demo of how this works. Here I am running a container, and I'm mounting in my home directory as read-only and the /var/spool directory as read-write. These are system directories which do not have the right label for container processes to read them. So as expected, when I try to read the home directory in my container, I get permission denied. Same thing with /var/spool. And since I mounted that as read-write, if I try to write to it, I can't, because the content hasn't been relabeled. So now let's use Udica to generate a custom policy for this container. I run my container with the volumes that I want mounted in and let Udica inspect it and create an SELinux policy for me. When you run this, what Udica does is basically create a new label type for your container process, and this is what gives you access to the content on your host. And then it's very simple: it tells you exactly what you need to run to load this new custom SELinux policy. As you can see down here, I have run that command, and it takes a few seconds to load.
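The Udica steps above can be sketched roughly like this; the container name, policy name, and volume paths are examples, and the template list for semodule is abbreviated with a glob:

```shell
# 1. Run the container with the volumes you need:
podman run -d --name web \
    -v "$HOME":/data:ro \
    -v /var/spool:/spool:rw \
    fedora sleep 1000

# 2. Feed the container's configuration to udica, which
#    generates a custom SELinux policy module (CIL file):
podman inspect web | sudo udica my_container

# 3. Load the policy (udica prints the exact command),
#    then run the container with the new process label:
sudo semodule -i my_container.cil /usr/share/udica/templates/*.cil
podman run -d --security-opt label=type:my_container.process \
    -v "$HOME":/data:ro -v /var/spool:/spool:rw \
    fedora sleep 1000
```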
The great thing about Udica is that you don't need to be an SELinux policy expert to figure out exactly what customizations are needed to get this to work without relabeling the content; it does everything automatically for you. So, my new policy has loaded, and I'm going to run the container again, but this time I'm going to set the new label that I got from Udica. And there you go, my container is running. Now if we look at the process running on our host machine, we see that it is running with this new label, as expected. And now when I exec into the container and try to do exactly what I was doing earlier to access my home directory, it works fine; I no longer get permission denied. When I go to /var/spool and do the same thing, I can read it. Now let me try writing to it, and you can see here that I am able to write to it. So Udica creates custom policies for you that let you mount volumes into a container without having to relabel the content and without having to disable SELinux confinement completely. So that's really cool. We've talked about Linux capabilities, SELinux, and seccomp filtering. Let's talk now about user namespaces. As I mentioned earlier, Red Hat has been leading the way in driving user namespaces forward. Just a little background: in Linux, a namespace gives you an isolated view of your system with regard to a set of resources. For example, in a container you're in a PID namespace, so you only have access to the processes in that PID namespace. Within a user namespace, you only have access to the range of UIDs and GIDs that are in that user namespace. This provides an extra layer of isolation by mapping a privileged root user inside the container to a non-privileged user on the host. So if a process were to break out of that container, it wouldn't be a privileged user on the host. That's the idea, and in fact UID separation has always been the standard security tool on shared Linux systems.
Podman does some really cool things with user namespaces. User namespaces are the reason these tools can run rootless. And they're also really effective at providing separation: you can imagine, if you had a Kubernetes environment, it would be a huge boost in security if every container were separated by user namespaces. But sadly, nobody is using user namespaces for container separation yet. There are still some issues to work out, and again, Red Hat has been and is leading the way with this work. One problem is that there's still no support in Kubernetes for user namespaces. UID shifting with volumes in Kubernetes is difficult; it requires kernel support that isn't quite there yet. When mounting a volume from a host into a pod, the ownership of files is not automatically updated. The community has been working on this for years, and it is moving forward, but it's a difficult problem. Also, chowning is slow. Say files are owned by UID 0. When you're in a user namespace, any files owned by UID 0 outside the namespace show up as owned by nobody. It literally says nobody, and that's what happens to all root-owned files inside a user namespace. So you need to chown all those files, and that is prohibitive because it's slow. But the container storage team, led by Nalin, and the kernel storage team, with Vivek, have been working on this, and they recently added a new feature to OverlayFS to make chowning and assigning file ownership in user namespaces much faster. So things are moving forward. Also, Giuseppe Scrivano has been working on a prototype of user namespace support in Kubernetes. If anyone can figure it out, it's these three guys. As an aside, Giuseppe also rewrote runc, like, over a weekend, allowing containers to run with cgroups v2. That's a different story, but the runtimes team at Red Hat is working to move this forward, and it's just taking some more time.
So I do have a demo, though, that shows how user namespaces are really effective at separating containers, and in Podman this is easy; you can use this in Podman, no problem. In this example, we're mapping UID 0 inside the container to 100,000 on the host, and doing that for the next 5,000 UIDs. And you can see with podman top that inside the container I am root, but on the host I'm just 100,000, and a ps on your system will show that those processes are owned by 100,000. Now I can run a second container and map UID 0 inside the container to 200,000 on the host for the next 5,000 UIDs. With podman top you can see that on the host this is running as 200,000, inside the container as root, and you can see with a ps that all the processes are owned by 200,000. So now, if a process were to break out of the first container, it would be 100,000 on the host. It wouldn't have elevated privilege, and it would have no access to the second container running as 200,000. It wouldn't even see it; the container storage is separated. So that's just an example of how Podman uses user namespaces. All right, that is a really great tool, but as we saw in the demo, every time we run a container and want user namespace support, we have to set a specific UID map for each container we run. As an end user, that can be pretty tedious to keep track of; if you're running hundreds of containers, you have to know which ranges you've already used and which are still available. So to make it easier on the end user, and to help them move towards Papa Bear easily, we have a new flag on podman run called --userns, and when you set it to auto, Podman will automatically pick a different user namespace for every container that you run, and it will guarantee the uniqueness.
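The mapping demo above, and the --userns=auto flag just mentioned, might look roughly like this (image and container names are examples):

```shell
# Map container UID 0 to host UID 100000, for a range of
# 5000 UIDs (and the same for GIDs):
podman run -d --name c1 \
    --uidmap 0:100000:5000 --gidmap 0:100000:5000 \
    fedora sleep 1000

# Inside, the user is root; on the host it is UID 100000:
podman top c1 user huser

# A second container gets a separate, non-overlapping range:
podman run -d --name c2 \
    --uidmap 0:200000:5000 --gidmap 0:200000:5000 \
    fedora sleep 1000

# Or let Podman pick a unique range automatically,
# optionally with a custom range size:
podman run -d --userns=auto fedora sleep 1000
podman run -d --userns=auto:size=5000 fedora sleep 1000
```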
There are still some issues that we're working through around this, and we plan to test it out completely in Podman; once it's feature-complete and we're happy with its stability, we will add a similar feature to CRI-O and eventually Kubernetes, so everyone running a Kubernetes workload can take advantage of it. And we have a demo to show how this works with what we have done so far. Here I'm just going to run a container and set that flag to auto, and when we look at the user it is running with on the host, we see it's running as user one billion. Now, if I run another container the same way and look at what user it's running as on the host, it's running as one billion one thousand twenty-four. The reason it picks an offset of one thousand twenty-four is that the default range size Podman automatically picks is 1,024. But let's say you know your container needs a wider range, say a range of five thousand. We can do that with the same flag: just add a colon and size= whatever size you want. Here I wanted five thousand, and when I run my container, we see that its UID range starts 1,024 beyond the previous container's, because the previous container's range size was 1,024. But if I look at the map in this container, we see that the range here is set to five thousand. So this is still a work in progress; this is how far we've gotten, and we definitely plan to add more to it. Okay, so finally, the last thing we want to talk about is containers.conf. This central file is a feature being added to Podman now. It's a central location where you can set security configuration system-wide, for all of your containers and all of your container tools.
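As a sketch of the kind of security settings containers.conf can carry, in its TOML format; this mirrors the changes discussed in this talk (trimming the default capabilities to ten and adding the ping sysctl), and the exact values are illustrative:

```toml
[containers]
# Default capabilities, with AUDIT_WRITE, MKNOD,
# SYS_CHROOT, and NET_RAW removed:
default_capabilities = [
  "CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID",
  "KILL", "NET_BIND_SERVICE", "SETFCAP",
  "SETGID", "SETPCAP", "SETUID",
]

# Allow ping without NET_RAW, for every container:
default_sysctls = [
  "net.ipv4.ping_group_range=0 1000",
]
```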
So, for instance, the distro might put a containers.conf in /usr/share/containers, an administrator can add one under /etc/containers that overrides that, and then a user can override it further with their own copy. Some things you'd use containers.conf for: you could remove those four capabilities we talked about earlier, the four that nobody really needs, system-wide for all of your containers and all of your tools. If you wanted to enable ping, you could add that sysctl back in for all of your containers, and you wouldn't have to remember that long command with the specific sysctl that she showed in the demo. So that's one way you'd use containers.conf. Another way: some of these commands, like Buildah builds, take tons of flags. You could have 10 or 20 flags and parameters that you need, and if a containers.conf file contains those flags, that just makes it easier for a user to run that image. Same thing with high-performance computing and very high-security environments, where a lot of configuration is required with every container; adding a containers.conf file just makes the configuration easier. We do have a demo with containers.conf as well. The --userns=auto flag and containers.conf should hopefully both be available in the next release of Podman. Okay, so let's just run a Fedora container, and here are all the capabilities: the 14 default capabilities. Now let's edit our containers.conf file down to just those 10 capabilities, taking out the four we talked about earlier that nobody needs. I'm going to pass this containers.conf environment variable to the podman command. You won't normally have to do that; this is just for the demo. I showed you those three file locations in the beginning, and Podman will automatically pick those up. But here, running with our new containers.conf, you can see that only 10 are enabled now; those four capabilities are gone. Now, if we want to enable ping again, we can go back into the containers.conf file and add that sysctl there, rather than having to pass it on the run command. So there: if I pass that containers.conf file to Podman, you can see that ping works, and that will work for all containers on my system. I think that's it. That's the end of the demo, and that's the end of our talk. We do want to thank Duffy, who did all the artwork on these slides. Anything else, Urvashi? Yeah, we just hope that you can see how we're working on making it easier for end users to move towards a Papa Bear security model with all the cool features that we're working on. And that's it. Thank you.