Okay, I think it's been a while, so we can go ahead and start the presentation. We're going to be talking today about the state of container security. I did this talk at DevConf.CZ, and this will be a little bit of an update on it, but I think it's pretty interesting, some ideas that we've been going over. At DevConf.CZ I actually found that a lot of people in Europe don't know who Goldilocks is, don't know the story of Goldilocks and the three bears, so I had to run a video there. Hopefully since this is the US one, people will understand it. The basic idea of the story is that Goldilocks goes into a house with three bears: there's a papa bear, a mama bear, and a baby bear. She tries the porridge: the papa bear's is too hot, the mama bear's is too cold, and the baby bear's is just right. She sits in the chairs: the papa bear's is too hard, the mama bear's is too soft, and the baby bear's is just right. Everything she goes through, she's always picking the "just right" middle option. And that's basically the idea here, because when we look at container security, if we make container security, or any type of security, too hard, people just turn the security off. If we crank security up so high that you can't get your application to work, you're just going to turn it off. And if we make security too soft, what's the purpose? If all we're doing is putting chicken wire around the prison, it's not really doing anything to prevent a breakout. So when we look at container security, we're always taking that middle ground, always trying to get to the point of being just right. That's really what the Goldilocks theme is all about. Now, when I look at container security, I realize that no one ever turns security up. People never run commands like podman run --cap-drop (there's a typo on the slide) to take away privileges.
But basically, people never take privileges away from a container. What they usually do is things like --privileged. If someone is running a container and it doesn't work, they just run podman run --privileged, the container works, and they're happy, even though they've turned off all the security. And of course, historically with SELinux, the joke is always: how do you handle SELinux? You run setenforce 0, which turns enforcement off for the entire machine. So the bottom line with container security is that people either take the defaults or they turn security off. Anytime people come into the equation, those are really the only two choices. Because of that, again, we go with the medium level, the Goldilocks middle ground. The rest of this talk is going to be about how we move from Goldilocks toward Papa Bear: how do I get users to run more securely without the users having to do anything? Because again, if the users have to do anything, all they're going to do is turn things off. How do I get more security without the user taking action? So one of the things we look at is forcing users to be more secure. When you want to run a container in an environment, there are usually three different entities that have input into running that container. The first one we're going to look at is the user: somebody running a podman or docker command, or launching a Kubernetes or OpenShift container by writing a YAML file. Well, one of the things we've done in OpenShift is we've changed the default.
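To make those two user choices concrete, here is a sketch of the two paths people take; the image and capability names are illustrative, not from the talk:

```shell
# What people actually do when a container fails: turn all security off.
# --privileged disables SELinux separation, grants all capabilities,
# and removes the seccomp filter.
podman run --privileged fedora httpd

# What almost nobody does: turn security UP by removing privileges
# the workload does not need (capability names here are examples).
podman run --cap-drop NET_RAW --cap-drop MKNOD fedora httpd
```

The asymmetry is the point: the first command always "works", so that is where users end up unless the defaults are tightened for them.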
If anybody runs containers in podman or Docker, they usually default to running as root, but OpenShift out of the box forces everybody to run as non-root, and makes you jump through hoops to run a root container. That forces good practice onto users, but it also means developers have to build their container images so that they just work in non-root environments. So we're using a little bit of a stick on the developers, on everybody building container images, to move them toward the ability to run as non-root. To me, that's one of the key factors in getting containers to be more secure in your environment: get rid of root altogether. I wrote an article back in 2018 that basically said, just say no to root inside of containers, and I talked a lot about this. A lot of the security features we have to add to containers are all about getting rid of root, getting rid of this powerful root user in the system. And really, just about all containerized workloads don't need root. Theoretically, the only reason you need root is to modify the system, and we don't want containers modifying the system, right? Your web servers, your databases, they all run perfectly fine without being root. It's just that the way we build container images, and the way we traditionally install software, makes the assumption that when a machine boots up, the Apache server is going to get started by root and then drop privileges. In a container world, though, we really want to start applications right off the bat without privileges. It's a little different mindset from the traditional way we install software. So the next entity involved in running a container on a system is the container engine, okay? And the container engine everybody is familiar with is Docker.
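A minimal sketch of what "building for non-root from the start" looks like; the user name, UID, and port below are illustrative assumptions, not from the talk:

```shell
# Build an image whose process starts unprivileged, rather than
# starting as root and dropping privileges later.
cat > Containerfile <<'EOF'
FROM registry.fedoraproject.org/fedora
RUN useradd -u 1001 webuser
# Everything from here on runs as UID 1001, including the final CMD.
USER 1001
# Ports above 1024 need neither root nor CAP_NET_BIND_SERVICE.
EXPOSE 8080
CMD ["python3", "-m", "http.server", "8080"]
EOF
podman build -t nonroot-web .
```

The design choice is simply to pick an unprivileged listening port and a fixed UID, so the image runs unmodified under engines that refuse root, like OpenShift's defaults.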
But Docker is sort of one tool for all use cases. If you think about what provides the default SELinux label, what provides the default list of capabilities, what provides the default seccomp profile: that's all in the container engine. Well, I've always had a big problem with Docker running as root, right? It's a big daemon, and anybody interacting with it has to connect over a socket to be able to launch containers. I also like to differentiate when I look at containers: containers underneath Kubernetes are containers in production, right? When I run my containers in production, I really want to lock them down a lot tighter, security-wise, than when I'm developing containers. If I'm just playing around with containers, or even when I'm building containers, I need different privileges than when I'm running containers in production. But if we only have one container engine, and that container engine has its hard-coded defaults inside of it, then all the different types of workloads end up following the same rules. Running containers in production, building containers, or playing with containers all have the same default rules from a security point of view. And I'd really like to at least take the ones running in production and move them closer to Papa Bear, right? So one of the things we've done over the last couple of years is we took what was the Docker daemon, looked at the functionality it provided, and broke it into a series of other container engines. We have CRI-O as a container engine whose only job is to run Kubernetes. That's containers in production. And because of that, we ship CRI-O with tighter security than we would ship, say, the Docker daemon, or even Podman and the other tools.
Podman was a tool originally meant to implement pretty much the Docker command-line interface, and our goal was to make switching from Docker to something else as easy as possible, so we really just copied the entire CLI. But we used some new concepts and what I felt was a better security model: instead of using a client-server model, Podman uses the fork/exec model, so Podman is the parent of the container it launches. And then Buildah, lastly, is a tool for building container images. It supports things like Dockerfiles, but it also allows you to just create a directory on disk, put content into it, and really get deep, fine-grained control over what's going inside your container images. But the really interesting thing that happened when we built these container engines is that people started experimenting and adding features to allow them to run without requiring root, in particular Podman and Buildah. And we started to see that since we had these smaller pieces, we could start combining them back up. So imagine running a Kubernetes cluster and you take a bunch of Buildah containers and stick them into it: now you can have your entire CI/CD system running inside of Kubernetes, with the Buildah containers locked down by the Kubernetes environment. If you try to do that with Docker, what people do is inject the Docker socket into the Kubernetes containers and allow them to talk to the Docker socket. And years ago I wrote an article that said access to the Docker socket is the most dangerous thing you can do on a computer system, because you can basically become root, take over the machine, and then eliminate all logging that shows you did it.
So anyway, that's one way to get more security: break the container operations apart into separate container engines, and then each engine can have different defaults. The last entity is the person, the engineer, the developer, who is developing the application you're going to run inside the container. That's the person building all the OCI images, the images that sit out on registries like docker.io and quay.io. Most of the rest of this talk is going to be about some of my ideas on how we could do a better job of getting more input from the developer, letting them say: my container will run perfectly fine with these tighter security controls. Because right now there's no way for the developer to give input into which security controls his application needs in the system. It just hasn't been built into the OCI specification at this point. So let's look at a couple of things. We talked earlier about running containers and the problem of running as root. But sometimes people have to run as root, and over time the Linux kernel engineers took the power of root and divided it into what's called capabilities: originally 32 capabilities, and later up to a possible 64. Right now there are actually 37 capabilities, last time I looked anyway. So there are 37 different slices of the power of root. And surprisingly, we can run most containers with only 14 capabilities. Those 14 are what the default list has evolved to over the years. But if I gave everybody online right now a test and asked, what are those 14 capabilities, I would bet almost everybody would fail to name all 14 of them. Even I would fail to list all 14 of them.
So again, they were designed by the Docker project, and do you know what they are? Well, here they are. These are the 14 capabilities allowed by default by Docker, by Podman, by Buildah. But they've sort of evolved, right? It wasn't written in stone that these are necessary; they just sort of evolved. And there are four of them I'm going to talk about right now that I really don't think should be allowed by default. But again, they were allowed by default from the beginning, and I'm trying to get them turned off. AUDIT_WRITE is the first one. AUDIT_WRITE is a capability that allows you to write messages to the kernel's audit subsystem, messages like "Dan Walsh just logged on to the system." You wouldn't think you'd want a container to have that capability, right? The audit log is supposed to be this high-security record of what's going on in the system, and out of the box a containerized process has the ability to write to it. So why is that on by default? Well, historically, it's on by default because when people first started using containers, one of the first things they did was put an SSH daemon inside them. Everybody thought of containers as being like VMs: you'd put sshd in and then SSH into your container. And obviously that's not what you need; you can just do a podman exec and get into the container. You don't need a daemon to get into it. But because people did that, containers were constantly failing when running Docker with an SSH daemon listening. And it was always blowing up because they didn't have CAP_AUDIT_WRITE. When you log into a system, an audit record is written recording the fact that, you know, Dan Walsh logged on to the system.
So sshd was blowing up, and out of the box the powers that be in the upstream community decided to just allow AUDIT_WRITE. Why is it still on at this point? Nobody runs an SSH daemon inside containers anymore, but it's real hard to go back and remove things. The next one I'm going to talk about is MKNOD, which allows you to create device nodes. Now, the container engine goes out and creates all the device nodes it wants you to use. It creates basically all the content in /dev, and it only creates the devices it figures you need; it's not putting in physical disk devices, things like that. And then you, the user, can add additional devices if you want. MKNOD allows the container itself to run mknod, which is really quite a privilege. It's kind of a dangerous capability, because you could directly create device nodes that you could then use to attack the Linux kernel. Now, there are other features of Linux that mitigate this: we have cgroup device controls, in both the v1 and v2 versions, and some eBPF-based controls over which devices you're able to make. But for the most part, you don't need MKNOD to do anything. So why is MKNOD on? Well, it turns out that certain image builds in the world need MKNOD: they create device nodes on the fly during installation. Now, this is something I heard about years and years ago, but again, if you're running containers inside CRI-O, you're probably not going to be doing that. So it should not be on by default there; just have your Buildah containers get it by default.
The other one I have a problem with is SYS_CHROOT, and we go back and forth on this. chroot is a mechanism for creating a chroot environment, but if you're already in a container, you're really already in something like a chroot on steroids. We're experimenting back and forth; it turns out some RPM builds need chroot, so I'm not sure we're going to be able to get rid of this one, but it's another one that just seems like it shouldn't be on by default. And then the last one, probably the worst one, is NET_RAW. As a matter of fact, we regularly hear about CVEs dealing with NET_RAW. NET_RAW allows you to create an ICMP packet; basically, it allows you to create any kind of IP packet in the universe. And NET_RAW has been used to break out of secure virtualization. If you're assigned a VPN, a virtual private network, and don't get access to the host network, there have been occasions when people were able to format certain types of packets, send them out on the VPN, and have those packets somehow break out of the VPN and get onto the real network. So having NET_RAW on by default is really curious. Why do we allow it by default? It turns out the reason is the ping command. When people set up containers, they want to go into the container and make sure it can reach certain networks, and the way most admins and developers do that is with the ping command. ping creates an ICMP packet, which is not allowed by default and requires NET_RAW. Well, there is another mechanism, a Linux sysctl, that can be turned on to allow non-root users, users without NET_RAW, to do pings without requiring the raw capability. So what I'm going to show you right now is a little demonstration of what happens when you take away the NET_RAW capability.
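For reference, this is my best reconstruction of the default list shown on the slide (the 14 capabilities Docker-style engines allowed by default at the time), plus a sketch of dropping the four questionable ones just discussed:

```shell
# Default capability list (14), as shipped by Docker/Podman/Buildah
# at the time of this talk:
#   AUDIT_WRITE CHOWN DAC_OVERRIDE FOWNER FSETID KILL MKNOD
#   NET_BIND_SERVICE NET_RAW SETFCAP SETGID SETPCAP SETUID SYS_CHROOT

# Most workloads still run fine with the four contested ones removed:
podman run --rm \
    --cap-drop AUDIT_WRITE --cap-drop MKNOD \
    --cap-drop SYS_CHROOT --cap-drop NET_RAW \
    fedora echo "still works without the four contested capabilities"
```

Treat the list as historical: the exact defaults can differ between engine versions, so check your engine's documentation rather than relying on this snapshot.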
Okay, so here we have a standard podman run of a container, and it's just doing a ping, right? Out of the box, as I said, the default capabilities include NET_RAW, and that allows you to ping. The next step is I'm going to run a container, and in this case I'm going to drop NET_RAW, and boom, it blows up. So I have a container image that tries to do a ping, we disable NET_RAW, and ping is broken. But there is a sysctl that's been available for, I think, about seven or eight years in the Linux kernel, called net.ipv4.ping_group_range. What this allows is that, for a range of group IDs on your system, anybody in that range can do a ping. So I'm going to drop the NET_RAW capability again, and all I'm going to do is set the sysctl, and boom, I can ping. So now I'm running a container that's much more secure, and that sysctl I just set applies just to the container. It hasn't loosened up the security on the rest of the system, in case you're worried about other users or other containers on the system being able to ping. It's a fairly easy thing to change, and instantaneously you get a little more secure. So that's kind of cool, and it's just one idea of the different ways we can actually move toward Papa Bear. So now imagine a developer has figured out that his container image only needs a couple of capabilities. Say it only needs SETUID and SETGID.
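The ping demo just described can be sketched like this; the group range value is an illustrative "allow everyone" choice, and --sysctl sets the value only inside the container's network namespace:

```shell
# 1. Defaults include NET_RAW, so ping works:
podman run --rm fedora ping -c1 8.8.8.8

# 2. Drop NET_RAW and the same ping blows up with "operation not permitted":
podman run --rm --cap-drop NET_RAW fedora ping -c1 8.8.8.8

# 3. Drop NET_RAW but set the sysctl for this container only,
#    and ping works again via unprivileged ICMP sockets:
podman run --rm --cap-drop NET_RAW \
    --sysctl net.ipv4.ping_group_range="0 2147483647" \
    fedora ping -c1 8.8.8.8
```

Because the sysctl is namespaced, step 3 loosens nothing for other containers or for the host.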
What would be really cool is if they could add a label, some content in the image, that says "my image only needs these two capabilities." Then container engines like Podman and CRI-O could pull down the container image, read that label, and figure out whether those two capabilities are currently in the default allowed set. So if the image says it only needs SETUID and SETGID, Podman looks at its default list and says, oh, I already allow SETUID and SETGID; instead of granting all 14 capabilities, I'm going to launch the container with just the two it needs. Now, if the image came along and said it needed a capability that isn't in the default list, then, at least currently in Podman, it will still run but will tell you that this container needs that capability in order to run. So let's take a look at that. Here I'm creating a container image, basically just a Fedora image, that says it needs SETUID and SETGID, and I'm going to create a container with Podman. Because it pulled down that image and read the label, Podman launched the container running with only the SETUID and SETGID capabilities. If I run a standard container, a regular default container on the system, you see that Podman runs it with the 14 capabilities we've been talking about during this talk. So now I'm going to create another container image, and this time I'm going to use capabilities that aren't in the list of 14. I run Podman, and what happens now is Podman will create the container with the default 14 list, but it will report an error saying that the capabilities requested by the image are not allowed by default: CAP_NET_ADMIN and CAP_SYS_ADMIN.
So you see, we still fall back to the 14 if you ask for ones that aren't in the list. But if you decide that you do want to run the container with those capabilities, because it doesn't work without them for whatever reason, you can go back and run the container again and add them on the command line; those are the commands to add those capabilities. So this is a mechanism for the developer to at least communicate back to the user: hey, my container needs these two capabilities to run properly. Right now that's documented nowhere in the system. With this, the developer can actually document in the container image which capabilities it should run with, and all of a sudden his container will run perfectly well as long as it has those two capabilities. Similar to that, another feature we use for securing containers is a thing called seccomp, and what seccomp is all about is controlling the syscalls that are available to containers. Right now there are about 600 syscalls on an x86 system, and when we use seccomp we're able to filter the set of syscalls a container can make. Let's just say there are about 450 on a standard Linux box; I think it's actually a lot more than that if you include the 32-bit variants. Right now, we ship a seccomp filter that identifies which syscalls are available when we run containers, and we're allowing about 300 Linux syscalls out of those roughly 450, plus the 32-bit syscalls. So we eliminate the 32-bit syscalls, and then we eliminate about 150 others when we run containers. That's pretty good, but 300 syscalls is still quite a bit, right? So can we do better? Aqua Security wrote an article back in 2019 saying they believed most containers can run with only 40 to 70 syscalls.
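The capability-label demo can be sketched like this; Podman reads the io.containers.capabilities image label, and the image name here is illustrative:

```shell
# Build an image that declares the only capabilities it needs.
cat > Containerfile <<'EOF'
FROM registry.fedoraproject.org/fedora
LABEL io.containers.capabilities=SETUID,SETGID
EOF
podman build -t mycaps .

# Podman intersects the label with its default list and launches the
# container with just SETUID and SETGID instead of all 14:
podman run -d mycaps sleep 100
podman top -l capeff   # inspect the effective capabilities of the latest container

# If the label asked for something outside the defaults (e.g. SYS_ADMIN),
# Podman falls back to the defaults and prints a warning naming the
# missing capability, which you can then grant explicitly:
#   podman run --cap-add SYS_ADMIN ...
```

This keeps the decision with the engine: the label can only narrow privileges automatically, never widen them.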
So right now we're giving containers 300 syscalls when they potentially need far fewer than 100 to run. Wouldn't it be good to eliminate those additional syscalls? But of course, there's no way for anybody to figure out which syscalls a container image actually makes. So a year ago last summer, we worked with a Google Summer of Code student and created a new open-source project called the OCI seccomp BPF hook. What this hook does is plug into the container runtime and watch which syscalls are being used inside a container. It generates a seccomp profile by tracing all the syscalls, and you can then use that profile later to lock down your container. This gives you a quick idea: this is the configuration file for the seccomp hook. The interesting thing here on the screen is that you can set this up to sit permanently on your system, just install it. Then, when you want to trace syscalls, you set what's called an annotation. It's basically a command-line option to podman or to CRI-O; you can put it in your Kubernetes YAML to say, I want this container image, or this pod, to be traced, to have all of its syscalls recorded. What I'm going to do now is use podman and run the annotation to basically run "fedora ls /", just an ls of /, and I'm telling the hook to record all of the syscalls that happen there. And there, it just ran the command and traced all of the syscalls. If you look up here, you'll see that I'm telling it to store them all in a file called myseccomp.json. And now if I look at myseccomp.json, it shows all of the syscalls that ls / uses when it runs in a container. So there you see, it's probably about 25 or 30 syscalls.
So now I'm going to put it in enforcing mode: instead of tracing, I'm actually going to use the generated seccomp file for the container. Oh, by the way, I didn't mention earlier in my talk that the seccomp rules for those default 300 syscalls were actually developed by the Docker project; Jessie Frazelle was the one who led it. It went back and forth, and they basically tried to find the real Goldilocks setting: what will the bulk of all containers run with on the system? So that default was a real Goldilocks moment. But here we have a real Papa Bear, in that I can run the container with just the syscalls the container needs to be able to run. Obviously that's a much more secure way of running a container: if something hacks into my container and causes a different syscall to be used, it'll be blocked by the kernel. So here I'm going to run the exact same command, but this time adding -l, the long-listing flag, to the ls command, and you're going to see the container blow up. That shows the container was actually blocked by seccomp when it tried to make syscalls that weren't allowed by the profile. I'm not sure why my auditing system is not logging it right now, but there should be records in the audit subsystem showing which syscalls were attempted; I'll show you them in a minute. So what we're going to do now is use the annotation again. I've taken it out of enforcing mode, back into a permissive, tracing mode, and this time it's going to take the original file as input: with an input file of myseccomp.json, it's going to generate a new file called myseccomp2.json. So basically it's taking as input what was allowed before and then adding the new syscalls.
And now I'm just running the ls -l command to get the long listing of /, and there you see a long listing of / inside my container. And now, if I run the same command in lockdown mode with the new file, it works instantaneously. Now I have a broader set of syscalls available to the container. And if we want to look at what -l added, why did -l not work before? It basically didn't work because it needed these syscalls, and if you understand the way Linux works, I can explain some of them to you. When I do an ls -l, instead of just showing UIDs, ls needs to translate each UID into something like "dwalsh". The way that happens underneath, on a Linux system, is that ordinarily it just reads /etc/passwd and translates it. But on a modern Linux system it's a little more complicated than that. The ls command is actually using a feature in glibc called NSS (nsswitch), and the switch is set up by default so that instead of reading the /etc/passwd file directly, it goes out to a daemon: it connects over a Unix domain socket to the sssd daemon. So it needs the socket syscalls to connect to that daemon, and then it needs the socket connection to communicate back and forth. It also does things like read the xattrs out of the file system. Those are basically the additional syscalls that -l needed. But the basic idea is that you can continue to watch a container image and incrementally generate more and more of its syscalls. Obviously, out of the box you can't figure out all the possible ways people are going to run an image. But imagine you run a Kubernetes cluster and you say: I'm going to put this new application through my CI/CD system, run all my tests on it, and have the syscall tracing tool watching it constantly.
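The whole trace, enforce, and regenerate loop from the demo can be sketched as follows, assuming the oci-seccomp-bpf-hook package is installed on the host; paths and the image name are illustrative:

```shell
# 1. Trace: record every syscall the workload makes into a profile.
podman run --rm \
    --annotation io.containers.trace-syscall="of:/tmp/myseccomp.json" \
    fedora ls /

# 2. Enforce: run the same workload allowing only the recorded syscalls.
podman run --rm --security-opt seccomp=/tmp/myseccomp.json fedora ls /

# Adding -l needs extra syscalls (socket to sssd, getxattr, ...),
# so under the same profile it blows up:
podman run --rm --security-opt seccomp=/tmp/myseccomp.json fedora ls -l /

# 3. Regenerate: feed the old profile back in and record the additions.
podman run --rm \
    --annotation io.containers.trace-syscall="if:/tmp/myseccomp.json;of:/tmp/myseccomp2.json" \
    fedora ls -l /

# 4. Enforce with the widened profile; ls -l now succeeds.
podman run --rm --security-opt seccomp=/tmp/myseccomp2.json fedora ls -l /
```

Each pass only ever widens the profile by what was actually observed, which is what makes the CI/CD "watch, then enforce" workflow described next practical.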
And once it gets out of the CI/CD system, I probably have a pretty good idea of the syscalls that are used to run that container. Now I can take that container, put it into production with those syscalls, and have another tool apply the tracing again, this time watching for additional syscalls. Say you run it for three months in that permissive mode. If, after three months, you find no additional syscalls being added, then probably at that point you can put it into enforcing mode. Basically, this gives you the tools to monitor a container and do exactly that. So now imagine, similar to what we talked about before with capabilities, that I could take this generated seccomp file and distribute it to the environment. What I would propose is that we put that seccomp profile inside the OCI image, in the JSON that describes what's inside the image. The developer could go out, figure out which seccomp rules this container image needs, associate them with the image, and embed them. Then the container tools could be made smart enough to compare the default filter, those 300 default syscalls, with the ones the container image is asking for, and if everything the image asks for is within the default 300 allow rules, just run it in the much tighter environment. There would have to be no interaction with the user, no interaction with the software management or anything else; it would just work out of the box. Okay, so SELinux. I often talk about how great SELinux is for securing the system. And if you look at just the number of security vulnerabilities in the container world that SELinux has blocked, it's pretty impressive.
The last one here, the last CVE down here, was actually a vulnerability that allowed a root process inside a container to overwrite the tool that launches the container. runc is the tool used to configure and launch all the containers, so imagine if you ran a container that exploited this vulnerability: you could overwrite runc, and then all future containers launched would be under your control. SELinux actually protects the file system from container escape, and that's usually where container escapes happen: getting out to the file system and being able to wreak havoc. SELinux basically keeps things contained. It has blocked most of these, and it's the best tool, in my opinion, for preventing file system escape. SELinux has additional features, like the ability to control capabilities, but in our default policy we allow containers access to all Linux capabilities and full access to the network. We're not using those parts of SELinux for control, because we're relying on other parts of the kernel for that. So again, it's the Goldilocks trade-off. Containers work really, really well with SELinux, but the one area where people fall into issues with SELinux is around volumes. Basically, a volume is a way of exposing parts of the host operating system into the container: taking parts of the file system and injecting them into the container. SELinux functionality was added to Docker and Podman because any content you take from the host is going to be labeled incorrectly for use inside the container. So we added options to the volume mount: lowercase z and capital Z. What the capital Z does is tell the container engine, Podman here, to fix the labels on that directory: it runs a recursive relabel of all the content.
So if you run a MariaDB container and you want the database stored on your disk, you can just create a directory, mount it into the container with colon capital Z, and Podman will relabel that directory, and all the content in that directory will be private to that container. The lowercase z allows you to have it shared among all containers, but still isolated from all the rest of the content on the host. But the second line down here, podman run with /var/log, could be a problem. If you were running Fluentd and you wanted access to all the logs on the host in order to, say, export them, running the colon Z here would be a really serious issue, because it would relabel all the content of /var/log. And there are probably other confined parts of your system that are not allowed to write to a container label but need to write to /var/log. So lots and lots of the system would blow up if you ran a colon Z at that level. This is the usual place where SELinux users stumble. So what do we do for that situation? We don't want people relabeling system directories like that; that's a bad idea, or the host will break. And the only option, by default, is to basically turn SELinux off for the container, to execute with --security-opt label=disabled, which tells the container engine not to use SELinux separation. But that kind of sucks, because I just told you that SELinux is probably the best tool for container separation that we have, and now we turn it off for a fairly common use case. So the upstream SELinux maintainers have built a new tool called udica. udica is a tool that actually understands containers and understands the JSON file that's associated with containers.
So basically, you build a container, udica will examine the configuration of that container, and it will generate an SELinux policy based on that. Oops. Let's see if we can get this running. So here I'm going to run a container on my system, and I volume mount /home into the container, and /var/spool. This is a standard run with SELinux in enforcing mode. And if I do an ls of /home, it gives me permission denied, right? Because we don't want a container that broke out to be able to go read the home directory of a user. And similarly, we mounted /var/spool as read-write, but if I want to go do something there, I'm also going to get permission denied. SELinux is the only thing that's preventing this on the system, but you're sort of stuck: if you wanted a container that did this, you would basically have to run with SELinux disabled for the container. So now I'm going to run udica. Basically, I'm going to inspect the container I just generated and pipe it into udica. So right there I inspected it, and just from that inspection, udica went out and generated policy. So here are a couple of commands; I'm about to execute two commands to update the SELinux policy on the system, to create a new policy type for my container. The first time I just ran with the standard label, but now I'm going to run my container with my generated type. This created a new my_container process type, and that's the only thing that's changed. So instead of having to run with SELinux disabled for this container, I could actually generate a policy for the container that easily. And here you can see the container running on my system with the new label. And now I'm going to enter the container. If I go into /home, I see content. And if I go and touch content in /var/spool, I'm able to do it.
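The demo just shown boils down to a few steps. A rough sketch, where the container name my_container is hypothetical and the podman/udica/semodule lines need an SELinux host with udica installed, so they're comments:

```shell
# 1. Run the container that stock SELinux policy currently blocks:
#      podman run -d -v /home:/home:ro -v /var/spool:/var/spool:rw \
#        --name my_container fedora sleep 1000
# 2. Generate a tailored policy from the container's configuration:
#      podman inspect my_container | udica my_container
# 3. Load the generated module plus udica's base template
#    (udica prints the exact semodule command to run):
#      semodule -i my_container.cil /usr/share/udica/templates/base_container.cil
# 4. Re-run under the generated type instead of label=disabled:
#      podman run --security-opt label=type:my_container.process ...

# udica names the generated process type "<name>.process":
name=my_container
echo "${name}.process"
```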
So this container is still under confinement, but the only additional things it's able to do now are read the content in /home and write content in /var/spool. And so now you can take something like the Fluentd container that we showed earlier and actually run it with SELinux in enforcing mode, still locked down, but slightly looser, slightly closer to Mama Bear, as opposed to going all the way past Mama Bear to disabled. So the last thing I'm going to talk about in this session is user namespace. User namespace is something that people have dreamed about for years and years. The basic idea is that user namespace allows us to map non-root users, to give root to a container but not be root on the system. So if you broke out of a container, even though you were root inside the container, you would be non-root outside. And this is actually the secret of Podman: when you're running rootless Podman or rootless Buildah, we're actually using the user namespace. The user namespace allows us to fake you out. When you look at root inside of your rootless container, if you left that container and looked at that process, that process is actually just you, your UID. All we're doing is mapping your UID outside of the container to be root inside of the container. The problem is that, other than Podman, none of the others, CRI-O and Kubernetes and Docker and tools like that, have ever used user namespace. So really we're not using it for container separation. There have been some efforts to potentially run the container engine in a user namespace, but not to have each container launched with a different user namespace. User namespace would be really cool if we launched every single container with a different user namespace. And we're working towards that, and we're hoping to get it into Kubernetes soon.
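The "fake you out" mapping in rootless Podman is driven by subordinate UID ranges in /etc/subuid (and /etc/subgid for groups). A sketch of what an entry looks like and how to read it, with an illustrative user name and range:

```shell
# An /etc/subuid entry allocates a user a range of subordinate UIDs:
#   user : first host UID in the range : number of UIDs
cat > subuid.example <<'EOF'
alice:100000:65536
EOF

# Parse the entry the way the tools do (colon-separated fields):
IFS=: read -r user start count < subuid.example
echo "$user gets host UIDs $start through $((start + count - 1))"
# -> alice gets host UIDs 100000 through 165535

# Inside alice's rootless container, UID 0 maps to alice's own UID,
# and the rest of the container's UIDs map into this range.
# On a real system you can see the mapping with:
#   podman unshare cat /proc/self/uid_map
```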
There are some features that are lacking, but we're building up our tooling to be able to do things like fixing up the file systems to make this work. And so here I'm gonna do a quick demo of user namespace. Podman has this built in; you can specify a user namespace. I can take a little container, and if I entered that container right now, it would say it's running as root, but I use podman top here to show you that that container is actually running as UID 100,000. And here, if I actually examine the system to look at that sleep command I just executed, you see it's running as 100,000 on the system. And then I launch a second container as UID 200,000; the reason this is pausing is that it's actually chowning the file system. Okay, so now I've created a second container. So now I have two containers running on my system. Both of them have root inside of the container, and they can do things like run multiple UIDs inside the container. But on the host, one's running as UID 100,000 and the other one's running as UID 200,000. If those container processes broke out and got access to the system, the system would just treat them as UID 100,000 and UID 200,000. They wouldn't have any power. This is basically the standard security that we've been relying on for 50 years in Linux: we have full UID protections. The problem with that is that the user, again, has to figure out which user namespace to use to make sure each one of those containers is separate. So we've added a flag to Podman, again, to automatically pick unique user namespaces. So this one went out and picked this huge number, and the process running on the system is running as that huge number. And now we're gonna run another container, and it's gonna pick a huge, different number. So now we don't have to do any of this bookkeeping ourselves.
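The mapping in this demo is just offset arithmetic: a container's UID u shows up on the host as base + u, where base is the start of that container's range. A small sketch mirroring the demo's numbers (the podman lines need podman, so they're comments; UID 26 is just an example service UID):

```shell
# The two demo containers were started with different offsets:
#   podman run --uidmap 0:100000:5000 ... sleep 1000
#   podman run --uidmap 0:200000:5000 ... sleep 1000

# Root (UID 0) inside the first container appears on the host as:
base=100000
container_uid=0
echo $((base + container_uid))     # 100000

# A service running as UID 26 inside the second container appears as:
base=200000
echo $((base + 26))                # 200026
```

So two containers that both "have root" never overlap on the host, which is exactly the escape protection described above.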
And basically Podman, in its database, in its container storage, is keeping track of all the user namespaces it's using, and is able to give you a different user namespace for each one of them. So here I created another one, and this just shows you what the mapping inside of the container looks like. This one says that it's mapping UID zero to this gargantuan number. And then we asked for only 5,000 UIDs, so it maps the next 5,000 UIDs into the container, and the container can take advantage of user namespace. Podman, under the covers, is setting that up, using those random UIDs and fixing up the file systems and things like that so that everything will work. So right now there's a pull request into CRI-O, I think it's just been merged, to allow CRI-O to do similar functionality. So CRI-O will be able to launch lots and lots of containers underneath Kubernetes, each one of them inside of its own user namespace. So at this point I'm gonna stop sharing and see if anybody's asked any questions. Has someone been reading the questions to ask them? I can do that. That'd be great. So Mike asks whether these runtime namespace selections are separate from what we would previously have to specify in the /etc/subuid file. So the /etc/subuid file and /etc/subgid file are something that's specified for each user on the system. What I'm trying to show here is running lots and lots of containers as root and then putting them into separate user namespaces. We can do something similar with rootless containers, except for the number of UIDs the user has been allocated. By default, a user running on a Linux system gets about 65,000 UIDs. If he's gonna create multiple containers using the automatic mode, he's gonna use up his UIDs very quickly, as opposed to root on the system, which has basically four billion UIDs. So as root, even if you allocate 65,000 UIDs to each container, you can get a lot of separation. So the real goal to me is not, well, it's a cool feature for rootless.
It's also a much cooler feature if you're running a server with lots and lots of services on it. Those services require real root to use user namespace, and really the end goal is to get this into Kubernetes, so that Kubernetes could launch 50, 200, 300 containers, each one in a different user namespace, and we've basically added another layer of defense. Okay, so James Kessel asks if this is audit2allow for containers. It's actually smarter than audit2allow. audit2allow basically waits until something goes wrong; it just reads what's written to the audit log, the things that were blocked by SELinux, and then basically generates allow rules. What udica does is actually look at those two volumes that were mounted into the container, and it figures out, oh, he needs to be able to read; it's a read-only flag on the /home directory. So it generates policy that says the container can read any type that is defined and stored underneath /home. Similarly, /var/spool was read-write; it looks at /var/spool, looks at the other policies installed on the machine, and generates policy that says the container can read and write any types that are stored under /var/spool. So it's really sort of an intelligent and proactive way of handling this. It's not after the fact: before you run the container, you design your container, and then, when you realize that SELinux is gonna be a problem, you run udica to examine how the container runs, and it comes back and says, oh, this container would work better if it used this SELinux type, and generates the policy to be used. Right, and finally we had Michelle Salim ask whether overlaying the labels is an option. I guess this is in the context of SELinux labels. Yeah, so really, I think what you were asking there is: could I have two different types for a file?
Could I somehow do a bind mount and have it change the label of the files? Really, from a security point of view, SELinux has always blocked that capability from going into the Linux kernel, because they believe that if a single file on the system has two types, you can't judge the security of that file, right? Basically, they want every object to have a single label on it, and then you enforce based on that label; there's an analysis that can happen at that point to make sure only the intended processes have access, and you need ways of discovering and studying those types. So they never allowed two types to be stored on a single file at the same time. As far as bind mounts, there is what SELinux calls a context mount. When I mount, say, an NFS directory, I can say this NFS directory is gonna be the container_file_t type. But bind mounts don't support that, mainly for the same reason: it would be easy to confuse the Linux kernel if you were able to do that with bind mounts. Great. So we have one more question. Mike asks whether the udica rules are specific to the image or the container they were generated against. Well, theoretically, yes. It's more towards the container than the image, because it's examining the volumes and capabilities and things like that that you specified when you ran the container. So in this case, the developer of the application has designed, say, a podman or docker command to execute his container, and you might specify somewhere in the help documentation: this is the command I would suggest you run for running the container. And what udica would then do is look at that and generate policy on the fly to allow you to control what that container does on the system, from an SELinux point of view. Great. I think that's the end of all the questions. Thanks a lot, Dan. I think this talk was really informative.
For one, I didn't know I could do 300 syscalls from my container, so that's great. If you want to look at the syscalls, go to /usr/share/containers/seccomp.json. That's the whole list. Okay. I think someone's supposed to be talking right now, so I've got to get out. All right, bye now.