The next talk will be given by Dan Walsh from Red Hat, about generating seccomp profiles for containers using Podman and eBPF. Enjoy.
All right, thanks for having me. Last year I gave a talk on replacing Docker with Podman. This one is not that much about Podman, but Podman's the tool that we're using to do this. I don't know if I did this last year, but I usually do this beforehand, since it drives me nuts. Everybody please stand up. Please read out loud all the text that is in red. Okay, this is my Don Quixote moment. I'm just trying to get people to expand beyond Docker and not have to prefix every word with "Docker", unless you want to call all Linuxes Red Hat Linux.
So the goal of this talk, or of the tool that we're developing, is to run a container with tighter syscall filtering using seccomp. Does everybody know what seccomp is? All right, good, most of you. Basically, what seccomp does is allow us to shrink the attack surface on the kernel. When I give talks on container security, I used to talk about things like SELinux and capabilities, and then you get to seccomp. Seccomp's really cool in that, on a Linux system on an x86_64 machine, there are about 650 syscalls. If we can cut out a whole bunch of those syscalls, then when a vulnerability in any one of those syscalls might lead to a kernel exploit, if you don't have that syscall, you can't get the kernel exploit. So what we really want to do is cut down on syscalls. This is what Wikipedia defines as seccomp, but that is really talking about the initial seccomp, which cut you down to, I think, four syscalls. I think it was read, write, exit, it's in there somewhere. We're really talking about seccomp BPF, which is Google's enhancement of the original seccomp: you start with all the syscalls and then choose which ones you want to control.
So the goal, again, is to run containers with tight syscall filtering. The state of the art in containers was actually developed by Jessie Frazelle. Jessie Frazelle, at the time, was working for Docker, and she wrote what you'd call a rant on usable security. What we wanted was to take advantage of seccomp, but the types of jobs that run inside containers are so varied. She said that, having written the default seccomp profile for Docker, she's pretty familiar with how hard it is for people to use, and that it requires a deep knowledge of the application being contained and its syscalls. What syscalls does a container require? And then she says down here that turning on something that will cause EPERM by default, if we left out an important syscall, is terrifying. So really what she's saying is that if we had tight syscall filters running inside all these containers, it would just cause people to turn seccomp off. So we had to have a fairly loose default profile for seccomp. What she wanted was seccomp filters that are generated at build time and applied at run time. And then she basically said down here, if it causes any problems, people will just turn off the security. I'll get to talk a little bit about that at the end. This happens all the time, and of course SELinux gets mentioned, because I'm sure everybody in this room has setenforce 1.
But anyways, my picture didn't show up. That's too bad. I've always had a problem with the term "one size fits all". There's supposed to be a picture there of someone with a little tiny hat on his head. I have a size eight head, so it's a very large head, and anytime I shop for a hat, it's always a hassle. You pick up these hats that say "one size fits all", I put one on my head, and it looks like a little beanie on me. Security is the same thing. In our container world, we have one size fits all for the security.
And really, unless you're going to go deep into analyzing how containers run, how do you figure out what syscalls your application is going to make? So I'll talk quickly about what SELinux did, because SELinux really had the same problem. SELinux policy is basically written for a certain application, say Apache or a database. We'd have a general idea of how an application would run, and we'd write a generalized framework for what the application did. Then we would put that policy in place and run in what we call permissive mode, and run it through the entire test suite. Then we'd go to the auditing system, to the audit logs, and continuously pull out the rules as they came along. There was a tool called audit2allow that translated the AVC messages showing up in the audit logs into allow rules. And then we'd go back to step two and just keep at it. The first computer course I ever took explained how you would write a program for taking a shower in the morning with shampoo. It used to say: lather, rinse, repeat. And of course, you'd be in the shower until the thing ran out of shampoo or the kernel crashed, right? So basically, you'd keep on doing this. Then we would take that policy and put it out into Fedora, into Rawhide, and we'd get bug reports, and continuously get bug reports. Over time, we figured out what the application did. So it's not easy to figure out how applications work.
So this summer, we decided to do a Google Summer of Code project to try to build tools so that we could at least do what we did with SELinux to figure out seccomp rules. This guy here, Divyansh, did most of the work on this. Valentin Rothberg is on my team and he really mentored him, and I was involved in the process as well.
I wanted to have one of those two guys give this talk, but neither one could make it here. So that's why you get me. The first step was to investigate how we would figure out what syscalls were happening inside of a container. The first idea everybody has is, oh, just use strace or ptrace. If you saw the Inspektor Gadget talk yesterday, you'd realize that that's slow and very difficult to filter. It's also hard to figure out what the processes inside the container are and how to follow along with them.
The next idea is to do what SELinux did, which is basically to turn on seccomp auditing and then just keep going to the audit logs and get the information about which syscalls are being called out of the audit log. The kernel can report those to the audit log. But the problem is the kernel is just going to start spewing all syscalls into the audit logs. You could get it down to a process, but there's no idea in the audit log of what a container is. The Linux kernel has no idea what a container is. There's been a pull request in the kernel for five, six, seven years now to add what's called a container ID to the audit log, so that you could go into an audit log and figure out at least that all these processes come from the same container, and maybe eventually trace a bunch of processes back to an individual container. Because that's missing, it very quickly became difficult to figure out from the audit log which container generated the records.
So at this point we went and looked at eBPF. And eBPF, to me, is a really cool thing. As I said, I'm giving a lot of credit to Inspektor Gadget. That was really an awesome presentation, showing all the different things you can start to do with eBPF, basically revealing information out of the kernel.
So what we're doing is very similar: basically watching sys_enter for the process. Every time the process goes into sys_enter, we match on its PID, and we're able to watch all the syscalls that process generates. And we can watch all of its children and grandchildren. The other thing is that the talk on Inspektor Gadget yesterday said they were taking advantage of cgroups v2 to figure out which processes are inside of the container. The problem is we didn't have cgroups v2 everywhere. So we're looking at the syscalls as they come in, looking at the PID namespace, and deciding whether or not the process is inside the container based on the PID namespace. So it's a little bit different, but eventually, once cgroups v2 becomes prevalent, I think we probably want to take advantage of that filtering. That's basically what's going on inside of this thing.
An interesting problem in this is that runc creates the process that becomes PID 1 of the container early on, but then it continues to do setup work in that process, and the things that it's doing there we don't want to record as being part of the seccomp filter. So what we really needed to do is figure out the three or four syscalls that come in at the beginning that we need to drop, because we don't want those being allowed, because they're very, very privileged. Imagine runc sitting there setting up namespaces and all that type of stuff. We didn't want that being allowed to the container process.
So he's opened up a pull request, and originally this talk even says that this is associated with Podman. It's being done underneath Podman, but he's basically creating an OCI runtime hook, one that can be run underneath any container engine.
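The PID-namespace check described above can be illustrated from userspace. The tool does this inside an eBPF program in the kernel, so the Python below is only a sketch of the idea, not the hook's code. The function names and the `proc_root` parameter are made up for illustration; the one real interface it relies on is the `/proc/<pid>/ns/pid` symlink, whose target identifies the PID namespace a process lives in.

```python
import os


def pid_namespace_id(pid, proc_root="/proc"):
    """Return the PID-namespace identifier of a process, e.g. 'pid:[4026531836]'."""
    return os.readlink(os.path.join(proc_root, str(pid), "ns", "pid"))


def in_container(pid, container_init_pid, proc_root="/proc"):
    """True if `pid` shares a PID namespace with the container's first process."""
    try:
        return (pid_namespace_id(pid, proc_root)
                == pid_namespace_id(container_init_pid, proc_root))
    except OSError:
        # The process may have exited, or we may not be allowed to inspect it.
        return False
```

A tracer built this way would record a syscall only when `in_container(...)` is true for the calling process, which is how children and grandchildren of the container's first process get picked up without cgroups v2.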
So you could run it inside your Kubernetes underneath CRI-O or containerd. Theoretically you can run it with Docker or Podman. Anything can do it, because it's really separate from the tool, but we'll be using Podman to generate the profiles. Right now it's a pull request on Podman, but I think we're going to put it out as a separate package so that people can just download it and play with it.
So how do we start tracing? Well, we need to know when the container starts. When you run Podman, the container doesn't start when Podman does. Podman's going to go out and create an OCI runtime spec. The OCI runtime spec is going to be read by runc. runc is eventually going to do the fork and exec to create PID 1 of the container, and then it's going to actually exec the container, so the container doesn't start until that last point. The best way to catch that is with OCI hooks, because hooks are in the OCI spec: the OCI runtime will call out to individual hooks at certain points in the container lifecycle. So after it creates PID 1 of the container, it actually stops and calls out to the OCI hook, and hands us at that point the actual PID that's going to be the PID of the container, but the container's not doing anything yet.
OCI hooks have been around for a little while. They allow you to do things at stages like prestart and poststop, so the OCI tools will call in as the container starts and as the container stops, at different phases. So what happens is we run the tracer from the OCI prestart hook: it attaches the eBPF program, watches for sys_enter, and then starts recording. When we're done, right now he sends a signal to the tracer process to say the container is done, so save out to your file.
We use Podman for testing, and one of the cool things we did with OCI hooks, and I'm going to show you what an OCI hook looks like in a second, is that you can set up OCI hooks so that they only run under certain conditions.
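The idea of a hook that only runs under certain conditions can be sketched as a small matcher. This assumes the hook definition has a `when.annotations` map of regular expressions, in the style of Podman's hook configuration files, and that every configured pattern must be matched by some annotation on the container. The hook path and the annotation key below are hypothetical placeholders, not the real tool's values, and `hook_applies` is an illustrative helper rather than anything the container engine exposes.

```python
import re

# Illustrative hook definition; path and annotation key are hypothetical.
HOOK_CONFIG = {
    "version": "1.0.0",
    "hook": {"path": "/usr/libexec/oci/hooks.d/example-trace-hook"},
    "when": {"annotations": {r"^example\.trace-syscall$": r".+"}},
    "stages": ["prestart"],
}


def hook_applies(config, annotations):
    """Return True if every configured key/value pattern matches some annotation."""
    wanted = config.get("when", {}).get("annotations", {})
    if not wanted:
        # Simplification: this sketch only models annotation-based triggering.
        return False
    for key_re, value_re in wanted.items():
        matched = any(
            re.search(key_re, key) and re.search(value_re, value)
            for key, value in annotations.items()
        )
        if not matched:
            return False
    return True
```

With this shape, a container started without the annotation never launches the tracer, so containers that are not being profiled pay no tracing overhead.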
So the first one we built was oci-systemd-hook, so we could set up the systemd environment before it runs a container. But you'd want to know whether the container's going to be running systemd as PID 1. Another thing you can do is trigger on annotations. So this is the way you trigger a container to start doing the seccomp tracing. It basically says we're using the io.containers.trace-syscall annotation, and then we're outputting to a certain file. And that's the file we're going to look at for the generated syscall filters.
Okay, so let's actually do a demo. When I talked about one size fits all: just before I sat down to start talking to you guys, I went in and counted. Remember I said there are 650 syscalls on x86_64? Well, just turning on seccomp, you eliminate half of them, because you don't tend to run 32-bit code inside of a container, so we can turn off all the 32-bit syscalls. That drops us down to around 325 syscalls. If you go through Jessie Frazelle's syscall list, it drops it down to about 313 syscalls. So as much as seccomp seems like it's going to be powerful, we require all these hundreds of syscalls just to stay general purpose and not have everybody turning it off. So it's good, we went from 650 down to 313, but it could be better.
Okay, so when you're configuring OCI hooks, this is sort of what they look like. You have the definition of the hook, the executable, and then you can tell the specification that you only want to run it if there's an annotation that looks like that. Otherwise the executable won't start, so you don't have any overhead if you're not tracing. So that's basically what a hook specification looks like. Up here you see I'm running the container: I'm running Podman with that annotation, writing out the syscalls, and I just did an ls of / to generate the syscalls, and that generated something that'll look like this.
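A generated profile of this kind can be sketched in a few lines. The field names (`defaultAction`, `syscalls`, `names`, `action`) and the action values `SCMP_ACT_ERRNO` and `SCMP_ACT_ALLOW` follow the seccomp JSON format that OCI container engines accept. The `build_profile` helper and the syscall list in the usage note are illustrative assumptions, not the actual output of the tracing tool.

```python
import json


def build_profile(observed_syscalls):
    """Build a deny-by-default seccomp profile that allows only the observed syscalls."""
    return {
        # Anything not listed below fails with an error instead of running.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {
                "names": sorted(set(observed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


def write_profile(observed_syscalls, path):
    """Serialize the profile to a JSON file that a container engine can load."""
    with open(path, "w") as handle:
        json.dump(build_profile(observed_syscalls), handle, indent=2)
```

For example, `build_profile(["openat", "getdents64", "write", "close", "openat"])` yields a single allow rule with four unique names; that list is a made-up subset, not what `ls /` actually needs.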
So basically it's a JSON file that the OCI runtime understands, and it says the default action is to return an error, but if a syscall is listed here then just allow it, and these rules are the syscalls that it actually found when the container was running. So now I'm going to run the container again, and this time I changed the option on Podman to actually use my newly generated seccomp filter. The first time I used the default, which was the one Jessie Frazelle wrote. Now I'm running the same exact command, and sure enough, it's fully allowed.
So I decide to change things around down here. I'm about to run it again, and I'm going to run it with an ls -l. Okay, so just slightly more, but I'm still running with that nice tight policy, and guess what, I got permission denied. Because I'm running seccomp filters on it, just adding the -l causes a problem. If I go into the audit logs, you'll see the audit records for all the syscalls that were denied. If you look at this, the first run didn't do this: this one's looking for xattrs, it's basically trying to read the extended attributes of the individual files. Down here are actually the ones that we're missing. There's a connect, a futex, getpid, getpid. What's happening here is that this is actually going out and creating a socket. Just doing an ls -l of a file, when it does the getpwuid to look up the user ID, is actually going to talk to SSSD on the system and connect to a Unix domain socket for that. So you can imagine that there's suddenly an expansion in the number of syscalls just from adding that.
So now I'm going to take these syscalls; basically I have a bash script that grabbed them. This is when you do the unthinkable and actually run vi on a JSON file in front of 50 people and hope that you don't screw up.
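The manual step in the demo, pulling denied syscalls out of the audit log and adding them to the profile, could be scripted roughly as follows. This is a sketch under assumptions: it expects audit records of type `SECCOMP` that carry a numeric `syscall=` field, and it uses a deliberately tiny x86_64 number-to-name table covering only the syscalls mentioned in the demo. `denied_syscalls` and `extend_profile` are hypothetical helper names, and the sample records in the usage note are invented.

```python
import re

# Minimal x86_64 table for the syscalls named in the demo; a real tool would
# need the full table for the target architecture.
SYSCALL_NAMES = {39: "getpid", 41: "socket", 42: "connect", 202: "futex"}

SECCOMP_RECORD = re.compile(r"type=SECCOMP\b.*\bsyscall=(\d+)")


def denied_syscalls(audit_lines):
    """Collect names of syscalls reported in seccomp audit records."""
    names = set()
    for line in audit_lines:
        match = SECCOMP_RECORD.search(line)
        if match:
            number = int(match.group(1))
            # Numbers missing from the small table are skipped in this sketch.
            if number in SYSCALL_NAMES:
                names.add(SYSCALL_NAMES[number])
    return names


def extend_profile(profile, extra_syscalls):
    """Append an allow rule for any syscalls the profile does not already allow."""
    allowed = set()
    for rule in profile["syscalls"]:
        if rule["action"] == "SCMP_ACT_ALLOW":
            allowed.update(rule["names"])
    missing = sorted(set(extra_syscalls) - allowed)
    if missing:
        profile["syscalls"].append({"names": missing, "action": "SCMP_ACT_ALLOW"})
    return profile
```

Feeding it lines such as `type=SECCOMP msg=audit(...): pid=1234 comm="ls" arch=c000003e syscall=42 compat=0` would add `connect` to the allow list, which replaces the hand edit of the JSON file.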
So I just added those rules to my seccomp filter, and voila. It just took those rules, added them in, and allowed me to run. And here are the original rules versus the new rules added to them, and that's it. So the idea here is that we could take this tool and start to run it, say, in your CI/CD system. You might want to run all your tests, the full test suite, on top of an individual container and then generate your seccomp rules based on that. Then when you're shipping it, you would ship that seccomp filter, and you can use it with any container runtime. I have one minute left, holy smokes. Okay. So anyways, you could ship that seccomp filter. Obviously there are going to be problems, and that's why we turn on the auditing. For any denied syscalls, you can start to monitor those audit records to see if you're getting any denials. But this gives you the potential opportunity to generate syscall filters with a very small footprint.
A couple of problems with seccomp. A lot of people aren't using seccomp right now because they think it is slow, but we actually found that, for some reason, libseccomp by default turns on the Spectre/Meltdown protections, which is a huge performance hit, like a 25% performance hit. So a lot of people are saying that seccomp is slow because of this side effect. With the latest code, which we just got into the OCI upstream, we can specify whether we want the Spectre/Meltdown protections turned on, and we can turn on the auditing as well.
So real quick, I'm going to ask for questions, and anybody who wants to talk about friendly EPERM, I'd love to talk about that, but I'm out of time. Any questions? Yes.
I guess when you run an application, you are not going to take all the possible code paths, so there is always possibly going to be a syscall that you miss.
Correct.
So the question is basically that there are always going to be additional syscalls. And the funny thing is, with containers it's less likely, but traditionally in SELinux it was more about people modifying, say, the nsswitch files, and all of a sudden they add this full LDAP stack that gets pulled in because I did an ls -l. So I'm calling out to the DNS resolvers and all this stuff, and there are all these potential side effects. Yeah, it's a fundamental problem as we try to tighten the security on things: depending on the different code paths, there are potential issues. But one of the things we did with SELinux, and maybe we'll do this eventually, is that there are certain actions that we know about. We'll see one access, and all of a sudden we know that a read involves like five of them. It's like lock, getattr, read. So as soon as you see a read, we should just instantaneously give it all five. So we could start to build up an understanding of the system and add a whole group of syscalls at once. My goal is this: if I could get from 313 down to 200, that would be a huge improvement. But I do take the risk that suddenly changing the configuration is going to cause problems. One of the reasons I wanted to talk about friendly EPERM is exactly that.
So I'll just quickly go into friendly EPERM before he throws me out of here. Friendly EPERM was actually a proposal we had back in 2010. We saw this coming. Right now when you run a process on a Linux system, there are about 10 different ways you can get permission denied, right? You can get permission denied by seccomp. You can get it from the user namespace. You can get it from, really, Unix ownership and permissions, which is called discretionary access control. You can get it from SELinux. You can get it from AppArmor. You can get it from like five other LSMs. The tools basically just get permission denied.
You as a user, what do you do when your application gets permission denied? Right, okay, sudo. That's one thing you do. What's the other thing you do if you're running containers with Podman? --privileged, right? Right, so we instantaneously turn off all of the protections. And the reason for that is that you can't figure it out. Even if you contact me, and I pretty much know the reasons why you might get permission denied, I can't figure it out without making you go and do like 10 different rock fetches. Okay, turn off SELinux. Okay, nope, that didn't do it. Turn SELinux back on, turn off seccomp. No, that didn't do it, turn off capabilities. Start adding certain capabilities. Who knows why you got permission denied? The frigging kernel knows. Okay, and the kernel ain't telling you. Or if it's telling you, it's putting it in some random place in the operating system that you don't know about. SELinux issues, and now seccomp issues, go to the audit log. So Apache gets an error, permission denied. It can't write into its log file, "I'm not allowed to do this because SELinux blocked me", or "I'm not allowed to do this because I don't have the capabilities". So we have this fundamental problem with security, and the only option people have is to turn it off.
So we proposed friendly EPERM years ago because we wanted the frigging kernel to tell us why it's giving us permission denied. We wanted to allow Apache to go to the kernel and say, why did you just deny me that? And it's inherently racy. The only thing the syscall can return to you is EPERM. So originally we were saying, well, maybe we could return a secondary thing, a little tag that says why. In the second phase, we wanted to go to the system, to the /proc filesystem, and say, give me a /proc status: why'd you give me a frigging EPERM?
And it would come back with an SELinux line saying SELinux blocked it, and now you could write that to your log file. And that's racy, because when I'm asking /proc, I could get permission denied too. So we went back and forth on this with Linus for a while, and Linus finally told us to get lost. And now we come to eight or nine years later, and it's 10 times worse. Now everybody that's running containers is facing this problem. So with the stuff that was shown with Inspektor Gadget yesterday, or with this feature, my goal now is to get to the point where we potentially could use eBPF to go to the kernel and ask. So you could run a container, get permission denied, and say, okay, let me set an annotation, run the container again, and have the kernel tell me why I'm getting permission denied, and you could actually figure out what's going on. Right now, as I said, there you go. So anyways, yes. When's the next talk?
As a developer on some of these things, are we going to get a friendly EINVAL as well?
Yeah, well, EINVAL and EPERM are two ways of saying the same thing. But yeah, the goal would be that, I don't know if it goes through Inspektor Gadget, but basically this tool, and again, this is in my brain, not in any sort of reality, would allow you to specify whichever errno from a syscall you want, and ask the kernel, why did you give me that errno, and have the kernel reveal some information. Last question? Yeah.
Would it make sense to rely on static code analysis to figure out which seccomp profile you need, as an answer to having to trigger all the code paths?
I mean, we haven't talked about that from a syscall point of view, but that used to be asked all the time with SELinux. The problem is that you can look at your code until you're blue in the face, but you have to basically go and analyze all of glibc.
I mean, maybe if you have a statically linked program, that might be possible. But with the issues in SELinux, we kind of knew what the application was doing. It was when people would change the underlying tooling on the system that you would end up with problems, just by, as I said, setting up NFS. Put an NFS home directory on there, and all of a sudden the whole world changes. You don't have Yellow Pages anymore, but Yellow Pages used to be like, everything's open? Yeah. All ports, all things, so it's tough. Okay, thanks for having me.
Thank you a lot.