Hi everyone, I'm Divyansh. This summer, as Dan told you, we worked on generating seccomp profiles for containers. Does everybody know what seccomp is? Okay. For those who don't: seccomp is a Linux security facility originally designed to limit the attack surface of the kernel. You use it to specify which syscalls a process may make. Initially it permitted only a small fixed set of syscalls, but it was later extended with BPF (Berkeley Packet Filter), so you can specify exactly which syscalls a process is allowed to call. That makes it really helpful for cutting down the number of vectors available for exploits. All the container engines ship with seccomp support and a default seccomp profile. The goal of our work is to generate a seccomp profile by tracing all the syscalls a container makes, so that we can tighten the profile. The default profile that ships with the container engines was created a few years ago by Jessie Frazelle, and it blocks only 44 syscalls, which is quite loose for many workloads. According to a report by Aqua Security, a container typically needs only 60 to 70 syscalls, so we don't need 300-plus syscalls for the job.

In our initial investigation we proposed using ptrace, which is an obvious choice in some respects, since strace uses ptrace to trace syscalls. So what exactly is ptrace? ptrace is a syscall used by debuggers and tracers that lets a tracer observe or control a tracee: its registers and other state. On x86-64 systems you can read the RAX register to see which syscall a process invoked. The problem with ptrace is that the tracee stops every time the tracer wants to observe it; after every operation it has to stop.
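To ground the discussion: the seccomp profiles that container engines consume are JSON documents. A minimal allow-list sketch, with the syscall list heavily abbreviated (a real profile allows many more syscalls), might look like:

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

Anything not in the allow list fails with an errno instead of reaching the kernel, which is why a generated profile has to cover every syscall the workload legitimately makes.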
So it's quite slow, and we may miss time-based decisions. Also, we can't really trace the calls made by runc using strace, and after the seccomp profile is applied runc still makes some syscalls, so we need to trace those as well. So we scrapped that idea.

Next we tried the audit log. In May there was a pull request in runc that added the SCMP_ACT_LOG filter action to the available seccomp actions. With that we could log every syscall made by a container. That works for a single container, but the problem is that if you run multiple containers in parallel, you can't figure out which process is inside which container; the log entries are indistinguishable. With ptrace we could work that out, but it's not trivial and we could run into race conditions. We could try to do it with a container ID, but the Linux kernel doesn't recognize containers as a separate entity; to the kernel, a container is just processes. So we scrapped that too.

The next thing we looked at was BPF. BPF, or eBPF (extended Berkeley Packet Filter), was initially used to filter packets inside the kernel, was later adopted by seccomp to filter syscalls, and was then extended with the ability to inject user-defined programs into the kernel, which is really cool. An eBPF program is attached to a designated code path, which means you can set up certain events, for example tracepoints or kprobes, that trigger certain functions you can use to gather data or act on it, in kernel space or user space. So how do we trace syscalls? As I said, you can attach to tracepoints inside the kernel that execute certain functions. The tracepoint we use is sys_enter, which triggers every time a syscall is made.
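To make the audit-log approach concrete: with SCMP_ACT_LOG, each syscall shows up as a SECCOMP record in the audit log carrying a syscall=&lt;number&gt; field. Here is a sketch of extracting those numbers (the sample record is illustrative; real audit records carry more fields and vary by distro). It also shows the attribution problem: nothing in the record says which container pid 4321 belongs to.

```python
import re

# Hypothetical SECCOMP audit record; real records carry more fields.
SAMPLE = ('type=SECCOMP msg=audit(1565000000.123:42): auid=1000 pid=4321 '
          'comm="ls" exe="/usr/bin/ls" arch=c000003e syscall=59 compat=0')

def syscall_numbers(lines):
    """Extract the syscall= field from SECCOMP audit records only."""
    nums = []
    for line in lines:
        if "type=SECCOMP" not in line:
            continue  # skip unrelated audit record types
        m = re.search(r"\bsyscall=(\d+)", line)
        if m:
            nums.append(int(m.group(1)))
    return nums

print(syscall_numbers([SAMPLE]))  # prints [59] (execve on x86-64)
```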
A function is attached to that tracepoint, so we can gather data about which process made the syscall and which syscall was called. So how do we identify which processes are inside a container? Every container has a PID namespace, and every process in the container shares that same PID namespace. We take PID 1 of the container, note its PID namespace, and later check, for every syscall made, whether the calling process is inside that PID namespace. That tells us whether it's inside the container.

The important question is: when do we start tracing? We need to start just before the container runs and just after the seccomp profile is applied, so that we don't miss any syscalls and don't pick up extra ones. We have to run a separate binary that communicates with the eBPF program, and for that we use OCI hooks. OCI hooks are a really cool mechanism that lets you run binaries during different lifecycle stages of a container. If you want to run a binary before the container starts, you can use the prestart hook; after the container has exited, the poststop hook. We use the prestart hook, which runs before the container's process but after the namespaces are set up, so we can start tracing. The catch is that the runtime waits for the hook to exit, so we need to create a separate process: the process started by runc or crun forks another process that traces the container, and the original process exits. The container's PID is provided by the runtime on stdin. But we can't start tracing right away, because during the prestart hook the container hasn't started yet and runc still has operations to perform, so we might collect extra syscalls from runc. So we need to wait for prctl.
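The PID-namespace comparison can be illustrated from userspace: a process's PID namespace is visible as the /proc/&lt;pid&gt;/ns/pid symlink, and two processes share a namespace exactly when those links match. The in-kernel eBPF program does the equivalent check; this is only a sketch, and it is Linux-only.

```python
import os

def pid_namespace(pid):
    """Return the PID-namespace identity of a process, e.g. 'pid:[4026531836]'."""
    return os.readlink(f"/proc/{pid}/ns/pid")

def same_pid_namespace(pid_a, pid_b):
    """True if both processes live in the same PID namespace."""
    return pid_namespace(pid_a) == pid_namespace(pid_b)

# A process is trivially in its own PID namespace:
print(same_pid_namespace(os.getpid(), os.getpid()))  # prints True
```

The hook applies the same idea: record the namespace identity of the container's PID 1, then keep only syscalls whose caller matches it.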
After the first prctl call, we can pretty much guarantee that the seccomp profile has been applied, and we can start tracing syscalls.

So, demo. (Is it visible?) The way we trigger our OCI hook is by providing an annotation. An annotation is a label we pass to the runtime that is used to trigger the hook, so the hook doesn't run every time we run a container. If I run this command, runc first starts creating the container, then stops and executes the prestart hook; the hook creates another process that attaches to the sys_enter tracepoint, and then the container starts running. There, the profile has been generated. If I use this profile, the container runs; but if I try to run something else, say ls -l, it fails, because ls -l requires some extra syscalls. One thing I can do is provide an input seccomp profile, which is used to expand upon a previously generated profile. Earlier I just passed in an output file; now I also provide an input file to expand on. What it does is take the input file and expand upon it: it adds the syscalls that were not in the input file, but it won't remove the syscalls that were already present. You could use this in a CI/CD system to automate your testing, so you can be really confident in the result. There, the second seccomp profile has been generated. I'll show you the difference: if you look at this profile, the part above is from the original seccomp profile and the part below is from the new one; these are the new syscalls added to the profile. And if I run it again now, it works.
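The input-profile behavior, adding newly observed syscalls but never removing ones already present, is essentially a set union. Here is a sketch under the simplifying assumption that the profile keeps a single allow rule; the real hook's on-disk layout may differ.

```python
import json

def merge_profiles(input_profile, traced_syscalls):
    """Union the syscalls of an input profile with newly traced ones."""
    merged = dict(input_profile)
    existing = set()
    for rule in input_profile.get("syscalls", []):
        existing.update(rule.get("names", []))
    # Union: add new syscalls, never drop ones already allowed.
    combined = sorted(existing | set(traced_syscalls))
    merged["syscalls"] = [{"names": combined, "action": "SCMP_ACT_ALLOW"}]
    return merged

base = {"defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [{"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"}]}
out = merge_profiles(base, ["openat", "read"])
print(json.dumps(out["syscalls"][0]["names"]))  # prints ["openat", "read", "write"]
```

Run repeatedly, with each run's output fed back as the next run's input, the allow list only grows, which is what makes the CI/CD loop described above safe.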
This is the repository; you can take a look at it and play with it. Thank you. Questions?

Okay, so: you need to run as root, because you need CAP_SYS_ADMIN to inject eBPF programs into the kernel. So it can only run as root.

In our demo we implemented this using the PID namespace. You could also do it using cgroups, but the cgroup support we would need was only added in Fedora 31; it wasn't there before. Other than that, does anybody have a question? Yes. Could you repeat that? He's asking: if you had another container running on the system, would you accidentally get the syscalls from that container? How do you make sure you're getting them from one container and not the other? You won't have that problem, because we create a new tracing process for every container, and each process traces its own PID namespace, so there's no conflict. If you had two containers running with the hook at the same time, they would have different PID namespaces; when the hook starts up, it figures out the PID namespace of the container it's going to trace and then makes sure that any syscall it records matches that PID namespace. Any other questions?

The question is: how can we be sure that we go through every code path? We can't be sure; you have to test it thoroughly. That's why there's an input file: if the profile fails, you can expand on it, and you can put this in a system where you try again and again until you're really sure it's right for you, because you can't really be sure you've gone through all the code paths. You could also run the container in production with the hook running, without enforcing anything, and continually gather data, and after a couple of months you might decide, well, it's been running fine for two months.
Then I'll take the current seccomp profile it has generated and make it enforcing.

Sorry, I can't hear you. Impact on performance? We'll let the kernel performance team sitting in the front answer that: using eBPF, how much is that going to cost us in performance to watch a container? So, Shaq says it has no performance overhead at all. Again: no performance overhead. Since we're in the kernel, I don't think we can do any better than that if we want to trace. And that's exactly what seccomp itself does: it's basically a BPF filter running in the kernel that prevents the execution of certain syscalls. So in a sense we use an eBPF program to, in the end, create another one.

Some people ask why we aren't doing static analysis. The problem with static analysis is that it's specialized: if you know what you're running, which tools, which programming language, you can analyze the binaries and look for the syscalls they make. But we wanted something generally applicable, something easy to use and to trace with. In that case, as Divyansh said before, you need to run it for a while in production and try to exercise as much as possible until you're really confident that it's working.

A couple more things. One thing to keep in mind is that even if you have a multi-process or multi-service container, this is going to collect the syscalls for all of them. I think we're just about out of time, or at least she's getting the sign ready. I'd just like to point out that he's only a freshman in college, so: pretty good work, okay?