Hello, welcome to KubeCon. Today I'm going to be talking about seccomp and what it can do for you: what we've been doing with seccomp in the container ecosystem, what it is, why it matters, and what work there is still to do in the future. I'm Justin Cormack, an engineer at Docker in Cambridge; I'm also on the CNCF TOC and a Notary maintainer, and I'm very much involved in the container security ecosystem. So, what is seccomp anyway, and why does it matter? Seccomp stands for "secure computing", which sounds like a really great thing; it's a bit of an ambitious name, perhaps. Back in 2005, when it was first created, it was an extreme sandboxing method, at a time when extreme sandboxing wasn't that common. It was really just for code that only does computation: a sandboxed process could read and write from existing files and exit, and that was about it. It wasn't used very much, because there aren't many programs that literally just read and write from files and never, for example, make a new network connection. So, in 2013, a more general version was introduced, called seccomp-bpf, though it's usually just called seccomp because it's the most commonly used version. This lets you load a small BPF program. BPF is a technique originally used in the network stack; you've probably heard recently about its grown-up extended version, eBPF, but BPF here is the original, simple version. It lets you write very simple programs that decide whether system calls should be allowed, and, if they're not allowed, whether they should return an error, be logged, or kill the process making them. Effectively, because system calls are the interface between applications and the kernel, this is a method of controlling what programs can actually do beyond just computing.
So, it controls what kind of interaction a program has with the outside world, which sounds really useful. In theory, you can look at what the program is doing: system calls are what strace shows you, and I've shown the strace output of just running ls. For each of these system calls, you can say "yes, that's fine", or "no, you haven't got permission to do that", or you can pretend the system call doesn't exist at all (ENOSYS), or various other actions. That's the theory. In practice, it's not quite like that. In our example, when the program opens a file, your seccomp filter unfortunately doesn't get to see what the filename was, because it only sees the direct arguments of the system call, and the argument to an open system call isn't directly a string: it's a pointer to a string. All you get is the pointer, and you can't follow the pointer to see the name it points at. So you have to make decisions based on quite limited information compared to what you might want, which is limiting. You also can't know what kind of file descriptor is being used when someone does a read: it could be a read from the network or from a local file, and you might want to allow one and not the other, but you can't tell them apart. There are other peculiarities too. And you don't get any context: your filter runs on every system call and can't keep state in between, so you can't easily express rules like "you can do this after you do that, but not before". So there are some serious limitations. Now, the history of applying it to the container ecosystem. Back in early 2016, one of the first things I worked on when I started at Docker, with Jessie Frazelle, was adding seccomp support to Docker, and it was enabled by default, which was a nice thing to have done, and most people don't disable it.
So, most people have had the benefit of using it since then. Kubernetes spent a long time working out its implementation, and only in 1.19, so not very long ago now, finalized the API, and it does not enable it by default. It's still somewhat complicated to manage in Kubernetes, because seccomp profiles at the moment tend to be very long files: the Docker default one is 800 lines long, and apparently 800 more lines of YAML was seen as too much. So you have to work out how to distribute files with this configuration if you want to customize it. Or you can just use the RuntimeDefault setting, which gives you something pretty much like the Docker setup, as provided by the runtime: it'll be the Docker one if you're using Docker; if you're using containerd or CRI-O it'll be similar but, due to maintenance drift, not exactly the same. In the OCI-style hierarchy we have in the container world, with Kubernetes, then CRI, then the runtimes, it's a complicated setup, and we'll talk about this a little more later, but there are a bunch of abstraction layers this goes through. Effectively everything is passed all the way down to runc, which calls the Go binding to libseccomp, which is a simplified version of what the actual seccomp BPF looks like. And those libseccomp calls are generated from a JSON config that is, again, an abstraction over the kinds of rules you can use with libseccomp; and the runtime might have its own form of the JSON with different runtime customizations. So it's a messy process of converting JSON to JSON to Go to C to BPF running in the kernel. So what's the point of all this? What are we actually trying to achieve? I think it's very important to understand that.
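For reference, the JSON that gets translated down through this stack looks roughly like the following. This is a heavily trimmed sketch in the shape of the Docker/OCI seccomp profile format, not the real 800-line default file:

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86"],
  "syscalls": [
    {
      "names": ["read", "write", "close", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

The real profile is an allowlist like this one: everything defaults to an error, and several hundred syscall names are individually listed as allowed, which is where the 800 lines come from.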
There are some system calls in Linux which are not really considered safe for isolated programs to use. Some of these have a very, very large attack surface, and there have been a lot of CVEs around them; perf_event_open is one of those, and user namespaces and BPF we'll talk about a bit later. These are just very large subsystems that, in general, most containers don't actually need to use. Often they are used by runtimes and other system software rather than by end-user applications. And there have been a lot of CVEs that basically meant you could escape from a container if you could access these syscalls, so blocking them has actually proven useful; we'll talk about specific examples of the benefit later. Some syscalls can disable security features, such as ADDR_NO_RANDOMIZE, which disables ASLR, address space layout randomization, a security feature that was added for good reasons. Applications can simply turn it off, which is really unhelpful. And some things are obsolete: there's the sysctl syscall, as opposed to using /proc/sys, which in Linux has been deprecated for some decades now, has some attack surface, and isn't really maintained. Some distros have removed it now, I think, but not all of them. And then there have always been some things that are not namespaced: time namespaces are very, very new, they only came out a few months ago, and keyring namespaces don't exist yet. So there's a bunch of stuff that, in a container system, it makes sense to simply remove applications' ability to use. So, what have we succeeded in doing with the seccomp subsystem? User namespaces. There's this quote from Andy Lutomirski, who's one of the kernel maintainers, basically saying that the huge attack surface exposed by user namespaces is a huge risk, and that if unprivileged users can program iptables, there are bound to be some privilege escalations.
He said this quite a few years ago, back when this functionality was new, and shortly after that, I think the same year, we saw CVE-2016-3134 and a bunch of related CVEs, where indeed the iptables code had some bounds checks missing, and you could exploit this to get a fully privileged container escape. Normally this needs CAP_NET_ADMIN, which is not granted, so normally it's safe; but with user namespaces you get CAP_NET_ADMIN inside your user namespace and can call these commands there. You can't change the root namespace's iptables rules, but you can compromise the kernel, so it doesn't really matter which namespace it's in. This was mitigated by Docker's default policy, so users using that were not affected. More recently, the BPF verifier, which again is a new feature for extended BPF, had some bounds checks on 32-bit operations that were not enforced, and you could read and write kernel memory, which basically means you control the entire host. Again, an unprivileged user with access to the bpf syscall could do this, and again the Docker policy blocks use of bpf by default unless you actually grant CAP_SYS_ADMIN, which is basically privileged access anyway. So this sounds good. What went wrong? Actually, we caused a lot of problems for users during the last five years with seccomp. I had a bit of a war on Emacs: I stopped people running Emacs in containers with seccomp enabled for many years. This was a really strange story; it really surprised me when the complaints came in about this quite early on. Emacs did this very, very strange thing that a lot of people didn't like for other reasons anyway; the musl libc maintainer was against it because it didn't work with musl libc either.
Basically, in order to make Emacs start up faster, at build time it would run the binary, dump its memory, and then, instead of re-running the code that generated the initial setup, just load that memory snapshot. The way it did this was unusual: it required memory to be loaded at exactly the same addresses as before, and with ASLR that's not the case, because addresses are randomized, so it wouldn't work. So Emacs disabled randomization, and that was one of the things we explicitly blocked, because it allows applications to bypass a security mitigation. Eventually the Emacs developers realized that computers were fast enough to just run the startup code normally, like a normal application, and stop being quite so weird. So this problem has gone away, and now you can happily run Emacs in containers. But really, I just felt it was not worth changing the default policy, reducing security for everyone, just so that Emacs could run its weird startup in a container. Worse than breaking Emacs, though, I also broke Steam. This was not on purpose and not something I wanted to do. Linux has made a bunch of changes to the 32-bit syscall ABI, and Steam happens to run 32-bit binaries and is widely used in containers. This was something that happened really quite early on. Debian, I think, has a habit of doing these things first: it changed from the old socketcall syscall, a weird multiplexed syscall that could do socket or bind or connect or any of the other socket operations, to separate syscalls, and we hadn't allowed for this change; I think Debian did it early. They did the same thing with 64-bit time support on 32-bit systems.
Again, they switched early, before it was officially upstream, and these syscalls were all temporarily blocked by seccomp until we fixed the problem. So this is a real fragility issue with seccomp: because it requires exact syscall lists, when some new set of syscalls comes along that suddenly everyone starts using, you really have to update the code quickly, which is problematic. Apologies to the Steam users. There's also a performance issue. There are actually a lot of rules, because we list the syscalls you can use rather than the ones you can't, and the list is very long and not processed terribly efficiently, for reasons that are mostly somewhat fixable but require a lot of work. Only really I/O-intensive applications will notice this, so very few people complain, but a few have, and they've generally disabled seccomp rather than actually fixing it. And then there are some interesting security issues where seccomp didn't help at all and we did nothing for users. One of them is probably my favourite kernel CVE, which Jann Horn found. It's a really interesting security issue: a cache invalidation bug. Basically, there was a 32-bit counter, and if you did the right thing at the point at which the counter wrapped around back to zero, you could exploit the kernel and escape your container. And all you had to do was some memory mapping and some cloning of processes, which is all totally normal stuff that we couldn't possibly block with seccomp. There was just no way we could protect against this kind of thing. Eventually it was fixed by changing the counter to 64 bits. 32 bits is too small for any kind of security: you can always overflow a 32-bit counter, but overflowing a 64-bit counter is essentially impossible because it's so huge.
Actually, it's still a really interesting CVE and worth looking at, but it was hard to exploit without some additional source of information to know exactly when you'd hit the conditions for the exploit. So what we actually fixed was an information leak that had made it relatively exploitable in containers; the fix was about the information leak rather than anything seccomp-related, and seccomp could definitely not protect you against it. The question now is: should we be using seccomp in this way in the container ecosystem at all? Why is the container platform responsible for the poor state of Linux kernel security, and for the fact that there are container escape vulnerabilities in Linux? Why isn't that the kernel's problem? Generally, I think the answer is that we want efficient isolation without resorting to virtual machines for everything. The number of container escape exploits has actually not been too bad over the years, and for most people this level of security is kind of fine. Also, most of our applications don't use the whole Linux syscall space. Most applications use a narrower subset that doesn't include the sort of specialized things you get in Linux: most application code doesn't run BPF, doesn't use user namespaces. Those things are used by security-critical applications, and often by control-plane software, but end-user applications basically just use networking and storage. Some people would say applications should just use the POSIX subset and the Linux syscall space is way too expansive. There are arguments about where the boundary of what normal applications should care about lies, and it does change over time, with performance reasons for using different system calls and so on.
But there is a set of things that most applications don't use, and it's sensible for us to isolate those off for security, because the common system calls are, most of the time, other than that CVE I just pointed you at, generally safe. Seccomp was designed on the assumption that every application would write its own profile, but this is really, really difficult for users to do. It was not designed for platform administrators, and if you read the documentation, we're kind of doing it wrong in the container space; but it's simply too difficult for end-user applications to use. You only find very specialist applications, things like Firecracker and a few others, using it; the number of general applications that actually ship seccomp profiles is really small, and it's incredibly difficult to use for that purpose, so I'm not really surprised. Now I'm going to go through the things we could do, choose-your-own-adventure style: what future paths could we take, and what should we do in this space? I think it's definitely the case that things need doing; I'll talk later about whether they will be done. One option, almost the status quo really, is that almost no one will use seccomp, especially as with Kubernetes it's optional. There are a few large companies I know of who take it very seriously and think it's important. Docker users get it by default, but people are gradually shifting over to using Kubernetes directly and the like, and even if you're using Kubernetes with Docker, Kubernetes disables the Docker seccomp policy. Docker is mostly a development platform now, so I'm not sure it makes sense for Docker to enforce it anymore. If you're not going to use it, I recommend you update your kernel weekly. That's a burden; maybe using seccomp means you can do it less often. Maybe you get a higher rate of zero-days, but the rate is relatively low; maybe you can live with it.
I suspect that a lot of people are going to just continue to ignore it and live with the vulnerabilities. Another option: we could rationalize the policy. We went for an allowlist, not a blocklist, at the beginning. It's the recommended approach with seccomp, and with most security things: you know what's safe, you list what's safe, and everything else is blocked; so if a new dangerous syscall is added, and arguably new syscalls do have security issues more often than old ones, then you're safe. However, the list of things we block is now quite small, so writing a block policy is much easier, and it's easier to understand. It's also less likely to break something when new safe syscalls are added, like the time64 ones for 32-bit systems. It turns out that new safe syscalls get added a lot, partly because of silly things like there not being enough flag bits allocated on existing syscalls, so new versions of syscalls with more flags are being added for everything. Block policies would be easier to understand, because you can see what they do rather than trying to work out the negative of what they do. They wouldn't be 800 lines long; they'd be maybe 10 lines long, so we could actually inline them in the YAML, which would obviously mean changing the Kubernetes seccomp format again. But I think this would be nice: a reviewer could see that "allow bpf" removes bpf from the default blocklist, and that would be the one line you need. Allow bpf, allow perf_event_open, that kind of thing. So I think that would be easier to understand, there would be less maintenance work, and we wouldn't get people complaining that things don't work and needing sudden fixes.
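In the same JSON profile format, a blocklist-style policy of the kind described here really could be this short. This is a hypothetical sketch, and the syscall names are just illustrative examples of dangerous defaults, not an agreed list:

```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["bpf", "perf_event_open", "keyctl", "kexec_load"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
```

Compared with the 800-line allowlist, a reviewer can see at a glance exactly what is blocked, and deleting one name from the list maps directly onto "allow bpf".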
A lot of these problems have been cross-architecture problems: architectures have different syscalls and different new changes, and people run a distro that expects one kernel on another kernel, where it behaves differently. Those kinds of issues could be improved, and the error handling would probably be better too. The downside is that the default blocklist potentially gets very small as everyone decides these things are okay. But I think this is attractive and definitely worth considering. We have a huge problem in the Kubernetes ecosystem about whose problem this is. Should users really have to understand seccomp? No. Should applications have to? Too difficult. Should it be done at the Kubernetes level, where you have to configure it now, or should it be the responsibility of the CRI, or of the runc-type layer, the actual container runtime? Currently we're pushing responsibility up to the user, which is kind of terrible. Why don't we have runtimes that provide actual security guarantees, instead of just letting you configure it and making it your choice? We are starting to see some of these runtimes: arguably gVisor, which I'll talk about in a second, and the VM runtimes are trying to make better security guarantees. But why have we pushed down this whole idea that you list a bunch of syscalls, or syscall handling rules, in JSON, which is what we're doing? It's not a good design, and there's definitely a layering and responsibility issue that we need to solve. gVisor is a really interesting response to this. It basically re-implements large portions of Linux in Go; it has a Go TCP stack and everything. It basically says: well, Linux wasn't very secure, so we re-implemented it in Go, in user space, in a memory-safe language, and then we're going to wrap that up.
It uses seccomp internally, as an extra layer, because gVisor itself doesn't actually use many syscalls. It has a performance hit, and potentially a compatibility hit, but it just cuts all these security issues out of Linux. It's a really interesting solution and definitely worth looking at. Then there's something I call the Lambda-like solution. AWS Lambda kind of solved a lot of these problems by being a very restricted runtime. It's strict. People don't think of it as a container platform, but it's a very similar problem space. It uses seccomp; I haven't actually probed its policy to see exactly what it does and doesn't allow. It has a custom Linux kernel with features removed, which is what a lot of people who run secure systems do: just disable a lot of parts of the Linux kernel. Linux has thousands and thousands of subsystems that are not generally very secure, and you can often access them from user space by opening weird kinds of sockets that you don't really use much in practice, things like that. But the Linux distros are very general-purpose, and they tend to ship a kernel that does everything, has everything as a module, loads anything, because, being general-purpose, you might want to do anything. Not many people lock down their kernel configs. Again, you should probably consider doing this, but everyone kind of lives with their vendor configs for support reasons and the like, and the vendors are not necessarily acting in your security interests all the time. Also, in Lambda, no application can run as root, full stop; it just doesn't allow it. In the container space, we haven't forced that; again, it's left to the user to enforce. There's a restricted runtime API, most of the filesystem isn't writable, things like that, in Lambda. We could do something very much like this.
We could have a container runtime that made these choices, with a clear delineation of what you can and can't do, a security model, testing, things like that. In a way, the sandbox flag proposal for Kubernetes is kind of like this, but it doesn't define any kind of specification: some things can be sandboxed and some might not be, and the runtime decides what sandboxing means, but there's no normal Linux runtime that makes these decisions. In effect, things like firecracker-containerd do effectively make those decisions for you, ish, but there isn't a normal container runtime that does. As of Linux 5.7, something we've been talking about for a really, really long time got merged: the eBPF LSM. LSM is the Linux Security Module framework; SELinux and AppArmor are LSMs, and they give you a general-purpose way of configuring the security hooks in Linux for general-purpose systems. The eBPF LSM basically lets you inject code at each of the decision points in the kernel where it decides whether a user can or can't do something. There are a lot of these points, many more than just at the syscall boundary; they're in all sorts of places, and you get much more specific information about what's going on at these places than you do at the syscall ABI. You can run an eBPF program there that makes real programmatic decisions about whether this application can do this thing at this point; it can maintain more state and has much more information on which to base its decisions. This is not simple. As I said, it's potentially a startup-sized problem, or perhaps an NSA-sized problem: the NSA wrote SELinux in the first place and basically defined the shape of what this looks like, and the NSA is an organization that's interested in this type of problem.
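To give a flavour of the difference from seccomp, a kernel-side BPF LSM program attaches to a specific decision point such as `file_open` rather than to the syscall boundary. The following is a non-runnable sketch in the libbpf CO-RE style; it assumes a generated `vmlinux.h` and the usual libbpf build setup, and the policy itself (denying opens of root-owned files) is purely illustrative:

```c
// Sketch of a BPF LSM hook; requires vmlinux.h from the target
// kernel plus libbpf, and a kernel built with the BPF LSM enabled.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

#define EPERM 1

char LICENSE[] SEC("license") = "GPL";

SEC("lsm/file_open")
int BPF_PROG(deny_root_files, struct file *file)
{
    /* Unlike seccomp-bpf, this hook can follow kernel pointers:
     * here we read the owning uid of the inode being opened. */
    unsigned int uid = BPF_CORE_READ(file, f_inode, i_uid.val);

    if (uid == 0)
        return -EPERM;  /* veto this open */
    return 0;           /* allow */
}
```

The point is not this particular policy but the shape of it: the program runs at the LSM hook, sees real kernel objects like `struct file` rather than six opaque argument words, and returns an allow/deny decision per operation.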
But yeah, it's the sort of thing you could potentially do with a medium-sized team working on the problem for a few years. This is very much a technical solution, though. It doesn't solve the human problems: what kind of policies do you actually need to enforce, what kind of model have you got here, and how does a human communicate intent about this problem, things like that. So there are still a lot of human problems to solve. So, what is going to happen with seccomp? My prediction, looking at the state of the container ecosystem now, is that there's a continuing lack of investment in the low levels of the stack. There are really not many people working on these problems, and it's not clear who will; most people seem to expect someone else to do it and don't get involved themselves. I think an eBPF LSM container security startup would probably be quite easy to get funded, but the other options might not even happen. The serious service providers, the cloud providers, are basically just using VMs, which is why you're seeing quite mature things like Firecracker. The problem is that most other users of containers are running their containers in VMs already: either VMware on-prem or cloud provider VMs, not cloud provider bare metal or other bare metal. So most of them don't have the option of using VMs for containers at this point. Even though that stack is becoming relatively mature, most people are simply not using it. So we'll probably see a split where more and more people use cloud provider services, which will just use VM-based containers: the Fargate containers for Kubernetes on AWS, things like that, which are all VM-based and scale as containers rather than as hosts. We're not seeing security vendors solving this type of problem, and there are lots of reasons for that.
Thomas Dullien has done a bunch of interesting talks on it; talk to me if you're interested in the question of why the security industry isn't interested in this kind of problem. And end users of Kubernetes are finding it difficult to contribute back to this sort of problem, because these problems are technically difficult: there aren't many people who have the right kind of expertise around the Linux kernel and how things actually work in container runtimes, and who also have time to work on these problems rather than other pressing ones. It's difficult. Even very large end users tend to have very little Linux kernel expertise; they tend to work mostly higher up the stack. So at the moment we're not really seeing much end-user contribution to solving these problems. So I'm actually not optimistic that a lot of this will happen quickly: it's taken a long time to get where we are, and there's been little investment in it. So, thanks very much. I'll be around to answer questions, and I hope you enjoyed the talk. It's been fun working on seccomp, but there are a lot of interesting problems here that could be solved and aren't being. If you're looking for a fun problem to work on, this space is absolutely open.