As James said, this is a second update. Looking through the dates of the Linux Security Summit from last year to this year, it looks like this covers basically 4.3, which hadn't yet been released but which I talked a little bit about last year, through 4.8, which, again, hasn't quite been released but has some interesting things in it. The slides are there if you want to grab them. They're also up in the various places to get slides.

So if you're not already familiar with seccomp, it's basically a programmatic way for processes to reduce the attack surface available against the kernel if they worry that they're going to come under attack. It's used by a ton of stuff. The list is getting longer every day. Probably the easiest interface for working with it is libseccomp, which will do a lot of things for you, as it can sometimes get complex if you have long or complicated filters.

So stuff in gray is in the past, but I like to keep it up as a reference. Over the last year, we've gained more architecture support. PowerPC, Tile, User-mode Linux, and PA-RISC all went in. Some of these required some extensive changes within the architecture just to support being able to change syscalls, which actually gets us to the regression tests. There's a full regression test suite to verify that everything does actually work as it claims to, because things can get very strange when you're trying to change or block syscalls for a given architecture, because syscall entry is very architecture specific. So in the last year, we've added test support for the architectures that we just added support for, which makes sense, except I think Tile support for the regression tests is missing, so I'm not sure what happened there.
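For readers new to seccomp, the following is a minimal sketch of what an allowlist filter decides at syscall entry, and why the architecture has to be checked first. It is a toy model in Python, not kernel code, not BPF, and not the libseccomp API: `SeccompData`, `make_allowlist`, and the string results are hypothetical names chosen for illustration. The `AUDIT_ARCH_*` values and the x86_64 syscall numbers are the real ones.

```python
# Toy model of a seccomp-style allowlist filter (illustration only).
from dataclasses import dataclass

AUDIT_ARCH_X86_64 = 0xC000003E  # value from <linux/audit.h>
AUDIT_ARCH_I386 = 0x40000003    # value from <linux/audit.h>

# A few x86_64 syscall numbers. On i386 the same numbers mean other calls
# (for example, 1 is write on x86_64 but exit on i386).
X86_64_NR = {"read": 0, "write": 1, "exit": 60, "exit_group": 231}


@dataclass(frozen=True)
class SeccompData:
    """Hypothetical stand-in for what a filter gets to look at."""
    nr: int                            # syscall number
    arch: int                          # AUDIT_ARCH_* of the calling convention
    args: tuple = (0, 0, 0, 0, 0, 0)   # raw register values only


def make_allowlist(arch, allowed_names, table):
    """Build a filter that allows only the named syscalls on one architecture."""
    allowed = {table[name] for name in allowed_names}

    def run_filter(data):
        # Check the architecture first: a number is meaningless without it.
        if data.arch != arch:
            return "KILL"
        return "ALLOW" if data.nr in allowed else "KILL"

    return run_filter


flt = make_allowlist(AUDIT_ARCH_X86_64, ["read", "write", "exit_group"], X86_64_NR)
print(flt(SeccompData(1, AUDIT_ARCH_X86_64)))  # write on x86_64
print(flt(SeccompData(1, AUDIT_ARCH_I386)))    # same number, other architecture
```

Under these assumptions the first call prints `ALLOW` and the second prints `KILL`, since number 1 would be a different syscall under the i386 convention. Note that `args` holds raw register values only; the filter never follows a pointer.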
One funny thing that I noticed in the commits, which didn't register at the time when they were going in, was that for adding User-mode Linux support, the test framework was using ptrace register manipulation as part of its tests, and this wasn't particularly well suited to user mode, so Michael rewrote how the test manages the registers. And then when PA-RISC went in, they said, well, we don't have this register support for ptrace in our architecture at all. It has nothing in particular to do with seccomp, but we just want to make sure we can run the tests, so we're going to add the entire infrastructure for dealing with these registers for PA-RISC as well. So not only did the seccomp support go in, but a whole section of ptrace support went in too, which is nice. And I'll talk about this a little bit more in a couple of seconds here, but they also added tests for ptrace interactions. Normally seccomp has a hook for dealing with ptrace, but you can also be ptracing with separate hooks, and defining the relationship between seccomp and ptrace started to become important, but I'll get to that in a second.

So last year I talked about split-phase internals, and now they're gone. This was added in 3.19. The main issue is that a lot of architectures have a fast path for syscall entry and a slow path if they need to do more work with more registers, or if they're going to go do ptrace work, because you need to set up an entire way to interrupt the syscall entry. And Andy Lutomirski looked at this on x86 and said, man, we're hitting the slow path so much; even if I have a filter that has no tracing, I should be able to get through this really quickly.
So he spent some work splitting up the logic of how seccomp works, and that was nice, except then he went on and cleaned up the x86 slow path to be much faster, and then the gain sort of got lost in the noise. And when I did experiments on ARM, trying to do a split phase to speed things up there, it really didn't gain all that much. So ultimately, given the complexity that this added to seccomp, which strives to be very, very simple so it has fewer chances for bugs, we ripped it out: it hadn't been used by any other architecture since it was introduced in 3.19, and when I tried it with ARM it didn't really look all that great. So I think probably the better approach for people interested in speeding up seccomp filter processing on ARM is to look at the BPF JIT that's in ARM. It could probably use some updating as well, since I don't think there's an eBPF JIT right now in ARM.

So getting on to ptrace ordering. As originally designed in seccomp, the idea was you could have a monitoring process that was ptracing the filtered, confined process, and you might block a syscall and then the monitor would decide, well, actually I want to change it to some other syscall and I'll have it do that; even though that syscall is normally blocked by the filter, I want to actively bypass the filter and do this. That feature was never used as far as I can see or find or ask about. No one said that they were using it, and I haven't found any counterexamples yet. Unfortunately, this left a gaping hole for what are effectively launchers: containers, or an init that wants to set a filter on an entire tree of processes. The result is that if you're under a filter, you can spawn a child and it can ptrace you and just inject syscalls that totally bypass seccomp, which was crazy.
So I looked at this a little bit and I was going to do rechecking after ptrace, and Andy again spent a weekend convincing me that if I move ptrace ahead of seccomp, there's no change in the attack surface, because I was kind of paranoid about, oh my God, there's so much code in ptrace. I don't want all that running before I get to seccomp, because if there are bugs in there, we're all doomed and seccomp has been bypassed. And he finally, you know, broke through my dense opposition and pointed out, yes, but the ptrace attack surface is exposed if any syscall is exposed. So unless you have a filter that blocks all syscalls, you already have full exposure to ptrace. It's like, okay, that's actually a completely good point. So we reordered this.

So ptrace now occurs first, and one of the main benefits actually is that with normal tracing, like if you just strace a running process that has a filter, you will actually see it start a syscall and then die. So you actually go, oh, I don't have the correct filter for that syscall, rather than it going along and dying and you having no idea why and having to do a lot more work to figure out what's going on. And in the stranger case of SECCOMP_RET_TRACE, which issues a ptrace interrupt in the middle of filtering, if the tracer decides to change any of the syscall parameters, I mean, we don't know if it did yet, but once the ptrace returns, we basically just recheck the filter again so that you can't use seccomp itself to bypass the filter. So in theory, we've closed this hole, and now things are a lot more usable for containers and for launchers and everything else, because now there isn't a way to issue a syscall that has been blocked just by going off and using tracing to bypass it. This is nice. I'm pretty happy with it. It needs a little bit of documentation changing, but this is going into 4.8.
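The effect of the reordering can be sketched with a toy model. This is a simplified Python illustration of the two orderings as described in the talk, not the kernel's actual entry code: `syscall_entry_old`, `syscall_entry_new`, `make_filter`, and the use of syscall names instead of numbers are all hypothetical, and the SECCOMP_RET_TRACE recheck is left out.

```python
# Toy model of ptrace/seccomp ordering at syscall entry (illustration only).

def make_filter(blocked):
    """Return a filter that kills on any syscall name in `blocked`."""
    def check(nr):
        return "KILL" if nr in blocked else "ALLOW"
    return check


def syscall_entry_old(nr, seccomp, tracer=None):
    """Old ordering: seccomp runs first, then the tracer may rewrite the call."""
    if seccomp(nr) == "KILL":
        return ("killed", nr)
    if tracer is not None:
        nr = tracer(nr)  # the rewritten syscall is never shown to the filter
    return ("executed", nr)


def syscall_entry_new(nr, seccomp, tracer=None):
    """New ordering: the tracer runs first, so seccomp sees the final call."""
    if tracer is not None:
        nr = tracer(nr)
    if seccomp(nr) == "KILL":
        return ("killed", nr)
    return ("executed", nr)


def evil_tracer(nr):
    """A child that ptraces its parent and swaps a harmless call for open."""
    return "open" if nr == "getpid" else nr


no_open = make_filter({"open"})
print(syscall_entry_old("getpid", no_open, evil_tracer))
print(syscall_entry_new("getpid", no_open, evil_tracer))
```

In this model the old ordering prints `('executed', 'open')`, which is the bypass, while the new ordering prints `('killed', 'open')`. A tracer that changes nothing, such as strace, now also gets to observe a blocked syscall before the filter kills the process.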
There's one related piece to this, but on to other changes. The checkpoint and restore support for serializing filters was added, and Tycho talked a little bit about this yesterday. They had a lot of very interesting challenges about how to do filter restoration, because seccomp filters are a tree of filters that are shared by a tree of processes, so you have to be very careful about the restoration. But ultimately, this support exists, so if you've got CRIU in your kernel, you can now extract the filter from a running process and examine it.

And in 4.5, a corner case was noted by Jann Horn. If you have a launcher, something running as root, something running privileged, that sets up a filter on a process which then continues without root privileges, you're in the case where you have a seccomp filter that was installed by root, so the no_new_privs flag didn't need to be set, but there's a filter already. So if you come along as a non-privileged process and try to add another filter, the kernel didn't notice that you didn't have the no_new_privs flag set, because the original code sort of assumed that you had already gotten into the filter through some other checks. So the fix moved the no_new_privs check earlier, again closing a hole through which you could potentially end up bypassing your seccomp filter: if you wanted to attack some setuid process, you could set up a filter to do that if you had already been launched from a running process this way, which was relatively rare but was starting to become more common, especially with containers.

So that gets to things still wanted. I talked about this last year too: deep argument inspection is still heavily desired as a way to augment the type of confinement that seccomp gets you.
Right now there isn't a clean way to do this. You can look at the pointer value that came in for a syscall, but you can't look at what it points to, so you can't do pathname inspection at the syscall entry point, because the path is sitting in user space somewhere. And seccomp really doesn't have a clean way to handle this, because if you were to actually check the user-space memory, you would be racing against when it is later used by the syscall, so that's totally not a secure way to do the check. Additionally, you'd be reading it twice, once in seccomp and once in the syscall, so that's poor performance if you've got a filter doing that.

So some possible ugly solutions right now: you can have an LSM that is tied to seccomp, with seccomp flagging it and saying, hey, I just entered open, can you please run the LSM hooks now, because I know it's for the open. And since the LSM already has a copy of the information that was in user space, you avoid the race and you avoid the performance hit. And I'll take a step back. LSM hooks aren't tied to syscalls, because they're not about entry points; they're tied to objects. So for something like pathname checking, the LSM hook for file open applies to the open syscall, it applies to the access syscall, it applies in several places, and we don't want to expose the kernel-internal hook information to user space, because then suddenly we have to support any changes in that. So that needs to be mediated in some way by tying a syscall to the LSM hook. But again, this is ugly and painful. This doesn't look quite right.

Another solution is to redesign the Linux syscall interfaces to sort of be aware of cached copies of arguments. And this is also quite painful, for the same reasons you laughed, but right now when Linux performs argument handling it's very ad hoc. When a syscall starts it just sort of has a bunch of arguments and it will read stuff out of user space as it needs it.
Sometimes it only needs to read the first argument and then it'll fail out, or sometimes it'll do other things and come back and read a structure and parse it, and it just sort of does it as it goes. It doesn't take everything in and copy everything into memory, because that's pretty inefficient. So teaching the syscall interface about argument types that could be cached, so that maybe seccomp already copied something into some cache somewhere, and then later the syscall starts executing and goes, oh, I don't need to copy this, I already have a copy somewhere for me: adding that infrastructure and the type management associated with it also looks incredibly ugly. But right now these are the only two solutions that seem viable for deep argument inspection at syscall entry time.

So what you're saying is this is a bad idea?

Well, it's a painful series of solutions, but it's something that we need to get addressed in some fashion because of how a lot of confinement on other operating systems looks. Comparing with things like Seatbelt on OS X, it has very distinct semantics for how you say, I need to allow open of this file. And the policies written are tied to syscalls, not to object protection, because...

Why are we...

Because the... So one piece is we want a programmatic filter system, because right now all we have is administrative filter addition. So we don't have a way for a process to say, I only want to be able to open this file. There's no mechanism to do that. You can have an administrative LSM policy that applies to it, but a program can't start up and say, I never want to be able to see anything but this file. You can sort of simulate this with user namespaces and bind mounts in really awkward situations where you only keep one file visible in your VFS view or something like that, but there isn't a generalizable approach to self-confinement or to confining a series of child processes right now.
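The race that makes naive deep argument inspection unsafe, and the reason a cached copy avoids it, can be sketched as follows. This is a deterministic toy model in Python, not kernel code: `UserMemory`, `filter_allows`, `open_double_fetch`, `open_cached`, and the `between` callback (standing in for a second thread that rewrites user memory) are all hypothetical names, and the paths are made-up examples.

```python
# Toy model of the check-then-use race on a user-space pathname.

class UserMemory:
    """Stand-in for a user-space buffer holding the pathname argument."""
    def __init__(self, path):
        self.path = path
        self.reads = 0  # counts how many times the buffer is fetched

    def read_path(self):
        self.reads += 1
        return self.path


def filter_allows(path):
    """Hypothetical policy: only one file may be opened."""
    return path == "/tmp/allowed"


def open_double_fetch(mem, between=lambda: None):
    """Filter reads user memory, then the syscall reads it again."""
    if not filter_allows(mem.read_path()):  # first read: the check
        return None
    between()                                # window for a racing writer
    return mem.read_path()                   # second read: the actual use


def open_cached(mem, between=lambda: None):
    """One copy is taken up front and used for both the check and the call."""
    path = mem.read_path()
    if not filter_allows(path):
        return None
    between()                                # a racing write no longer matters
    return path


victim = UserMemory("/tmp/allowed")
print(open_double_fetch(victim, lambda: setattr(victim, "path", "/etc/shadow")))

safe = UserMemory("/tmp/allowed")
print(open_cached(safe, lambda: setattr(safe, "path", "/etc/shadow")))
```

In this model the double-fetch version prints `/etc/shadow` even though the check saw `/tmp/allowed`, and it reads the buffer twice. The cached version prints `/tmp/allowed` with a single read, which is the property both proposed solutions are trying to get.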
And the API that exists for describing that for LSMs tends to be a separate policy language for mandatory access control, which is then implemented with LSM hooks. So that interface isn't... The LSM hook interface isn't directly exposed to user space because of how an LSM handles its own policy. That's the API that it presents. Whereas if we want a programmatic interface, the only programmatic interface we expose right now is the syscall interface. And so for programmatic confinement, we need to solve this in some way. This is probably now the number one thing where people say, hey, we need this, we need this. Whereas before it was the ptrace thing, which was ugly, but Andy straightened me out.

Who are you saying needs it?

I get it from Chrome folks, from Android, from... Jeff's waving at me. If memory serves, I've heard it from LXD folks. People doing containers and running stuff in a tightly confined area are interested in it.

Could you translate?

Possibly. John just said yes, it could be done. Yeah, Jeff?

So also on the list is discoverable logging. In theory, a lot of logging needs are addressed by the existing audit hook, but this requires a pre-existing global audit rule, as I understand it. Or you need some sort of heavyweight monitoring process, so something that can be examined by a non-admin is what we need. We attempted some ideas with signals, throwing signals, but it was happening in the wrong place, and the ordering of signal handling versus tracing was ugly. And we've had some requests, not just from Android but also from other people, looking at trying to generate seccomp policy.
It is really difficult to figure out what the seccomp policy needs to be, but I'm hoping, perhaps naively, that with ptrace reordered, it'll be much easier to see what's going on in a process externally, because the pre-existing tools will work again with a process under a seccomp filter: you actually see the syscall and then it blows up. So yeah, those are the pieces that I get a lot of requests for.

Any other questions or comments? That's sort of where we are with seccomp. Cool, well, thank you.