Howdy, my name is Kees. So as mentioned, this is about seccomp, which is not an LSM, and all the small LSMs. There's going to be a little bit of background to distinguish the two, and then I'll get into a quick overview of LoadPin, and Yama, and finally seccomp, which is a bit more complex than the others.
So as you've already seen, there's a bunch of larger LSMs that exist in the kernel. The way I look at it is that they have a comprehensive policy; they're full-blown mandatory access control systems. They mediate all kinds of different things: files, networks, et cetera. The small, or minor, LSMs have a really, really narrow policy, and it's usually a fixed policy.
So the first of these is LoadPin. Easy to turn on. The main use for LoadPin is this: think of module.sig_enforce, where the resulting kernel will only load modules that have been cryptographically signed and will refuse to load anything else. If you feel that is redundant in your environment, LoadPin is for you. The idea is that if you already have a trusted file system that you've cryptographically verified all the way down from your root of trust, with all the bells and whistles you've heard about in some of the other presentations, then you don't need to waste storage space on module signatures and CPU time on checking a signature for something you already verified at the file system level. So LoadPin simply makes sure that what you're loading on the kernel side actually came from the file system you have been cryptographically checking in some way. Chrome OS uses this with dm-verity, because dm-verity has already done the block-level verification to say that everything's okay. And the benefit of having this as an LSM hook is that LoadPin can check things beyond just modules.
There's a lot of other stuff that the kernel reads out of file systems and uses for its own purposes. For example, kexec-ing into an entirely different kernel, loading firmware into various devices, security policies, certificates: there's a whole bunch of stuff that the kernel uses for itself that it gets off of a file system. So there's no reason to do an additional signature check when you have already done that verification.
LoadPin is pretty stable, perhaps because only Chrome OS is using it. I'm not sure; I haven't heard from any other people. With it being pretty stable, I poked a couple of things in for 4.20. As part of some of the stacking work that Casey talked about, it became obvious that there was some terminology confusion in LoadPin. LoadPin was always enabled; when it wasn't enforcing, it just wasn't paying attention, though it was still doing some tracking in the background. The confusion was about enforcement versus enabling, so I borrowed the "enforce" terminology. And then I finally got frustrated enough with some of the error reporting that I went and figured out how to get human-readable device names into the reports.
So, a quick demo, which has lost all of its colors. What in the world? One second. They're in the preview, and now they're gone. Hold on one second; I'm going to see if I can fix this a different way. I'm not crazy, right? You can see the colors are there. I'll use my own link. Okay. It is really helpful to be able to see. Okay, LoadPin demo. Back to this.
This is a quick view of, say, /tmp in red and /trusted, which let's say you've been cryptographically checking, and you go and do an insmod on some module out of /trusted. If you look at the dmesg output, this one happens to say: hey, it turns out the backing device for this file system is in fact writable, so I'm not even going to pretend that this is an enforceable situation. But for testing, sure, we'll pretend that it's loadable, and you can turn this on and off. The idea is that you can test your LoadPin setup in an environment that's writable. Once you actually have a read-only backing block device, it will not print that out. Then it reports in yellow what type of thing it's trying to load and says: yay, I have pinned this module; I'm now going to trust /trusted. Then you can unload the module, and then you try to insert another module out of /tmp. It says the kernel module has been denied out of /tmp: I don't like you. And then if you unmount whatever file system you had been trusting, LoadPin says: hey, you took the file system away, I have no idea how to find it again, so I'm never going to let you load anything else again. Anyway, that's the rundown on the easy LoadPin things.
Yama: this was the first of the stacked LSMs. So sorry, not sorry. It narrows the scope of the ptrace access checks in the kernel, because that used to be quite a wide-open thing: you could just ptrace anything running as your user ID. The basic goal was to increase the amount of time it would take someone who has broken into your device to steal all your stuff. If you have password-protected things, someone is going to drop malware, for example replacing your GPG binary, but then they have to wait for you to actually run GPG before they can get at your credentials.
Whereas if they can break in and ptrace everything on your system, they can pull credentials out of running programs, and they can attach to your SSH tunnels and jump to the next machine you happen to be connected to. ptrace is kind of scary in that you can do anything as that user and expand the scope of your attack immediately, without having to wait for a moment when you could take advantage of some user action or trick someone into doing something. So Yama is certainly not a silver bullet, but it does change the access controls on how that's supposed to look.
This is also pretty stable, and most distros are using it. And of course I had to write that, and syzkaller, which has been running for I don't know how many hundreds of thousands of hours in various places, just sent one bug that it hasn't been able to reproduce in four days. So there may be an RCU bug in here, and a couple of people are helping me look at it. And I may have some future work cut out for me, because Jann Horn, who likes breaking the world, has started looking at Yama.
So, demo here; again, colors are very important. As I said, before Yama, with the standard DAC access checks on ptrace, if you are the blue evil attacker, you can get at anything that has the same user ID. This is an example process tree listing. The first two are UID root and everything else is me. So that attacking process could ptrace anything owned by me, short of things that had explicitly made themselves undumpable, but very few things do that. This isn't great, because it can jump around and pull things out of secret. Of course, with DAC, it can't go read systemd or some other user's stuff.
With Yama, all of that goes away. It can't get into anything, because of the check Yama now does on ptrace. It says: all right, let's look at secret, and I'll walk up the ancestry tree. Is bash trying to ptrace me? No. Is systemd trying to ptrace me? No.
Okay, I'm not going to let this happen. However, this continues to work in situations where you're debugging. GDB launches some program and says, I want to ptrace that program, and Yama starts walking up the ancestry tree and says: is GDB my parent? Oh yes, it is. Okay, this is allowed.
But this would break crash handlers. You'd start some program, say Chrome, it would crash, and then it would fork and exec a crash handler to deal with it and attach back. That would fail, because Yama would walk up the ancestry tree and say: is the crash handler an ancestor of mine? No, it's not bash, it's not systemd in this example. So that wasn't great, and we had to add some whitelisting that the crashing program could declare just as it's beginning to die: wait, wait, this process I just forked, I want it to be able to ptrace me. So you declare a whitelist, and then the crash handler is treated as implicitly part of your ancestry tree for the purposes of the check.
And then there's seccomp. This is not an LSM. LSMs, as discussed earlier, are hooks into the kernel in lots of stable places where you have access to all the resources you want to be checking. They don't cover all syscalls, because not all the things you want to mediate from an LSM are available there. But there are a lot of things you don't necessarily want programs to be doing, so there needed to be a way to filter at the system call level. One problem with this, of course, is that a malicious person could try to trick a setuid program into doing something dumb. You could, for example, convince it that it had successfully dropped its privileges when it hadn't. So if you were able to filter a setuid program, you could convince it to do bad things. The no_new_privs bit was invented to deal with that: if you exec something with no_new_privs set, it can't actually gain setuid privileges.
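To make the no_new_privs rule concrete, here is a toy model in Python. It is only a sketch of the rule as described in the talk, not kernel code: the `Task` class and the `install_filter` and `exec_setuid_root` helpers are hypothetical names invented for illustration, and the model ignores capabilities gained across exec.

```python
import errno
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a process; not a kernel structure."""
    euid: int
    cap_sys_admin: bool = False
    no_new_privs: bool = False
    filtered: bool = False


def install_filter(task: Task) -> None:
    """Attach a filter only if the task is privileged or has no_new_privs set."""
    if not (task.cap_sys_admin or task.no_new_privs):
        raise PermissionError(errno.EACCES, "need CAP_SYS_ADMIN or no_new_privs")
    task.filtered = True


def exec_setuid_root(task: Task) -> Task:
    """Model exec of a setuid-root binary; the flag and the filter are inherited."""
    # With no_new_privs set, the setuid bit is ignored and the euid is unchanged.
    new_euid = task.euid if task.no_new_privs else 0
    return Task(new_euid, task.cap_sys_admin, task.no_new_privs, task.filtered)


user = Task(euid=1000)
try:
    install_filter(user)          # unprivileged and no_new_privs not set
except PermissionError as err:
    print("refused:", errno.errorcode[err.errno])

user.no_new_privs = True
install_filter(user)              # now accepted
child = exec_setuid_root(user)
print("filtered:", child.filtered, "euid after setuid exec:", child.euid)
```

In this model an unprivileged task can never end up both filtered and running with elevated privileges, which is the property the no_new_privs bit is meant to guarantee.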
So you have to either be an admin or set no_new_privs, so that you can't actually trick anything into doing this. That was a problem very early on in the seccomp design, and the fix that got added is now used pretty widely with seccomp.
So this is in all kinds of things. I went and looked at a list and was terrified to see how much stuff is using seccomp. But it's pretty easy to add. The examples I'll be using are with Minijail, which was built for Chrome OS, but we use it in Android now on a lot of things. It's a collection of all the different container options that you can pick and choose from. But if you're going to do normal filtering, I strongly recommend libseccomp; it makes things a lot easier. And if you want to do really special things, you probably need to learn BPF. I'll note that this is actually a subset of classic BPF, the seccomp portion, and it's definitely not eBPF yet. The link here is to a great presentation Michael Kerrisk did a couple of days ago; he gets into a lot of the details.
So the basic bit of seccomp is BPF, the Berkeley Packet Filter, which was a way to look at packets in memory in the kernel. It was looking at a linear bit of memory. So instead of looking at a packet, you can look at the details of a syscall: which system call number it is, what architecture it is, the instruction pointer, and the basic arguments. You can't actually read the memory that any of these arguments might be pointing to, because you would be racing the kernel, which later copies that same memory in from user space. So there's a standing problem with that, and seccomp only looks at the actual pointer values.
Once you make the matches you want, you can have the filter return a lot of different actions. If there are multiple filters running against the syscall, the most severe action, the one lower on the list, is what wins.
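As a rough illustration of what a filter gets to look at, the sketch below packs the fields just described into the 64-byte layout of `struct seccomp_data` and models how the most severe result wins across several filters. It is a user-space model, not a real filter: the constant values are the ones defined in `<linux/seccomp.h>` and `<linux/audit.h>`, while the helper names and the pointer values are made up for the example.

```python
import struct

# struct seccomp_data: int nr; __u32 arch; __u64 instruction_pointer; __u64 args[6]
SECCOMP_DATA = struct.Struct("=iIQ6Q")

# Return actions, as defined in <linux/seccomp.h>
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_KILL_THREAD = 0x00000000
SECCOMP_RET_TRAP = 0x00030000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_TRACE = 0x7FF00000
SECCOMP_RET_LOG = 0x7FFC0000
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ACTION_FULL = 0xFFFF0000

AUDIT_ARCH_X86_64 = 0xC000003E
NR_OPENAT = 257  # x86_64 syscall number


def pack_syscall(nr, arch, ip, args):
    """Build the linear buffer a filter would inspect (hypothetical helper)."""
    padded = list(args) + [0] * (6 - len(args))
    return SECCOMP_DATA.pack(nr, arch, ip, *padded)


def action_rank(ret):
    """Compare actions as signed 32-bit values: lower means more severe."""
    action = ret & SECCOMP_RET_ACTION_FULL
    return action - (1 << 32) if action & 0x80000000 else action


def most_severe(results):
    """Pick the winning result when several filters ran on one syscall."""
    return min(results, key=action_rank)


# openat(3, <pointer>, 0): the filter sees the pointer value, never the path string.
data = pack_syscall(NR_OPENAT, AUDIT_ARCH_X86_64, 0x7F0000001000,
                    [3, 0x7FFD12345678, 0])
nr, arch, ip, *args = SECCOMP_DATA.unpack(data)
print(len(data), nr, hex(args[1]))

winner = most_severe([SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO | 22, SECCOMP_RET_LOG])
print(hex(winner))
```

The second argument shows up only as an address, which is why a filter can compare it against a constant but cannot check which file name it points to.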
So you can allow it, you can log it, or you can trace it, if you have a ptrace manager for the process. You can have the syscall get skipped and fake an error for it; you could use that on open instead of killing the process, so you would say, oh, that file is just not available, and the program continues running. Trap delivers a catchable SIGSYS signal, if you want the program to deal with it in some way. If you say kill thread, it'll kill that specific thread of execution. If you say kill process, it'll take out the entire thread group.
Recent developments: Tycho has been working tirelessly to get user notification in. There's a problem with the trace return here: say someone is already running GDB, or something else that uses ptrace for its normal operation, inside a container or some other situation where you want to use this. You can't really do a trace, because the process is already being traced, but you still want to get information out of filters, and block, and do things. So this is about trying to get away from using ptrace to get notifications about what's happening in seccomp. At the very least, we know the things it won't be; that's what we've got out of the discussions. But I think there's more to happen there.
And for a quick demo across a couple of pages, I'm going to try to get this lined up. No, never mind. So I'm using Minijail as a quick demo. What it's got here is -S, which I mentioned earlier, for specifying a seccomp policy, and that's in some file called cat.policy. And we're just going to run the program cat to print out the policy, so I can have a self-documenting demo. And this fails immediately. Okay, I got a SIGABRT out of it. Why? You go look at syslog, because that's where Minijail puts its information, and it says: okay, some permission denied. What error code is that? Let's go look at the man page for seccomp.
EACCES says I either need to be CAP_SYS_ADMIN or set no_new_privs. Well, I'm not CAP_SYS_ADMIN here. Is there a way in Minijail to set no_new_privs? Yes, it's -n. So okay, I need to add -n. Now it'll run. And this is a basic policy with the minimum set of syscalls needed to have cat actually run. You need openat and fstat, mmap. You can read all of those. But they're a relatively basic set of syscalls to look at a file, open it, print it out, close things, and clean up.
Now if I change a little piece of this: for openat, instead of saying it's allowed, I can have it return EINVAL. So this is the return-errno piece. I run it with that policy, and you can see immediately that it tries to go along and do its job, hits the open, and the seccomp policy gets in the way and says, no, that's invalid, matching what I asked it to return.
So okay, what if we want to change what we're running? What if I want to run ls on this file? What happens? I get another error, and I have to look in syslog again, and it says: okay, I died with signal 31. Yes, that's SIGSYS. It was killed because it made some new syscall. How do I figure out what system call it was actually using? Let's use strace, and I see it starts to try to run ioctl and is immediately killed by SIGSYS. So okay, I'll just add ioctl, and I can get fancy now and actually check arguments. I say I want the first argument to match what's here for my standard in, out, and error. And you just repeat this cycle until you get all the syscalls that you think you're going to need for your program. This is the basic way to go through it.
That's it, those are the three. So yeah, that catches us up on time a little bit. Questions?
Is it a question? Could you say something again about the race that you mentioned with dereferencing the pointers?
Yes, so here, I'll go back up.
So, for example, let's say you've got a management process that's watching some seccomp'd program, and you get a ptrace trap out of it. You say, okay, I want to look at this open and figure out what file it's actually asking to open. So you look at the arguments, and the first argument is a pointer into the user space memory of this process. You go look at what file it is, and it says, I want to open /var/log/syslog. You say, okay, yes, I approve of that, let's let it continue, and you release the process. And right after you release it, some other malicious thread of the process, which has been sitting there waiting for some signal, quickly changes what was written at that location. Then you hit the kernel, and the kernel reads that memory out of user space again, and it's completely different. So it's a relatively easy total bypass of any kind of interception like that.
Fixing this requires one of two things. One is interaction with the LSM layer, to get at the information once it's been processed by the kernel. That's sort of what the Landlock LSM is looking to do, because it gives you a programmatic way to describe an LSM policy. Right now the large LSM policies are defined by the system administrator, not by the authors of a program.
The other way to solve this is to completely rearchitect how the kernel processes arguments, which maybe will happen. It would be an interesting defense against some aspects of cache timing attacks and other things. But it would be a lot of work to have, effectively, a cache of the memory you want to read out of user space. You'd declare, I want to be doing these things, and this is what these arguments mean. And in a normal syscall, the kernel would go off and say, I don't have a cache of this yet; I'll go copy it into my memory and then I'll use it.
And if you had seccomp in between, it would do that copying first, and then if you let the syscall continue, it would use the cached copy. So it's an idea to try to minimize or eliminate the race for that kind of thing.
More questions? If not, let's thank the speaker. All right, thanks.