Perfect. Hello, I'm KP. I'm going to talk about the BPF LSM, which is something we introduced at the Linux Security Summit 2019 in San Diego. It's already upstream. And because this is the last talk of the summit, we're going to hear the story of how it got upstreamed, and then we're going to walk through some technical details.

So a long, long time ago, well, it was sometime back in 2019, the security analysts at Google came to me and said, "We need some audit logs." I said, "Can't you use audit?" "No, audit doesn't have the data I need." Okay, I'll modify audit. I'll patch audit in the kernel. I'll update auditctl and all the user-space stuff and all our infrastructure that handles audit files. "I want to use this data to prevent something from happening." Okay, then I do it all again for the LSM layer. So I told my boss, we just need a new way to do security in Linux. My boss was like, "Oh, do it." So I said, I'll use BPF and I'll use it with LSMs. And one of my friends, who isn't here for the talk, said, "You should call it KRSI." And Kernel Runtime Security Instrumentation was henceforth born.

And we're off, taking a plane to Portugal. I'm sitting on the bus with Alexei Starovoitov and Alexei tells me, the only way a BPF-based LSM can land is if both the Landlock and the KRSI developers work together on a design that solves all use cases. And he said, oh, by the way, KRSI is a great name. It has got a nice rhythm to it. So we went to the Landlock folks. I've personified Landlock as a lock here. Landlock says, we want unprivileged sandboxing. We have a discussion. This is now happening in Portugal and in Lyon, at the Linux Security Summit. Unprivileged eBPF is still quite some time away. And this was reinforced by a lot of people, especially Jann Horn, who works on speculative execution stuff. So the Landlock and the KRSI folks entered into a truce. We can do something else.
So this is the presentation that you saw from Mickaël previously. And now the battle ensued between the security and the eBPF subsystems of the kernel. BPF likes performance. Tracing has always been performant. But there were discussions on the mailing list, and these are verbatim from the mailing list: "The LSM mechanism has never been zero overhead. And it shall never be." "I don't give a flying fig." And this is my representation of the fig here. And then Alexei got tired of all of this. In the next email on the mailing list, with a nervous me watching the list, it was: I think the key mistake is that we called KRSI an LSM. We should have never called it that. We should have done everything in BPF. And I'm like, no, there is value in that LSM surface.

We entered into a truce between the BPF and the security community. I call it the treaty of impedance, which was cited by Linux Weekly News: land a slow BPF LSM and then make all LSMs faster. So that was a good thing. It got us going and building our solution using BPF. And we will make all the LSMs faster. And KRSI was merged as the BPF LSM. Alexei said KRSI is not going to be in the kernel source code, which is fine. I was not particular about the name.

So we are going to talk about how we use the BPF LSM, and, as a refresher for everybody here in the audience, what BPF is and how it gets loaded. eBPF is this bytecode which is verified by the kernel. The verifier says, okay, the program holds, I have approved it for execution. It checks memory accesses. It checks a lot of other things. It even tries to check for speculative execution. Then there is a JIT which converts this bytecode into native code. In this case, I'm showing x86-64. And at that point, the JITed BPF code just looks like an LSM hook as it would normally be in the kernel. It can understand all the arguments like an LSM hook and do all the things an LSM hook does as well.
So we use something called trampolines, which are ftrace trampolines and, in conjunction, BPF trampolines. There are a lot of internals there, but in the end it is just another native x86 LSM hook.

So how do you write a BPF LSM program? You have this ELF section that says, I am an LSM program and I want to attach to file_mprotect. You read the LSM header file, you look up the hook, you give the program a name, and then you pass all the hook's arguments. And this program currently says, I'm happy with everything, I return zero. You can use BPF helpers to collect more information there. So you can say, I want the PID. You have a helper, bpf_get_current_pid_tgid, and you can get that. But you can also do something more fancy. You can say that I want to access the vm_start field of the VMA structure, and the Clang compiler that compiles BPF is smart enough: it emits relocation information in the BPF bytecode that says, hey, the program intends to access the vm_start field of the VMA struct, so whenever you load me into the kernel, please find the right offset for the vm_start field. And that is pretty cool. It makes the whole BPF program compile once, run everywhere, and portable.

As a part of our LSM implementation, or our security work, we also added atomics to BPF. So imagine you want to count all the mprotect calls. You can do an atomic add of one. And then you can make a policy decision. In this case, it's a very stupid policy decision: you can only have 100 mprotect calls in the kernel.

So now we're going to go to the future work that we want to do and into some more complex areas. This goes back to the original treaty of impedance that we signed. We want to improve the performance of all LSM hooks. And to understand why the LSM hooks currently have a penalty: it's because of something called indirect calls.
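The mprotect counting policy described above can be sketched in portable C11 so that it compiles and runs outside the kernel. This is only a user-space model, not the program shown in the talk: `model_file_mprotect` and `MODEL_MPROTECT_LIMIT` are hypothetical names, and the real hook's arguments are left out. In an actual BPF LSM program the function would sit in an ELF section along the lines of `SEC("lsm/file_mprotect")` and the increment would be a BPF atomic add.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical limit, mirroring the "100 mprotect calls" toy policy. */
#define MODEL_MPROTECT_LIMIT 100

/* Shared counter; a BPF program would keep this in a global or a map. */
static atomic_long model_mprotect_calls;

/*
 * User-space model of the hook body (assumption: arguments omitted).
 * Returns 0 to allow the call and -EPERM once the limit is exceeded.
 */
int model_file_mprotect(void)
{
    /* atomic_fetch_add returns the value held before the increment. */
    long seen = atomic_fetch_add(&model_mprotect_calls, 1) + 1;

    return seen > MODEL_MPROTECT_LIMIT ? -EPERM : 0;
}
```

The first 100 calls return 0 and every later call returns `-EPERM`, which is the shape of decision an LSM hook makes: zero means allow, a negative errno means deny.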
Indirect calls are calls where you call a function pointer, or essentially where the target address of the call instruction is not known to the compiler. The address is in a register. You load the address into a register and then you call that register. And, if you were here for the transient execution attacks talk yesterday, you can trick the speculative execution engine of the CPU into speculating around these indirect calls. The speculation window is larger, especially when the address is not readily available and the CPU is just waiting for the address to be available in that register. So it starts speculating. And there is a mitigation called retpolines that is there for indirect calls.

So what this very innocuous, very simple function pointer call looks like in assembly is this: it has this weird symbol that says call __x86_indirect_thunk_rax. And this thunk for rax is actually a sequence; it jumps to a different part of the kernel executable. And then it does this fancy thing. What it is trying to do is call something called set_up_target. So the CPU starts thinking, okay, at this point, it's going to come back to this capture_spec, right? And it starts running this infinite loop in its speculative execution engine. And then in set_up_target, it does something clever. It moves the address in rax to where the stack pointer points, over the return address, and does a ret instruction after that. The speculative execution engine was trapped into thinking "I'm in an infinite loop," and suddenly execution gets to this new place.

But all of this sequence adds overhead. And imagine all of this sequence getting added in every place, when you are in socket sendmsg and so on. These LSM hook permission checks are done in some key hot paths in the kernel. And having these indirect calls in the hot paths of the kernel is very bad for performance. The other place where there is a performance impact is these empty callbacks.
As a part of the treaty of impedance, the BPF LSM added these empty callbacks. Normally there is no BPF program attached, so they return a default value. And these are called every time on all of these hot paths, even when there's no BPF program attached.

So how do we fix this? As it turns out, a lot of other parts of the kernel have already seen this as an issue. You can look at the KVM source code, you can look at the networking source code. They don't like these indirect function pointer calls in these hot paths. So they've come up with this API called static calls. We don't really need to do the indirect call in the LSM framework, because we know the addresses of all LSM hooks at compile time. The only reason we do indirect calls is that the order can change at boot time. You have this lsm= kernel command-line parameter. And since we want that flexibility, we have to provide some dynamic ordering and at least a function pointer call. What you can do with the static call API is have these N slots which are initially filled with nop instructions, and then at boot time, once you know the order, you can patch them with direct call instructions to the actual LSM hook implementations. And that changes your indirect call, which has this extra penalty, into something that doesn't have the penalty.

Some implementation details. This is for the reviewers who are going to look at these patches, which are going to come out on the list. We sent an RFC some time back, but we've been iterating on the patches and trying to measure improvements. There is this data structure called the security hook slot, which has a couple of things called key and trampoline that are needed to define a static call. There is the security_hook_list data structure, which has a legacy name, but I didn't bother to change that name. We could change it right now if we want.
This is what gets initialized by LSM_HOOK_INIT. And then there is the static key. The static key is, again, a branch that tells the CPU: hey, the static slot is empty, so I will change what is in between to a nop and we'll jump to the next instruction. These are called static branches and, again, they are hot path optimizations used in various parts of the kernel.

The other implementation detail is what information gets provided by the LSM itself. It says: this is the slot I'm going to be assigned, this is my callback, this is the LSM that I belong to, and this is the default state. The default state is kind of important for the BPF LSM, because for the BPF LSM the default state is disabled. So we extended the LSM_HOOK_INIT macro to have LSM_HOOK_INIT and LSM_HOOK_INIT_DISABLED, where the BPF LSM would call LSM_HOOK_INIT_DISABLED, because initially you don't have any BPF program, and only when you have a BPF program do you want the static key that guards the slot to be active.

There is some fancy macro magic that makes all of these static calls happen. Optimizations are not free in terms of code complexity, but given that LSMs are in places where you really are talking about hot paths, like sending a packet or opening a file or writing to a file, it becomes worth it from that perspective, in my opinion. So you have the security for-each-static-slot macro, which unrolls the for loop, and then it goes into this section which checks whether the static key is enabled and then actually does the static call. This is essentially a redefinition of the call_int_hook macro, if you look at the LSM source code. And the x86 retpoline thunk for RAX now changes into these direct calls. These direct calls are calls to the static slot. The slot has the call instruction pointing to the actual LSM hook implementation. So it changes from an indirect call to a direct call, with some macro magic and some runtime patching.
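The control flow of the unrolled, key-guarded slots can be modeled in plain C. This is a rough user-space sketch, not the kernel patches: every `model_` and `MODEL_` name is hypothetical, the slot count of 3 is an assumption, and because user space cannot patch call instructions, a function pointer stands in for the patched direct call and a boolean stands in for the static key.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slot count; the real number depends on the LSMs built in. */
#define MODEL_SLOT_COUNT 3

typedef int (*model_hook_fn)(int arg);

struct model_slot {
    bool enabled;      /* stands in for the static key guarding the slot */
    model_hook_fn fn;  /* stands in for the patched direct call target */
};

static struct model_slot model_slots[MODEL_SLOT_COUNT];

/* "Boot time": fill a slot once the LSM order is known. */
void model_slot_init(size_t idx, model_hook_fn fn, bool enabled)
{
    assert(idx < MODEL_SLOT_COUNT && fn != NULL);
    model_slots[idx].fn = fn;
    model_slots[idx].enabled = enabled;
}

/* "Run time": flip the key, e.g. when the first BPF program attaches. */
void model_slot_set_enabled(size_t idx, bool enabled)
{
    assert(idx < MODEL_SLOT_COUNT);
    assert(!enabled || model_slots[idx].fn != NULL);
    model_slots[idx].enabled = enabled;
}

/* One unrolled step: skip a disabled slot, stop at the first non-zero result. */
#define MODEL_CALL_SLOT(n, arg)                      \
    do {                                             \
        if (model_slots[n].enabled) {                \
            int rc = model_slots[n].fn(arg);         \
            if (rc != 0)                             \
                return rc;                           \
        }                                            \
    } while (0)

/* Model of an int-returning hook call with the loop unrolled by hand. */
int model_call_int_hook(int arg)
{
    MODEL_CALL_SLOT(0, arg);
    MODEL_CALL_SLOT(1, arg);
    MODEL_CALL_SLOT(2, arg);
    return 0;
}

/* Two hypothetical callbacks used to exercise the slots. */
int model_allow_all(int arg) { (void)arg; return 0; }
int model_deny_negative(int arg) { return arg < 0 ? -EPERM : 0; }
```

A slot registered as disabled, as the BPF LSM would do, costs only the guard check until something enables it; in the kernel that guard is itself patched to a nop or a jump, so the disabled case does not even evaluate a condition.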
Some feedback given to us initially on the RFC was that we had measured a syscall in a tight loop. But for Redis, there is an open source tool called redis-benchmark. We ran this with SELinux enabled and noticed that it improves performance by 3%.

There are some limitations for now. There are some hooks that don't use the call_int_hook and call_void_hook macros. We've not gone and updated them, because each of them needs its own separate for-loop unrolling macro. We can do that, but these are not in the hot paths of the kernel. They're typically LSM blob management hooks like secid_to_secctx and so on. But again, we can implement per-hook macros for these as well.

There is more future work here that needs to be done with regard to unintended side effects. These default callbacks that the BPF LSM adds, we thought they had no side effects. But they actually do end up having a side effect in some corner cases. Look at this function called security_inode_setxattr. There's a comment there that says, SELinux and Smack integrate the capability check call, so assume that all LSMs supplying this call do so. Actually, the BPF LSM doesn't do that. It returns zero as a default value to the call_int_hook macro. And if that value is zero, the cap_inode_setxattr check is omitted. So what you can do is overwrite security xattrs as an unprivileged user. If you have SELinux enabled, this doesn't happen, so it was not as bad as it could be. But you can still vandalize the system: you can overwrite a security xattr, security.capability, on /usr/bin/ping, and it then returns a non-zero error code, which is how we noticed this. This was reported by Jann Horn internally. So we need to fix these side effects, and we sent a patch to try to fix that. The bottom line is that this LSM_RET_DEFAULT in the default callback can lead to subtle corner cases. This was not the only issue we noticed.
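The setxattr corner case can be reduced to a few lines of C. This is a simplified model under stated assumptions, not the kernel source: the `model_` functions are hypothetical, and the sentinel value 1, named `MODEL_NO_LSM_DECISION` here, is assumed to mean that no LSM supplied the hook, which is the case where the framework falls back to the capability check.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumed sentinel: the hook chain reports that no LSM made a decision. */
#define MODEL_NO_LSM_DECISION 1

/* Model of the capability check: unprivileged callers are refused. */
static int model_cap_inode_setxattr(bool privileged)
{
    return privileged ? 0 : -EPERM;
}

/*
 * Model of the framework-level function. lsm_result is whatever the
 * LSM hook chain returned for this setxattr request.
 */
int model_security_inode_setxattr(int lsm_result, bool privileged)
{
    /* Only when nobody decided does the capability check run. */
    if (lsm_result == MODEL_NO_LSM_DECISION)
        return model_cap_inode_setxattr(privileged);

    /* Any other value is trusted to already include that check. */
    return lsm_result;
}
```

With the sentinel, an unprivileged caller gets `-EPERM`. With a default callback that returns 0, the same caller gets 0, because the framework assumes the LSM that answered already ran the capability check.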
This was pretty serious, the most serious one we noticed, but there were other side effects too. The secid_to_secctx default value actually broke audit, right? inode_copy_up_xattr was a similar case, where returning an incorrect default value caused crashes.

So, a few ideas. We could have a new error code, ENODECISION, or use an existing error code that signifies: hey, I'm an LSM and I don't have a decision to make given the information I have. This is also helpful if you're in a BPF program and you're only doing auditing, and you want to return a value that says, okay, I don't want to bother with a policy decision, or I don't have the right information for a policy decision. Or you could use static keys to enable or disable LSM hooks, which the proposed patch for the performance improvement already does, but that doesn't solve the case of a BPF program that returns ENODECISION because it's only going to do audit. I mean, you could look at the code more carefully. You are crafting an LSM at this point, right? We should be more careful, but we made mistakes even though we were trying to be careful. So it may be better to fix it at a systemic level here.

The other future work that is going on currently is program signatures. We want to sign BPF programs, and there's going to be a BoF later where we're going to discuss BPF program signatures and IMA. The thing that we want to make sure of, or the goal, is that when you're compiling a BPF program, the build chain for your BPF program is trustworthy: I can trust where this BPF program is coming from. There is a dilemma here. You can't really sign the object file that you generate at compile time, because you're accessing struct fields and those accesses are modified at load time, and you're setting up all these maps. The actual instruction buffer gets modified and your signature changes. Your signature is not stable there.
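A small C sketch can show why a digest taken at compile time stops matching after load-time fixups. It is only an illustration under loud assumptions: a 64-bit FNV-1a hash stands in for the digest that a real signature would cover, the instruction buffer is a toy array of 32-bit words, and `model_relocate` is a hypothetical stand-in for the loader filling in a struct field offset.

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a over a byte buffer; a stand-in for a cryptographic digest. */
uint64_t model_digest(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t h = 0xcbf29ce484222325ULL;  /* FNV-1a 64-bit offset basis */

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;           /* FNV-1a 64-bit prime */
    }
    return h;
}

/*
 * Hypothetical load-time fixup: the placeholder word at index idx is
 * overwritten with the field offset found on the running kernel.
 */
void model_relocate(uint32_t *insns, size_t idx, uint32_t real_offset)
{
    insns[idx] = real_offset;
}
```

Hashing the buffer, applying `model_relocate` with a value different from the placeholder, and hashing again gives two different digests, so a signature computed over the compile-time bytes cannot verify against what the kernel finally sees.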
So if you try to verify that signature in the kernel, you're going to get a different signature. Your verification is always going to fail if you do that. The BPF folks tried to solve this. They implemented something called a loader program. A loader program is actually the trace of all the operations that are performed to load a BPF program. So the instructions of this loader program are the instructions that represent all the syscall operations, plus the instructions of the original program itself. The instructions of the loader as a whole are then stable, because all these modifications now happen in the kernel, as part of the BPF loader program. So you can sign that instruction block. The BPF syscall doesn't know about your file at all. The BPF syscall only knows about the instruction buffer. There's no file argument to the BPF syscall.

So what does this look like? You have the program, the loader program, and the signature, and then, in the BPF prog alloc hook, you can write this LSM gatekeeper program, which is again a BPF program, but you can have this BPF program be a part of the kernel itself or load it in a trusted stage. And there are people on the BPF mailing list, where we're trying to implement this BPF helper, bpf_verify_pkcs7_signature. The LSM gatekeeper program can also have more runtime policy information. I was trying to understand this from the keychain talk as well, but this is where you can sort of have a runtime policy. You can say that for bpftrace, which generates BPF programs at runtime, this is the allow-listed hash for it, and it can run unsigned BPF programs, or programs signed by a different key, or something. That's where your flexible policy gatekeeper engine sort of lives in the kernel. We will do another BoF on this.

I'm just running short on time, so, future work number four. This is the xattr support. Again, for that bpftrace use case, we want to allow just bpftrace to load BPF programs. There is very heavy prototyping going on here.
The xattr patches are on the mailing list right now, and the domain information, or whatever, is not real. It's just an example that we cooked up for this talk. So, simple domains: the BPF domain, security.domain equal to bpf, can load BPF programs, and the xattr domain, security.domain equal to xattr, can set extended attributes. And then you can sort of see that xattr -l on /usr/bin/bpftrace shows security.domain is bpf, right? And then we've added a new helper to the kernel, the BPF get xattr helper, which allows you to read extended attributes from files, and you can use this in different parts of the kernel, on all of the LSM hooks. You need to be using sleepable BPF programs, but I'm going into implementation details.

So the LSM hooks that you can use there: in the bprm_committed_creds hook, you read the extended attribute from the executable, and then you store the security domain information in the task blob. For the task blob, we've added helpers to BPF to implement security blobs, which we call BPF local storage. So you can say bpf_task_storage_get, and then you can set your custom enum, whatever domain you want, and you can set that information there. In the task_alloc hook, when a task is forking or a child task is being created, you can transfer the security domain information to the child, or you can do even more complicated things where you don't want to transfer this information. In the BPF prog LSM hook, you can deny BPF syscalls for any task outside the BPF domain. And in the inode_setxattr hook, you can deny attempts to set extended attributes from outside the xattr domain.

Okay, so that was my talk. I think I finished on time, with room for questions as well.

Yes. Yes, I think these hooks that we've added, we call them no-side-effect hooks, but they're really not.
If the LSM framework changes in a certain way, you end up having a side effect in some corner case, and then it's either angry developers or angry bots who find this, and it breaks functionality.

Yes, yes. Yes, I think we proposed an initial implementation where the LSM was guarded by static keys, but yeah, we didn't want anything like that in the LSM. We decided that we don't want anything in the infrastructure that is guarded by a static key, so we ended up implementing these default hooks. But what we need is that this hook shouldn't run by default, because there is no default hook that has no side effects. Or we update the infrastructure to handle the no-decision case for default hooks.

Thank you. We will fix the bug. We should fix the bug.

So, at the organization I come from, we have an agent that was using a combination of a kernel module and audit logs to send telemetry information from our internal machines, like laptops, and we changed that agent to use our BPF programs instead of the kernel module and audit. So now we are using BPF programs for that. We've rolled it out to all our machines right now.

I think this was on a couple of slides that I had in backup: while I say that mandatory access control in the LSM and auditing need to go together, they actually don't, because sometimes you want to do a policy check really early on, and you want to do an audit really late in the life cycle. So ideally, if we come to an agreement that LSM security equals access control plus telemetry, or plus information, then we can add new LSM hooks. And we have a few use cases where we would like to propose new LSM hooks as well. Thank you.