So, the next talk before lunch: KP Singh is going to be talking about Kernel Runtime Security Instrumentation. Thank you, James. I couldn't have asked for a better segue into my talk than the previous question, so that's amazing. I'm KP. I work at Google in Switzerland, on detection and response. And I'm going to talk about Kernel Runtime Security Instrumentation. We're going to talk about why we are doing this. Then we'll go into how it actually works, and the alternatives that exist right now. We're going to do a little case study based on some of the prototype work we've done. As an added bonus, we're going to have a performance comparison. It's not really a true comparison study at this point, because this thing is at the prototype stage, but we want to figure out whether we should actually be doing this in the first place or not. We're going to do a small demo, beware of Murphy's law, and we're going to throw some design questions out to the community here. Before I start, in case I run out of time later: thank you to everyone who contributed to this. Really, really appreciated. So, motivation. Before motivation, I'm going to classify security into two different aspects. The first aspect is signaling, which is really important: bits of information that are not necessarily all bad, or don't necessarily relate to malicious activity, but that could imply maliciousness happening on the system. The next part is mitigation: when you use those signals and prevent something bad from happening. Signals also form the broader context. When you detect something bad, you want more context on it, on what was going wrong, and that's where signals come in as well. So these two aspects of security, signaling and mitigation, go hand in hand here. How do we do signaling and mitigation right now in Linux land? I'm going to discuss the issues we have with this little example here. Let's say I want to update the audit subsystem to log environment variables.
I change the kernel code. I change the policy language. I change the user space stuff. You update your user space application that builds on top of your audit stuff, and then you change the pipeline that is digesting all this audit information to figure out if something bad is happening there. And then you realize, oh, something is really bad. Let's say there's a particular malicious actor that has a certain LD_PRELOAD signature, and now you want to update the LSM. The mitigation is disjoint at this stage from the signaling aspect. So you go and open up the LSM. You update the policy language for, say, SELinux or AppArmor. You add detection or mitigation logic to seccomp. And by that time, the threat has already moved on, and you're one step behind in this case. So that's the primary motivation: to make the signaling and mitigation work well together. I'm also going to talk about signals other than environment variables. Think of a process that executes and then deletes its own executable. That's a signal that you get. People do that; when I'm working on my workstation, I could do that, just remove the binary. It doesn't imply that I mean harm to the system or the organization, but it could be an indicator of something bad happening on the system. Another signal, and this one looks really shady: a kernel module that loads itself and then hides itself from /proc/modules or whatever. Why would anyone do that? Well, you need to figure out why, and then, if there is concrete evidence that it's bad, stop it from happening. And we talked about the other signal as well: suspicious environment variables, like LD_PRELOAD being set and then something bad being done with it. Not necessarily bad in itself, but it could indicate something bad happening on the system. Now I'm going to talk about mitigations.
Some of these mitigations are applicable to servers; some are applicable to both workstations and data centers. Sometimes you shouldn't be mounting USB drives on servers. Maybe you want to whitelist, at some point, certain USB drives to be mounted. Maybe you want to block them dynamically after a certain point. That's one of the mitigations you could do. You might want a dynamic whitelist of known kernel modules, so you know the hashes of all the kernel modules you want to allow to load on the system. This is more relevant for data centers. Most developer audiences don't really compile kernel modules on the system either. I would not like this to happen on my workstation, but again, you would want this in a broader fleet of production machines. And this is an interesting use case: you want to prevent known vulnerable binaries from running. Let's say that you, as the operator or somebody who deploys the operating system to your organization, realize that some core library has a vulnerability. Now you want to ask everybody to patch their binaries. If you have thousands of binaries, it's going to take time to update all of them. So you want to detect some fingerprint of that binary and prevent it from running in production. You know that you can deploy these detection rules much faster than you can deploy the binaries; that's the nature of binaries. Binaries need to be recompiled, recertified, and there's a release process involved. And you have your own concrete release process for deploying these policies. The key thing to take away from here is that whitelisting and blacklisting can be very dynamic in nature, and we want to enable and facilitate that. And this is the core of the motivation: can we make it easy to add signals and mitigations, in a unified way, via a single API? So, how does it work?
I know you've probably read the abstract, and you've heard of eBPF, maybe. But show of hands, how many of you know what eBPF is? Awesome, this is a great audience. So what we're trying to do is add a new program type. If you know eBPF, a helper is essentially a number that maps to a function in the kernel. Programs are also verified to make sure they complete; the language is deliberately not Turing-complete. Think of those opcodes, or helpers, as an API to define those signals and mitigations. Rather than updating audit and SELinux, you define these helpers and functions, you do your signaling, and then you do your mitigation using the same API. What we want to point out is that this is not the BPF tracing program type, and it's not the socket or cgroup-socket program types. It is security-focused, so it's going to be a security-focused API here. And we can build the KRSI LSM, or your dynamic policy logic, using these helpers. Next I'm going to talk about why Linux security modules, because that's a question I get asked a lot. We want to target security behaviors rather than the API. A syscall is an API that you expose to the user, and an LSM hook, essentially, is a funnel. You can have ten different APIs that all lead into, let's say, the execution of a process. You want to execute a process; there can be multiple syscalls to do that. But the LSM hook funnels all of these: it caters to that security behavior. I can tell you that a while back we missed out on instrumenting an exec variant, and we missed logging and auditing process executions for some time. It's been a while, though. And we want to benefit the LSM ecosystem as well. If you get enough feedback from the security community, if they're using your LSM hooks, they're going to tell us: maybe that behavior is something you should be hooking into. And if they can easily use the LSM hooks and develop on them, the LSM ecosystem benefits as well.
So that is roughly why we are targeting the LSM rather than anything else. I'll go into the alternatives, like seccomp, and give a brief update on those later in the presentation. I have an interesting anecdote. When I went to our security engineers, they would say something like: I want to log LD_PRELOAD on process execution. The interesting thing to note here is that they never told me they wanted to hook into the exec system call. They don't even care what the exec system call is. They care about the behavior, when you execute a process, rather than the API used to trigger that behavior on the operating system. That's what I really took away from that simple line of conversation. OK, so how does it work? Each behavior, or each hook, or whatever you may call it, from the LSM is a file in securityfs. Here is the cool security engineer telling me: I want to log process executions. I go and write a BPF program; that yellow thing there is a BPF program. You open the hook. bpf(), if you don't know it, is a multiplexed system call: the same system call behaves in different ways based on the flags you pass to it. So you load the program, and you get a file descriptor back. Then you mash these two file descriptors together in a process called attachment. This logic is executed when that particular behavior is encountered, along with the other LSM hooks: AppArmor, SELinux, whatever, and KRSI at the end. In this LSM hook, you can audit, you can say no, and you essentially have a way to communicate with user space. I'll explain how that communication happens in a bit; it's still an open question, though. For all the programmers out there, I put up this simple data structure. Just look at it as a mapping from a behavior, or an LSM hook, to the set of programs that you need to run. The key takeaway is that there can be multiple programs attached.
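The mapping just described can be sketched in plain userspace C. This is purely illustrative and not the actual kernel data structure: the names `krsi_hook` and `krsi_run_progs`, the fixed-size program array, and the "first non-zero return denies" convention (which mirrors how LSM hooks combine decisions) are all my own assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch only: a hook keeps a list of attached programs,
 * all of which run when the behavior is encountered.  A program
 * returns 0 to allow and a negative errno-style value to deny. */
typedef int (*krsi_prog_fn)(const void *ctx);

struct krsi_hook {
    const char *name;       /* e.g. "process_execution" */
    krsi_prog_fn progs[8];  /* the attached programs, in order */
    size_t nr_progs;
};

/* Run every attached program; the first denial short-circuits. */
static int krsi_run_progs(const struct krsi_hook *hook, const void *ctx)
{
    for (size_t i = 0; i < hook->nr_progs; i++) {
        int ret = hook->progs[i](ctx);
        if (ret != 0)
            return ret;     /* deny */
    }
    return 0;               /* allow */
}
```

The point of the sketch is simply that attachment is additive: a second security team can attach its own program to the same behavior without touching the first one.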
So attaching multiple BPF programs is a thing here. But that's for all the programmers like me. Again, before we go further into the example use cases, we want to keep one thing very clear: we want to keep the helpers very precise and granular. If anyone has seen or used the BPF helpers in the tracing part, they are really good for tracing; I'm not criticizing the helpers themselves. But bpf_probe_read() gives you access to raw memory from the kernel. And this is not a security concern, it's an API concern: you get that blob of memory, you get your kernel headers from somewhere, and then you try to overlay the kernel headers' structures onto that memory. This makes it really hard to deploy at large scale in a backward-compatible fashion. So we want to make sure that we don't expose any kernel data structures, and we keep the helpers granular. How are we going to do that? We're going to talk to our security engineers, ask them exactly what information they need, and then build our API on top of that. Keep things simple; no broad helpers here. Then, if you look at the overall structure, it's what I was telling you about. You'll have multiple LSM hooks, you'll have eBPF programs attached to them, and there will be some user space utility that loads your eBPF programs. This could be an agent or a daemon. Your eBPF programs can then write auditing data to a buffer; this is currently a perf event buffer, and I have a discussion slide on what this should and could be eventually. Then your user space daemon can ship all these logs to your detection pipeline or whatever. You also have more options here to filter data, because if you're generating large volumes of data from each individual endpoint, you're going to overwhelm your detection pipeline and your response phase. Key thing to note is that... OK, I think I missed my key thing to note here.
Anyway, the key thing to note here is that it's not just for workstations, and it's not just for data centers. We want to do detection on both; bad things can happen in both places. It's less likely in prod, but we are targeting our endpoints as well. OK, I'm going to go into some of the other alternatives available here; I alluded to them earlier in the presentation. In audit, the mitigation needs to be handled separately, so you add your policies and your rules elsewhere. There is some performance overhead if you enable audit by default. I did some very crude benchmarking of that, but I'm sure you are a more experienced audience with this part of the system. It also has rigid formatting constraints: you get strings back, you need to update your user space application to parse those strings, and your user space data structures need to keep changing. When you're talking about security and detection, the faster you can respond to a threat, the better off you are, and this limits that aspect of the detection and response loop. Why not seccomp and BPF? I have an extra "e" on the slide there. The LSM, as I said, maps better to security behaviors. On syscalls: as we were discussing earlier, I didn't even know whether the syscall footprint of Linux was still increasing at all, but it seems we are talking about adding new syscalls. If that happens, you need to update all your seccomp logic to incorporate those new syscalls. Maybe you could also deny everything by default, but that's going to make your developers very angry. So it's the balance between APIs and security behaviors here. This next point is something I'm not very sure of: I think the previous talk mentioned something about there not being a race condition when reading user space arguments.
But from my understanding, there was some sort of race there; maybe this has gotten better in recent kernel versions, so take that with a grain of salt. We also talked about the current solution we have, which is based on kprobes. If you use kprobes and eBPF, it's very flexible: you can hook practically any function in the kernel and build your logic. But then you are, again, adding dependencies on kernel data structures. You have to keep recompiling; the backward compatibility, the software lifecycle aspect of that solution, is difficult. Also, kprobes is not a stable API; the function you hook might just disappear. You also have to ask: is preemption disabled or not, are IRQs disabled or not, am I holding some weird locks that I need to take care of at this point? That's something the LSM sort of guarantees in its hooks. There is also this thing called Landlock, which I gave a brief look at; if somebody from Landlock is here, we should talk. From what I could gauge, it is geared toward security sandboxing for unprivileged processes. Whereas we, and I think I had a slide on this before but removed it because it was taking too much space, intend to run KRSI privileged, at least in the very beginning. Loading privileged BPF programs requires CAP_SYS_ADMIN; we will respect that. Modifying LSM policies requires CAP_SYS_ADMIN; we'll respect that. So it's essentially root. There is no unprivileged sandboxing in KRSI yet, while we are initially beginning this project. But again, this is a summary of the comparison: why not Landlock, and why we want to do this. So now we're going to do a case study. The thing I'm going to talk about, you might have guessed, is environment variables. What do you want to do? You want to audit the environment variables on process execution.
A security engineer would tell me that should be easy: you just write a bunch of code and dump the environment variables out. But it's hard. The environment variables can be 32 pages long. I was myself not aware of this limitation, but when I looked at the linux_binprm struct, the index into it is bounded by MAX_ARG_PAGES, which is 32. This was surprising. So this is where you talk to your security engineers: don't tell me you need all the environment variables; tell me exactly what you need, and you structure your API around that. The first possibility is: give me all of them. The second possibility is: give me exactly the value of LD_PRELOAD. Then you leave it up to the user how much buffer memory to allocate and what to do in the case of an overflow, and you're relaxing that limitation of 32 pages. Note that these two helpers, by the way, have an issue: they can cause the code to sleep, because the environment variables are in user space, and paging in user memory can sleep. BPF doesn't like sleeping; it doesn't sleep at all. So this is something we have to keep in mind. There's work going on for this; I'll present it in the discussion at the end. There's also a design question here around flexibility of data format, and this is pretty cool. We talked about audit not giving you much flexibility. Here, the user is writing the BPF program. By user I mean the developer, the person making the security product; I'm clarifying that here. So they choose their data format, and they can use the same data format in their user space application or security product, or they can use something different. The choice is up to the user, and the kernel doesn't have to worry about the data format at all. The kernel just worries about the APIs it exposes to get the data out, and that's it.
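The second possibility, a helper that returns exactly one variable's value into a caller-supplied buffer, can be sketched in userspace C. This is a hedged illustration, not the real helper: the name `krsi_get_env_value`, the return convention, and the error codes are my own assumptions; the point is that the caller chooses the buffer size and learns explicitly about overflow.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Illustrative sketch: look up one variable (e.g. "LD_PRELOAD") in an
 * envp-style array and copy its value into a caller-supplied buffer.
 * Returns the value length on success, -ENOENT if the variable is
 * absent, and -E2BIG if the buffer is too small, so the caller
 * decides how to handle overflow instead of the kernel deciding. */
static long krsi_get_env_value(char *const envp[], const char *key,
                               char *buf, size_t buf_sz)
{
    size_t key_len = strlen(key);

    for (size_t i = 0; envp[i] != NULL; i++) {
        /* Each entry looks like "KEY=value". */
        if (strncmp(envp[i], key, key_len) == 0 && envp[i][key_len] == '=') {
            const char *val = envp[i] + key_len + 1;
            size_t val_len = strlen(val);

            if (val_len + 1 > buf_sz)
                return -E2BIG;  /* caller's buffer is too small */
            memcpy(buf, val, val_len + 1);
            return (long)val_len;
        }
    }
    return -ENOENT;
}
```

Compared with "give me all of them", this shape lets the caller bound memory up front, which is exactly the kind of granularity the security engineers were asking for.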
And user space can choose whatever weird formatting it wants on top of that. OK, now I come to the tricky part. I would say KRSI is very new and these other systems are very well developed, so again, this is a very rough ballpark comparison; but it's a check on whether what we're doing is on the right track. If KRSI were taking very many cycles, if it had very high overhead, this is where we would have stopped, or we would have looked for other alternatives. But as it turns out, it doesn't; and there's some salt for you, each of you can take a grain of it. The workload we used to benchmark this: there's a simple no-op binary that does basically nothing, and you execute it 100 times. You take the average of those 100 executions, and that is one point of measurement; then you do that n times and plot a distribution of all the times you get back. Also, we tried to quiesce the testing environment as much as possible, by pinning with taskset and so on, so that scheduler overhead doesn't matter. So, just to make you aware of the testing environment. First, the vanilla system: no audit, no KRSI, nothing going on. Please note that the x-axis starts at 500; I'm not playing any marketing trickery here. It peaks at around 570 microseconds per execution. You enable audit, with no rules: it moves slightly, but it's not too bad. Execution is an expensive event anyway, and it only moves slightly. You enable audit with an execve rule, asking for the execve syscall to be audited: boom. The distribution becomes less predictable, and there's a lot of overhead. Again, to put it out there, it's not exactly the same information being logged; with KRSI I'm just reading the environment variables and a few process parameters. But it's also a testimony to the specificity you can add when writing your eBPF helpers.
You choose the data you want to audit, and then there's a small overhead, about 533 microseconds. I kept adding more information, copying more strings, the TGID, the PID, the UID and so on, and the distribution still stayed the same, as long as I was not doing something non-trivially expensive like copying all the environment variables. So the performance doesn't look bad. It could be very different based on the helper you're using, but we have just one in our prototype for now. OK, why is this slide not moving forward? So that's how it looks in comparison, by the way, as overall distributions. The key takeaway, though, is that the main motivation is the hygiene around the signaling and mitigation parts, but there is the added benefit that it's likely going to perform better. There's no guarantee of that, but that's the initial signal. OK, now comes the demo. So, I saw somebody earlier do an excellent job with a screencast, and I was going to do the demo live, but my hands are a bit shaky right now, so it's good that I did a screencast while I was sitting down. I start a VM on my machine, then I SSH into that VM, and I start the KRSI helper. This is in the prototype code in samples/bpf. And I look at the hook: you can cat the file, and it just gives you the name of the hook. Then you run a process, and this is the output you get back, essentially; this is what I was logging. If you see, I was initially logging just the value of the environment variable, but then I started adding more information, and that's where performance stayed the same. OK, this is an empty slide. So, what's up for discussion? What's still up to the community to discuss? Use of the perf ring buffer: the first issue is that it's called perf.
So it sounds like it should be only for performance stuff, but it is actually used in other places in the kernel; its scope has gone beyond performance measurement. BPF already has a helper function called bpf_perf_event_output(), and we're just using that, and it works. It's quite fast, but then there is the usual trade-off of per-CPU buffers versus a single buffer with added synchronization. These sorts of design decisions are yet to be made. What do you do when there are 100 CPUs on the system? Is that going to lead to a large overhead in your auditing code, or something like that? Then, as I mentioned, sleepable eBPF: if you allow eBPF to sleep, the helpers become much simpler, and I'll allude back to the helper that I had. There is a discussion happening in the BPF microconference at Linux Plumbers about whether to allow this or not. The patch for it is quite trivial; the main thing is the ideological discussion about whether it should be allowed. The way we avoid it right now in the prototype code is by doing precomputation in the LSM hook: in the LSM hook, you copy all the envp and argv into kernel memory, and then you pass that to your BPF program in the context. This guarantees that the data will be available, but it's not free. You can also do smart stuff: you can have the LSM hook look into the BPF program, see if the environment-variable helper's opcode is there, and only do the precomputation then. But again, none of this would be required if BPF were allowed to sleep. That's about it. I'm going to leave the rest to questions, because I guess you have a few questions now. I'm sorry for being that person, but just a clarification on the talk you mentioned before, about seccomp. The API we devised with seccomp just allows you to check whether you lost or won the race, essentially.
So you get a cookie back when a syscall is performed. You open the file descriptor to /proc/pid/mem, whatever, and then you pass that cookie into another seccomp syscall, or an ioctl I think it is, and you basically ask whether that cookie is still valid. If the cookie is still valid, you know your open is correct; if the cookie is not valid anymore, then the task has died. The other thing some people might think of with seccomp syscall interception, and that's what's mentioned on the slide: you never effectively move on in the kernel. Once you hit an interception target, it goes to user space, and user space has to do all the work; the kernel will never continue. That means user space can safely duplicate the arguments of the call from the process, evaluate them, and then act on that copy. If the caller changes its memory after you've done the copy, you never evaluate the changed version, because you never tell the kernel to continue doing anything with the original pointer. Which also means, for example, that if an argument is also used as a return argument, it's on user space to fill in the return argument, because the kernel won't fill it in. OK, I mean, I was not really sure about that point anyway, so thank you. Can you come back to the Landlock slide, please? Which one, this one? Landlock? Oh yes, sure. Oh, awesome. So, just to be sure: Landlock is designed to be usable by unprivileged processes, but of course it can be used by privileged processes too. Basically, Landlock is a framework to help you design your own access control, but it could be used in the same way to audit any event you want. Of course, as for KRSI, you need appropriate hooks and appropriate helpers to have fine-grained audit or access control.
That is to say, I'm pretty sure what we did could be integrated really well with Landlock; I think it's quite close. Everything you talked about, why you didn't implement it with kprobes or other mechanisms: I ran through the same reasoning to create Landlock. Sure. And there is one issue with the current way you create the securityfs interface, the userspace API: right now, I think you have tied it to the LSM API. Yeah. I think that is a major blocker, because the LSM API is not meant to be public; it can change with every version of the kernel. So I was thinking about this: whether we should have a shadow layer on top of it, or whether we should expose the LSM API directly. Sorry, pardon. I personally felt that if you expose the LSM API, you can do some things in user space, you could have your own mapping in user space, but you need to make the LSM better and have the security behaviors built into the LSM. That feedback goes from the security community back into the LSM; then they build more programs on that, and it goes back into the LSM again. I feel this process could be beneficial, but I'm open to both: talking with you about the Landlock stuff, and about this particular design choice, because I've gotten the same review on a lot of the patch series. So, yeah. And about the helpers you want to create: you want to create a set of eBPF helpers, and that's exactly what I want to do with Landlock too. Landlock is focused on access control, but access control is very close to an audit system, so I'm pretty sure we can do something together, OK? OK, so you are implementing a minor LSM, right? That means you will need to rebuild the kernel in order to use it.
Have you considered any alternatives that wouldn't require rebuilding the kernel, like kprobes, which you've rejected, or maybe ftrace or anything of that sort? So, in this case, if the helper is already there in the kernel, you don't need to rebuild the kernel. If you're adding new detection logic, and the KRSI helpers and the API are already there, you don't need to rebuild the kernel: you attach your eBPF programs to your LSM hooks, and it just works without rebuilding the kernel. That's actually one of the important points I missed in my talk, but yes. And on the question of major versus minor LSMs: I haven't thought yet about whether we should allow access to security blobs from the BPF side. I would like to stay a bit away from that until the use case actually comes up, because BPF programs can do their own sort of state management with maps, if you want to build a state machine; but then you want to be careful about memory usage and whether it makes sense to build that in the kernel. Until the use case comes, we don't want to make a decision on that. OK, that looks like it. Thanks for the questions. Thank you. Thank you.