Good morning, everyone. My name is Song Liu. I work for Facebook, aka Meta. Today I want to talk about the possibility of debugging with BPF, like bpftrace, within a container; that is, not globally, without global permission. So we all know debugging with BPF is great; it enables great tools. But the key is, you need CAP_BPF, or CAP_SYS_ADMIN, CAP_NET_ADMIN, and more. And CAP_BPF is not secure, not really secure or not secure at all, which means we cannot really grant it in containers. So the question is: is it possible to find a sweet spot where we keep it secure but still keep it useful, at least for a large portion of the use cases? Here are some ideas. I think what we can really do right now is pre-define the tools: you use either setcap or some pinned program to say, I know this tool is secure, let's allow this BPF on the system. But obviously the problem is that it's not flexible enough, and you only have certain tools. Yes, please. Hello. So you said tool writers define secure BPF programs for non-root users. How do they define that? Is it by a hash? Or is it by certain programs? Which line? The first point you were mentioning here. Is it something you've already tried, or just an option you haven't tried yet? I think that's what we currently do with other tools that use special capabilities. For example, passwd: it has the capability to change the password for a non-root user; you can change your own password. So we trust that the tool itself is well written and give the capability to that tool. This is like an exec policy of some sort, I guess. How is this implemented, "define secure BPF programs for non-root users"? I mean, the tool writer writes something that does certain very well-defined, pre-defined things with BPF, and you make sure that is secure, and you allow non-root users to run exactly that program.
Okay, okay, yeah, that's... sorry, this is probably not clear. But that's not the key point, hopefully. Okay, I mean, the answer, I guess, is that it's not flexible enough, so let's not bother implementing it. Oh, maybe it's still useful in some cases, but that's not the key thing I want to discuss today. We can take that offline. Yes, thank you. Okay, so the second idea, the main idea I have for this discussion, is: can we do some mandatory filtering of the BPF program based on ownership? We'll get to ownership later. So if you have a non-root user's program, it will only trigger on the events that you own. For example, the ownership could be your current task: if you are in process context and you trigger an event on your own task, you can see it. If you trigger an event on some other task, the kernel will filter that for you, so you will not see it. Another thing is perf events: a non-root user can open a task-based perf event, and if you have an FD to a perf event, you can attach a BPF program to it, and with some other limitations, we think that's relatively safe for a non-root user. Also, if you have ownership of a socket, you can probably have a tracing event on it. And another idea is whether we can have some security enhancement for maps. For example, with task local storage, a non-root user would only access entries for their own task. Yes, good question, Brendan? Sorry, the second one, mandatory filtering based on ownership: that will mostly work. There's asynchronous stuff like IO completions, where you don't own the IO completion but you want to trace it. I've dealt with this a lot, and I think we can make it work.
So, for example, if you can only see events in a context that you own, and you can't see, for example, IO completions, you just have to walk up the stack until you own the context. So if you go up to the syscall level, then you own the completions. It would be annoying for me to deal with that for, say, container users, but I could make it work. It would break a whole heap of tools, right, if you said you can't see all this async stuff that you don't own, but my point is I think there are workarounds in many cases. Yeah, I got the same feedback last night on the flight here. So let me go over that in a bit of detail. I did some homework. By no means is this the whole picture of BPF use cases, but this is my homework, and I tried not to be biased about it. I looked at 41 tools in BCC's libbpf-tools. 24 tools just filter on the current task in process context, so I think we can easily cover those if we have current-task-based filtering enforced on BPF programs created by non-root users. Three of these tools use the start/end model Brendan just mentioned: you start tracing something in your process context, and the end actually happens in IRQ context, so you cannot use the current task to find the ownership. One example is tracing IO latency: you start the read, and you want the exact latency, but the completion happens in IRQ or softirq context. Interestingly, there are more tools using context switch than the start/end model; sched_switch is really useful and probably worth some special attention to make it work. Three tools use perf events, and I didn't pay much attention to those; I assume if you only open perf events for your own task, it's 90% working. Four tools attach to sockets, with TCP connect or something. And the last three tools really run in IRQ or softirq context, and those are probably the trickiest to handle.
So we start with the current task and perf events. Basically, if we implement a filter based on the current task, that is good for process context. And if the non-root user can create a task-based perf event, we can use that to get part of the features done. That means we cover the 24 plus three tools here. Then the start/end model. What Brendan mentioned is one of the easier approaches than what I have here, I guess. But if you really want the context in IRQ context, we can have the start program, which is filtered based on the current task, add a key, the bio or the skb, to a hash map. I think we probably want a BTF-enabled map, or maybe even a referenced pointer in the map in this case, but that's something I'm not quite sure about. The rough idea is to use this hash map as the filter: we populate elements into the hash map in the start program, and in the end program, which runs in IRQ context, we use the hash map as the filter. If the data coming into the end program is not in this map, we just skip it. And of course, once we use a key in the end program, we need to free it from the map. Yeah, that'll sometimes work. For things like IO, you've got IO merges, and all sorts of things can happen before the IO is issued, and determining who all the owners are is a mess. I think it will work for some things like scheduler wakeups, because I want to know who woke me up, and that's firing in someone else's context, but it's pretty clean. So I think there are some places this will work, and then some things like networking and IO, which do merges between requests from different owners, where this gets a bit messy. Yeah, that's very true. I have another question in this context. Would it make sense...
I mean, just thinking out loud, maybe it's completely stupid, but to abstract that ownership model; for example, a program could have multiple owners, right? Would it make sense to abstract that to, I don't know, something else like an identifier that you would filter on, or a bitmap or something like that? So I'm not quite sure I follow. The picture I have here is bpftrace: basically, you run something for a short period of time, and the user starts it in the container. So I don't see multiple owners of a program as a common case here. I'm not sure whether that answers the question. Oh, it's fine, okay. Yes, we actually have other use cases where we have multiple owners of one program, but we're probably going to resolve that in different ways. Okay, just pretend I solved this. So, for this one, the start/end model. This is still you describing how the tool, the program itself, can be written to be safe within a container, right? There is nothing automatic here; the kernel cannot do this kind of stuff for you, for an arbitrary program, to make it safe in a container. I have some ideas on that. But not for this case, right? So just in general: probe read. Tracing tools do probe reads, right? In the kernel there's no way a probe read can be filtered. You're talking about the whole tool running on the current task, sure, but then what's next? As soon as the program starts, it can probe-read an arbitrary address anyway. Yeah, I think probe read is probably one thing we want to disable for many use cases. Or, one thing I was thinking: we have, for example, a task pointer; if we use CO-RE to do the read, that kind of read is allowed.
But if you... or maybe we limit how deep we go with CO-RE. Out of those 24, if you go back to the previous slide, right, 24 tools filter on the current task: how many of those use bpf_core_read, some variant of that? I didn't track it, but not so many. But probably every single one, no? Not so many. Not so many, because, as Alexei said, once you do a probe read, all bets are off. You don't know if you're reading something from the kernel or from your task. So basically you're saying we should just disable probe read for such tools. Probably, if you do probe reads of random things. Or maybe you can probe-read only certain BTF pointers, or only certain types or something, so you somehow retain a little bit of ownership. Well, the problem is, if you do an actual BPF CO-RE read, the macro, the kernel doesn't know you're doing a CO-RE-enabled read; it's just a random value it's reading from. The only case where the kernel does know that you're reading something BTF-defined is when we do direct memory reads from fentry. That's the only case where we can, technically... Yeah, tp_btf. tp_btf? Yeah, tp_btf. Among those 24, I think maybe half of them are tp_btf. Other than tasks, do you support ownership such as for cgroups or groups of tasks? Yeah, I think that's definitely a way to do it. Actually, I see there's a talk later, or a discussion later, on cgroup-defined kprobes or something. In some cases, in our workload, we may have a lot of short-lived tasks residing within one cgroup, because the tasks are very small and have very short lifetimes. That may increase your overhead or map size if you store the tasks in the map. So I guess you want to support not only single tasks but also a group of tasks, like a cgroup or a task group, in that case. Just one comment.
Yeah, if we agree this is the way, that's definitely one way to improve it. So yeah, again, pretend I solved this. So, sockets. Sockets are a common use case, and I'll just throw out a random idea: root maintains a program that populates the sockets into a map, and we use that map as the filter for the actual user-started BPF program. And this is the question of whether we can use root to enforce that, or, as Alexei mentioned, we need to write the tool carefully, meaning we trade off the flexibility of the tools. There are some details we can discuss once we get sample code. Yeah, that's a different argument. My argument is that we can use some infrastructure to enforce this without writing the tool carefully. The other side of the argument is: no, you cannot do that, you cannot get that much flexibility, you have to write the tools carefully, and push toward more flexibility while offloading the security and safety checks to root on the programs. But still, the object ownership model and all that stuff doesn't prevent you from writing... of course, you'll have mitigations in place for side-channel attacks and such, but new things keep coming up, like the Spectre BHB stuff that came out a few months ago. So there is still a surface for users to craft a BPF program that has nothing to do with tasks, kernel constructs, or sockets; in this case, just plain vanilla BPF code that could be used to create side channels. So how do you... this doesn't solve that? I have no idea about that case; I need to learn about it. Jason. Could we, oh, that's really good, could we maybe solve this with signing, a capability model, and a policy or something? I mean, we're talking about signing later this afternoon, and I'm just wondering if that might be the solution to this problem. I mean, it's a solution to whatever it's a solution to. Great.
That's what we're here for. So, moving on. Yeah, more work, but... so next is sched_switch. I just have this idea. sched_switch has prev and next. If you have two programs, one handling prev and the other handling next, we can use that as the ownership check: if my program handles next, it only triggers on a next that is my own task, whether that's literally my own task or a task visible from my PID namespace; I'm not sure I understand PID namespaces very well. But I think by splitting this into two programs and enforcing the filtering for each of them, it sounds doable to me. So yeah, this covers more, which is a pretty good usefulness rate, but of course that's a lot of work. So how do we do the filtering? I had an idea: we can use the trampoline for that. Root, or even the kernel, loads a fentry program, and we have some mechanism to attach that to every non-root BPF program, and if the filter says we should skip this, we just skip it. Maybe I'll mention a use case that we had; I think it might be similar. We attach lots of kprobes, but we scope them in the very first couple of lines, well, we would like to scope them in the first couple of lines, by the pod, in Kubernetes speak, which would be the container cgroup. So it was basically a cgroup filter for kprobes or fentry that would run before you jump into the BPF program. Is that what you're proposing? This is actually what you described, and what Stan is doing: cgroup-based LSM hooks. It should be generalizable to not only LSM; LSM in this case is just an example. What I'm trying to say here is: how do you let root enforce that? So no matter what program the non-root user loads, this is an enforced check. It's not part of the generated code; it's not the code generated by non-root bpftrace. It feels close though. No.
It feels to me like you're in the same space and could probably use that work for this, if you generalize it to kprobes and tracepoints. And what is also interesting: I have another use case on top of that which has nothing to do with unprivileged debugging. It's actually a use case where we have a lot of kprobes, and all of our programs right now just have a switch at the very front of them, which we could then delete, right? So that instead of calling... imagine wanting to run a full strace type of thing, but only for your most paranoid users, right? You don't want to run that on everything in the system. So I think the main thing here is that the filtering bit is useful, right? This is why this is happening for the LSM stuff. But the non-root part is the most questionable aspect of this, right? Can we, should we do it for non-root? Are there any security concerns? The filtering we just need; there are no questions about that. Yeah, always. Yeah, to continue that thought: if it's for observability tools, there's a point where I just need to go do host tracing anyway. So you give me a filtered view of disk IO events or whatever, and to fully understand it I need to see the neighbors and how much I'm queueing on them, so I'm just going to go to the host, the container host, and instrument it anyway. It'll be nice to have some improvements to do some things within containers or as non-root, but ultimately, to get the big picture, I'm going to have to do root stuff anyway. Yeah, that is true. And actually, if we have a lot of users doing this, it can itself slow down the system. So I agree. There are a lot of problems we need to solve down the road, but my point is: this is super useful. We want to give it a try, to enable some use cases for people without root permission.
Yeah, I mean, I talk to developers all the time who are in containers, and they've got no way of even beginning to play with bpftrace and things like that. Even giving them a limited subset of syscalls and these tracepoints and a few others will give them a way to get started, write their first one-liners and programs, and then understand the value of getting full access. A different question, which you might have on a future slide: apart from filtering the event, what about filtering the arguments? So if I'm able to see sched_switch, am I able to see each of the arguments? Because some of them are for the other task, not my task, and so they should be filtered. You may have arguments on another slide. Yeah, this is actually filtering based on arguments. You get prev and next: one program can only access prev, and it is filtered based on prev; you need another program to access next, and that one is filtered based on next. Okay, so if I'm looking at the tracepoint for sched_switch, you're filtering the arguments that I can access. So somewhere there's a list saying you're allowed to access these. I think the verifier will take a look: if your program accesses prev, we will filter on prev; if you access next, we will filter on next. So where is this stored? Where would the metadata be: here are the tracepoints, and here are the arguments and members you're allowed to access, and here are the arguments and members you're not allowed to access? So the idea was that we have the verifier generate a matrix, or a set of flags: you do this, you do that. And when you try to attach the program, you use something to decide whether it's safe to do the attach. Yeah, I think there almost needs to be a database somewhere of: here are 200 tracepoints; these ones you can trace; for these ones you can access these arguments; everything else is blocked.
And that information is essentially what's needed if you want full access. But I think we can start with sched_switch to enable this, because given how popular that one is... there are so many other, not so popular, tracepoints we don't need. Also, for some of them, if we know a context-switch tool would use current to do the filtering, we don't need this; we only need this wherever we use the arguments as the filter. I mean, to Brendan's point about creating this list of tracepoints: we already have BTF ID lists in the kernel, and we could do some sort of pointer tagging to say this is accessible. But I just don't see exercising the unprivileged stuff until we fix the security stuff, and it's very hard to fix the security stuff there. I totally agree. On the security stuff, at least 80% of the time, whatever you say I'll take as my point. We could build some mitigations into the BPF program itself, especially for some Spectre-like attacks. Some more aggressive mitigations, right? Like, kind of, flush after... Yeah, actually, that just triggered another idea: instead of doing the fentry filtering, we can have the verifier insert the filtering into the program. Can we? I have this observation that whatever pre-canned capabilities we give to users, someone will come and say they need more flexibility. So this kind of trends in the other direction: let's pre-define what we can filter on. That will work in some cases, but there will always be users saying, oh, we need just this tiny tweak to this filtering, and all the stuff we do in the kernel will just not work for them. So how do we resolve this tension between flexibility and... that was usually our answer, right? Like, oh, you have these complicated, kind of hard-to-formalize requirements: use BPF code, you have control over the filtering and all this stuff, right?
Here we are moving in the other direction. We're saying let's pre-define what you can filter on: a task, or a cgroup, and stuff like that. It's kind of anti-BPF-y, right? You move this back into the kernel, right? The cgroup one is only partially working. Or you have something like what Björn did for XDP, where you maybe generate the filter through BPF in the kernel, but I don't know. Yeah. I guess a general question to the audience: how much do we believe that we can actually statically enforce that some program is contained to, say, reading data about the current task? Even then, let's do a mental exercise, right? You have this tp_btf program that has a task_struct type, and the verifier knows, with BTF information, that you can follow pointers and all that. At some point, task_struct has a pointer to another task_struct that you are not allowed to read. Can we prevent iterating from my current task to another task and then doing whatever I want? Do we even believe it's possible to enforce that you are not going to go outside of those current-task boundaries? Because I personally don't see how you can ever do that. And by the way, I checked the BPF tools: it seems like half of them right now, at least, use bpf_core_read directly, which for the kernel is reading random memory. It's half of the tools. If you disable bpf_core_read, can you even write a tool? In some partial cases, but... I think in general, this whole unprivileged stuff: with the latest Spectre BHB, the way the program is written there, it's not doing anything. It's like ten instructions, no jumps, nothing. And there is no practical way to detect that it's malicious; it's just not doing anything, not reading any kernel memory, so to say. I'm saying unprivileged is off until Intel fixes the CPUs.
We cannot, with software... software cannot work around hardware bugs. If this were possible, we would have had the Landlock LSM implemented with BPF, right? That was on the wish list for a lot of things, but practically, it's never going to be true. So the only thing, what you said on the first slide, where you said this is infeasible, is the route we might have to take if we have to allow certain unprivileged processes to run BPF programs: pre-canned, signed, verified based on a policy, those sorts of programs. And the policy might actually be signature-based there, because, as Alexei said, it's a very vanilla, simple-instruction program that doesn't do anything fancy, so you can't have anything that says, okay, I won't allow map accesses or pointer accesses. It's just the way CPUs are built that you can trick them into doing stuff with just a few lines of assembly, right? So that's the root of the issue there. Oh, we can try for non-Intel, maybe. I mean, Alexei said Intel, but you can generalize: any modern processor that does speculative execution is going to suffer from this, and most modern processors do. Yeah, I wouldn't make it too restrictive. I mean, do you want a situation where strace is more powerful than unprivileged BPF for observability? Like, you can do more with strace; that would kind of be sad. We want some flexibility. My concern with signing programs and having things too canned is losing the flexibility of doing bpftrace one-liners. Oh, no, this is something we'll discuss in the signing session, right? This is where you trust the tool itself. Oh, for the tool itself, sorry, yeah. So we'll bring that up there: we want to allow dynamic generation, but then you move the trust boundary. That's where you said the tool should be nicely written, and then you verify that the tool can't be abused to generate these gadgets, right?
Yes, but I think if we go that direction, we need the tool to be really, really simple. You cannot say, I can make bpftrace secure; it's just too large a body of code. I think the idea here is that we trust bpftrace to verify that some script is safe in one way or another, like signed, for example, right? You can sign the text of the script, and if you trust bpftrace to verify the signature before bpftrace actually compiles it into code, then you have this chain of trust, basically: we trust bpftrace to trust the script. That's how it would work, probably. Right? Yeah. Well, that's one way. Sorry. Yeah, so, similarly, libbpf is a loader for BPF programs, right? So you can teach libbpf to verify signatures of the ELF file, and if you trust libbpf to verify this, then you can sign the ELF file, with all the CO-RE stuff, all the stuff before libbpf actually modifies the code. So it's again a trust chain; similar idea. And tools won't be simple, right? Tools will get complex, because tools allow you to do complex things. And is this a foolproof solution? When you sign a binary, do I actually trust the binary? It comes down to the whole thing, right: software is inherently buggy, somebody will find an exploit, somebody will fix it. But it's the best possible barrier, the highest possible barrier to entry, for saying: I know this is reasonably safe, so I sign it; that is the risk I'm willing to take. I'm not willing to take the risk of allowing any BPF program to run; I'm willing to take the risk of allowing these BPF programs generated by bpftrace, because I think they're reasonably safe. So nobody can say for sure, right? It depends on your threat model there. Yeah. I see the point.
A lot of the checks, the filtering, are easier to do in user space, or in the BPF program, before we get into the kernel. But the other side is: if the kernel can provide some security enhancement... the answer I hear is that it's impossible, or not worth the effort. So I think for true unprivileged use, with signatures and everything, ideally we, as a BPF community, would work with hardware vendors, with CPU vendors, with Intel and ARM and others, and actually ask them what we need to make it safe. It could be something like the endbr instruction that Intel implemented: we could do something like that for BPF and say, never execute a BPF program speculatively. That would just eliminate all the known attacks we've had so far; it just needs a little bit of extra work on the CPU side. I think you lose performance, yes. But they do have, on arm64, maybe to some degree, a way to disable speculation. I'm not sure how expensive that actually is, but... Well, I'm not saying disable speculation completely; the endbr instruction is like saying you can only jump there when you know it is the target of an indirect branch. I'm saying guarantee that there will be no side effects for this particular branch. I think on the performance argument, of course we want BPF programs to be fast, but it's a step towards allowing... so Brendan was saying it allows users to craft BPF programs and try them out, to see if they are safe. They'll be slow; they'll not be the fastest ones. Sometimes that is enough for me to debug something anyway, right? I don't need my program to be really fast in a simple setting. So maybe currently we could just do a barrier, like an MSR write to flush, an IBPB barrier, at every BPF program exit, or at every branch. It will be horribly slow, but sure. But the problem is you don't know whether that's actually enough or not.
I mean, we would have to, yeah. This is a whole class of problem. Yes. I just forgot one point, and that is: it's not just people playing around with this stuff and learning it. It's going to be all of the third-party companies who are coming up with observability agents and security agents and trying to get adoption. And then they find that people in container environments can't run them at all. So yeah, I think this is pretty important now that I think about it. Imagine all of those companies who are building stuff, agents they want users to run. So the question is, how do we enable this then? Alexei had an idea: we should work with the hardware vendors and figure out the right strategy for it. And it can't just be Intel, right? The endbr stuff: is there any equivalent on AMD, or on ARM, whatever? So maybe this is something BSC could help with. Yeah, I mean, I have bad ideas, like: don't run containers, everyone run on bare metal, and problem solved. Okay, you might get a lot of NAKs; let's move on. So I think, as I said, people have talked about this idea, not just me: we want a BPF LSM hook on BPF itself. So we have some program that decides whether some program can be loaded, whether some operation is okay for non-privileged users. And the other idea involves the verifier, because when we load the program, we really have no idea what it looks like. We let the verifier tell us what the program looks like, and we do the actual filtering at attach time: some things we cannot attach. For example, if someone is doing probe_read, maybe that rules out everything really useful. But that's the idea. This pretty much sounds like what I was talking about earlier, right? And what John wanted. Yeah, it's the same idea, exactly the same idea.
So, a little more on what this looks like. The key here is the default mode: if a non-privileged user loads a program, you need some gatekeeper fentry program to actually let the program run. If there's no gatekeeper fentry program attached, you just always skip it. That covers the case where, for whatever reason, root forgot to load the fentry program enforcing the security; we will just always skip the program. And the next pieces are the BPF LSM hooks on BPF, and also the verifier-generated attributes, which is probably just gathering existing information we already have in the verifier. And that's it. Thank you.