So I'm going to describe a problem we ran into recently. It became a relatively big issue pretty quickly, so we're working on some solutions. The only downside is I don't have any code today, but we'll probably have something soon, so stay tuned. I'll describe the problem and then a couple of proposals.

This is a chart that we sometimes share with people to explain what Tetragon does. Tetragon is a security tool that loads a bunch of kprobes and observes the system. Thinking in terms of the networking stack, what we can do is hook the sockets, which gives you all of your socket data: the lifetime of your TCP connections, all of your UDP data, multicast, all that fun stuff. Then we hook the TCP/UDP stack in a bunch of different places to collect TCP states, statistics, lost frames, dropped frames, latency histograms. Imagine you want to know whenever you see a packet in the 99th-percentile latency; you might hook there.

Then we hook the net device, and that's where we start to get into trouble. The net device is interesting because we can do packets per second, and we can pull stats out of the device, like we were talking about with the TX descriptor. You might want to know things like how many TSO packets have been sent or the average size of the packets being received; a bunch of network-device-specific things can come out of there. If you're a power user of your NIC, a lot of that is interesting to know just for the health of your network. Hooking the net device is actually okay on the host side, because if you're running the Cilium CNI or some other CNI, you're in that data path already, so you get all the data coming through the CNI.

The troubling case for us is the network namespace. Once you have statistics for your host network namespace, basically your routing framework and your control paths, the next request we got was: can we get a bunch of statistics inside the namespace itself? Cilium, or any traditional routing setup for Kubernetes, sits outside the pods, but the request is to get statistics about things inside the pod itself, inside the network namespace of the pod.

In general, that's not really a problem for most things. For application data, we have sk_msg and sk_skb programs that run on the socket, and some kprobes if we want to put them on sendmsg and things like that. You can put your BPF programs at the application level without a problem, because they're not network-namespaced: even if the application is running in a pod, when it does a send it'll hit one of those programs. Same with sockets: with sockops we have ways to say, regardless of the network namespace, give me my TCP state, so I can see every TCP state that goes into close and everything that's created, and build a nice map of all the sockets and TCP states in the system. No problem, and the same for UDP. We can do our kprobes and fentry; again, no problem, they work both outside and inside the network namespace. The L3 ingress and egress hooks are the same story: you can use them for UDP manipulation and things like that, and even UDP enforcement is useful there.
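As a concrete illustration of the namespace-agnostic hooks, here is a minimal sketch, not Tetragon's actual code: it assumes a BTF-enabled kernel with fentry support, and the choice of tcp_set_state as the hook point and the map layout are mine.

```c
// Minimal sketch: count TCP state transitions system-wide, independent of
// any network namespace. Build against vmlinux.h and libbpf.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 16);        /* comfortably > TCP_MAX_STATES */
    __type(key, __u32);
    __type(value, __u64);
} state_counts SEC(".maps");

/* fentry on tcp_set_state(sk, state): fires for every socket in every
 * netns, which is exactly why these hooks are unaffected by pods. */
SEC("fentry/tcp_set_state")
int BPF_PROG(count_state, struct sock *sk, int state)
{
    __u32 key = state;
    __u64 *cnt = bpf_map_lookup_elem(&state_counts, &key);

    if (cnt)
        (*cnt)++;
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```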
But then we get to L2, and this is where the crux of our problem is. How do you hook an SR-IOV device? We have customers doing this: assigning an SR-IOV device into the pod outside of the control plane of Kubernetes, putting it in the pod, and then saying, I want to know stuff about this thing. It's not in the host network namespace, and it's not in the Cilium CNI data path, but just like you told me you can get stats on everything else, they want stats on that too. Okay, so that's the problem.

There are a bunch of different ways we could solve this, and they're all really ugly and not very good. The first one is you can just jump into the network namespace and attach something to it, either TC or XDP. The trouble is you don't own the network namespace, so the owner of the pod can just delete it, or maybe they have their own thing attached there, so you get this kind of collision. That's the first problem. The other problem is knowing the device even exists from the control-plane side. As the observability and security platform, sitting outside the routing and the control plane, we don't always get events just because somebody created a pod or created a network device inside that pod, so we don't actually know when we'd even want to attach. So either there's a gap where we poll and go, okay, they created a pod, let's attach to it, and now there's the race; or we play other funny games to figure it out, like hooking when net devices are created in the system, getting an alert, and then going and grabbing the device, but then you're still racing with the net device creation. If you want truly accurate results, none of that works very well. And again, you don't own the TC or XDP infrastructure, so you're fighting with the owner of the network namespace.

The second option is kprobes. They work great because they don't care about the network namespace, so they'll get called, but they're kind of slow. The third option is polling: every so often, go into the kernel, find the net device, and collect a bunch of statistics on it. Iterators would be helpful here, but you can also just kprobe a hook that has a pointer to the net device and pull the stats out.

Okay, so none of these are any good. The first one is insecure, so as a security product you're going to say that's not a good solution, but it's efficient, which is nice. The second one is secure, because the user can't rip out your hook, but it's slow. The last one is efficient and secure, but since you're not in the data path and you're polling, if you're trying to detect error cases or bursty traffic, or apply SLAs, you could miss them. You can't see them immediately; you'll most likely see them sometime in the future, but not right away.

So what do we want? What would be our ideal? We would like to be able to apply an XDP program to this thing. Our XDP programs don't care about the net device per se: we have one program that we attach to all net devices, not a different program per device. We can learn what we're attached to just by looking at the data coming through the XDP program, so that's not a problem.
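For reference, the first option looks roughly like this from user space. This is a minimal sketch under several assumptions: libbpf 1.x, a pod netns exposed at the hypothetical path /var/run/netns/pod, a hypothetical VF name eth1 inside the pod, and prog_fd being an already-loaded XDP program; error handling is trimmed.

```c
// Minimal sketch of "jump into the netns and attach XDP" (option one).
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <net/if.h>
#include <unistd.h>
#include <bpf/bpf.h>

int attach_in_pod_netns(int prog_fd)
{
    int ns_fd = open("/var/run/netns/pod", O_RDONLY); /* hypothetical path */
    if (ns_fd < 0)
        return -1;

    /* Enter the pod's network namespace so that if_nametoindex() and the
     * attach below resolve against the pod's devices, not the host's. */
    if (setns(ns_fd, CLONE_NEWNET) < 0) {
        close(ns_fd);
        return -1;
    }
    close(ns_fd);

    int ifindex = if_nametoindex("eth1"); /* hypothetical VF name */

    /* Attach as a BPF link rather than a plain netlink attach: without
     * the link fd, the pod owner cannot simply replace the program. */
    return bpf_link_create(prog_fd, ifindex, BPF_XDP, NULL);
}
```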
We don't want the network namespace user to own our XDP program, to be able to manipulate it, for security reasons; and we like the L3 ingress/egress hook semantics, so if we had something like that at XDP it might be useful. The goal is to get something that's efficient and secure.

Just as an aside, some of the people coming to us have tried to use the hardware features. Your NIC, if it's in SR-IOV mode, will have hardware features that are there for security reasons but also for L2 bridging and sometimes routing at the L3 layer. You can do things like put in filters for IP addresses and filters for multicast, but this hardware is never flexible enough to provide all the things we'd want: burst detection, histograms, summaries, some of these things we've been talking about. We just don't have the ability to program it to get the data we'd like, so we throw that option out first.

Did I go backwards? I went backwards, I think, yeah.

Ah, so the first thought we had is: can we just do a BPF XDP link attach from BPF? When we get the net device create hook, can we, with a kfunc, attach a program to XDP from the BPF side? What that would mean is we'd need to know where the net device is, so we need some way to get a list of net devices; if we're in that create hook we have that, so not a problem. We need some way to get the program link; BPF programs are just some kind of pointer to it, I'm not sure. And we want some way to pin it so that other folks can't delete it from underneath us. I'll pause there for a second. Anybody object? Are we okay with having BPF programs create attached XDP things?

So this isn't necessarily a vNIC; it's a physical NIC. The question was whether there's an issue with using XDP with veth. We would probably never recommend using veth with XDP in this kind of scenario, because what's really happening is they're assigning physical devices or virtual functions into the namespace, mostly for low-latency things.

Okay, so if I understood correctly, the SR-IOV NIC is in the namespace and you're attaching to that? Yeah, because it's inside the namespace of whatever this pod is, basically. Perfect, thank you.

That's one idea. Effectively, what this XDP link attach would do is rebuild the XDP dispatcher to act on the net device ID. An alternative you can think of is directly manipulating the dispatcher. Directly manipulating what, the last one? The XDP dispatcher, because there is only one place in the kernel where all of this stuff is called, and the way it demuxes is: this SR-IOV device still has a driver, say Mellanox; the driver is called, and then it checks, oh, if the net device is this one, then I'm calling this program, otherwise not. So what this attach would do is establish the XDP dispatcher there, populate it, and generate it with the ID of that device. So it's kind of the same, but a more hackish way: directly interfacing with the XDP dispatcher. Of course there are plenty of pluses and minuses.

What I'm thinking is, the driver needs to set up XDP, right? And I'm not sure how the dispatcher comes into play with that. I'd need to call the ndo op on the driver just so it configures the driver to be XDP-enabled. Yes, right.
So the driver will do everything, but at the end, whether your program is called or not: you can do that setup on another SR-IOV function, from the host. You want to control this from the host, and enabling the driver to do XDP is something you can do from the host. The only thing you'd be missing from the host is making sure your program is called. Anyway, this is the hack; it's probably a bad idea.

But the other advantage of having this done from a BPF program directly is that the BPF program can hook the kernel where the net device is created. That way I don't have to figure out how to get into the control plane of the customer that's doing this. Having a kfunc do it means I sidestep the entire question of, please send me an API request that tells me you're going to create a net device, please don't use your net device until I've attached to it, and that whole handoff. Because if I can get in there before the net device is actually online, by hooking a kprobe or fentry, I'm set.

Just saying it differently, I'm not suggesting anything else: you have this XDP kfunc, but imagine the link attach doesn't have the first argument, it just has the ifindex. Because you're already in a kprobe or fentry in some way, and when you see that a device was created, you just pass that ifindex. Get rid of this one and just... both of them, potentially. Yeah, and then you just have one program that's called for all net devices? Anyway, we should take it offline. I think, like, having a patch...

I actually have a question about this dispatcher. It's one dispatcher for everything. So should we just create a program type, called XDP Global or XDP Universal, that will always be called for any packet, on any network device? The problem with the kfunc is that you get a link, and now you need to store it somewhere so it's not automatically detached; there's just so much complexity there. While what you actually want is to ignore any specific net device and just be called on every packet, I assume. So maybe it's just a global XDP program. If we have to reuse the same dispatcher, sure; or maybe we create another dispatcher which is called, you know, we call both, and a static key if no one attached, so you don't pay anything, stuff like this. But basically it's XDP, just not net-device-specific.

I mean, that would solve this problem, right? And then you wouldn't have to have a kfunc, really. You could do that from user space; that would be another way.

My question would be: on the XDP side, you'd only solve half of your traffic, right? For the traffic going out, you'd still need the TX hook that we talked about earlier. Yeah, I added this as we were talking. So either we do the same thing for the TX side, where you have an attach for it, or maybe you just have a global TX hook that runs. Either way, I think it would solve the problem for sure. That's why I was kind of excited about that talk: once you have that TX hook there, I can have a global TX hook, and then I don't have to worry about TC or whatever. And if you have a new program type, then you can have an expected attach type and just specify which direction. Yeah. Okay.
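To make the shape of the kfunc proposal concrete, here is a purely hypothetical sketch. Nothing below exists today: bpf_xdp_link_attach_ifindex() is an invented kfunc name and signature, illustrating the just-pass-the-ifindex variant discussed above.

```c
// Purely hypothetical: attach our one shared XDP program to a device the
// moment it is registered, before the pod can bring it online.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* INVENTED kfunc, does not exist: attach the pinned observer program
 * to the given ifindex and keep the resulting link kernel-owned. */
extern int bpf_xdp_link_attach_ifindex(int ifindex) __ksym;

/* fexit rather than fentry, so the ifindex has been assigned already */
SEC("fexit/register_netdevice")
int BPF_PROG(on_netdev_create, struct net_device *dev, int ret)
{
    if (ret == 0)
        bpf_xdp_link_attach_ifindex(dev->ifindex);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```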
I guess there's no xdp_md on the TX side either, so we have to figure that out. Maybe it's not XDP, maybe it's something else, but you know.

Anyway, one other point: before, you were saying that even if you attach through TC or XDP, the owner of that network namespace can detach you. I don't think that's the case with a BPF link; whoever owns the link maintains that slot. Right, which is why I went with the link attach. But even before that, you were saying: if from outside you attach to every network device using link create, then no, the owner cannot just detach it. They'd need the fd to the link to actually detach it, or they'd have to be CAP_SYS_ADMIN, of course. Okay, so if I jump into the namespace, attach the XDP program, and the link is kept on a filesystem outside that they don't own, that should be good. Yeah, that makes sense.

So that's the problem, and a couple of ideas. And then we have the same thing for TC as well; the same ideas would apply. But if we had an XDP hook on the egress side, we wouldn't need the TC hook, which is nice, because if we get that in, then we unify how we do this. Otherwise we'd have the same problem on the TC side, and we don't have a link at TC yet; we'd need to know when the qdisc is created and somehow attach to it with a link, and so on and so forth. I think the nicer thing overall is to just have this, whatever we want to call it, egress hook on the net device inside the driver, because that's where the stats we actually care about are. There was a kind of mirror proposal for TC, but I think if we go with the XDP side, we just skip all that and life is good.

So that's the gist, and the summary would be: we want BPF programs to be able to load and attach XDP, TX, and TCX programs, or alternatively have a global program that runs for everything. I think those are the two possible solutions. Just as an aside, being able to walk the net devices in a netns from BPF programs via fentry is pretty useful for us for statistics reasons, so hopefully we'll see that soon.

And XDP is pinned, but if we do the TCX we'd want pinning too, right? So we can do a link; we can do something similar from the TCX side. I meant link; a BPF link is what I meant, sorry. Which is kind of pinned, right? A very similar notion. Right now, in BPF land, pinning is when you have a link, map, or program and you expose it as a file in BPF FS. Just creating a link, you're attaching it, you're not pinning it. If you close that fd, then it's automatically detached. Nothing is pinned unless you actually pin it in BPF FS. It's just confusing; I understand what you're saying. So we would link and pin, usually, both things, right? Yeah, if you create a link and then pin it, then you can die and the program will stay attached. In Tetragon we always link and we always pin, just so that if we crash and come back up, we're good, sure.

Well, actually, I don't think any of the stuff we just discussed will work. The pod can keep bringing the whole device up and down, right? It will go through the whole teardown and everything will be removed. Yes. And then the XDP program should come back on the up. No. Even if you have a link?
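The link-then-pin pattern, as a minimal libbpf sketch from user space. Assumptions: the obs.skel.h skeleton, the program name observe, the device name eth1, and the pin path are all placeholders, not Tetragon's real names.

```c
// Minimal sketch of the "always link, always pin" pattern with libbpf.
#include <net/if.h>
#include <bpf/libbpf.h>
#include "obs.skel.h" /* hypothetical bpftool-generated skeleton */

int attach_and_pin(void)
{
    struct obs_bpf *skel = obs_bpf__open_and_load();
    if (!skel)
        return -1;

    /* Attach as a BPF link, not a plain netlink XDP attach. */
    struct bpf_link *link =
        bpf_program__attach_xdp(skel->progs.observe, if_nametoindex("eth1"));
    if (!link)
        return -1;

    /* Pin the link in BPF FS: the attachment now survives this process
     * exiting or crashing, and detaching requires access to this path. */
    return bpf_link__pin(link, "/sys/fs/bpf/obs_xdp_link");
}
```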
Yeah, so I think you really need notifiers at various levels that the host can check to do things. But like he already mentioned, one of the approaches is to have a netdev notifier and then do stuff; just make sure it's not racy. Right, that feels way cleaner.

Back to the link and the pinning: why would your link get detached on the up/down? That seems... So for cgroup programs, cgroup BPF links, and for XDP BPF links, we subscribe to the cgroup going away, or the net device going away, and automatically detach. Because otherwise we'd have a kind of circular dependency, and also we intentionally didn't want to keep the net device from being destroyed just because someone attached an XDP program. That was by design.

But a link? Okay, interesting. So the way it actually happens is, you still have the fd to the link, because we can't destroy the kernel object from under you as long as you have the file open, but that link will be in a detached state, a defunct state or something like that. If you query it, it will say it has program ID zero under it. Which is why we have the XDP attach logic. No, no, it's orthogonal. Okay. If there is no device to observe, then you cannot have a BPF program attached anymore, and we chose not to hold the device just because a program is attached. So if the device goes away and you attached a BPF program directly, we detach it automatically, and it's similar for links.

But a net device up/down probably should keep the XDP program, if it doesn't, because an up/down is not the same as a release. I don't think it detaches on up/down, even for a direct program attach. We can check, but it would be problematic if even somebody briefly bouncing the link pulled your program out, right? It does keep it when you up/down the link. Thank you. Both the link and the program attachment? A regular netlink attach keeps the program across up/down; I assume the link is the same. Yeah, then link should work. Okay, okay. I think the link is only detached when you really unregister the device. Yes. Yeah, I recently looked into this for the TC side.

That would work, because we already hook register and unregister for this reason, with kprobes; we have a kprobe on register and unregister to track what devices are in the system. I guess what I had in mind is, if someone destroys the device and then recreates it, it's kind of a new device, right? So, okay, then it's all good. Thanks.

Dumb question: I think we also have the tracepoints. Did you look into those? You mentioned kprobes are slow, but... Are there tracepoints in the... In the xmit path, I think we do have some, yeah. We didn't; I didn't. I think more just out of a default bias toward kprobes and fentry stuff. I can see why you'd prefer fentry for speed, but tracepoints should be faster than a kprobe. And fentry would be faster than tracepoints? Yes, more or less, I guess, yeah. We're talking about raw tracepoints; normal tracepoints I don't know, but raw tracepoints. So we'd prefer fentry and then fall back to kprobes if we're on, say, a 4.14 kernel or something, 4.19 maybe. But raw tracepoints have also been supported since forever.
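For the tracepoint option, a minimal sketch using the net_dev_xmit tracepoint attached as a raw tracepoint, counting transmitted bytes per device. The map sizing is arbitrary, and CO-RE is assumed for the pointer read.

```c
// Minimal sketch: per-ifindex TX byte counters via the net_dev_xmit
// tracepoint, attached as a raw tracepoint.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);      /* arbitrary sizing */
    __type(key, __u32);             /* ifindex */
    __type(value, __u64);           /* bytes transmitted */
} tx_bytes SEC(".maps");

/* net_dev_xmit args: (struct sk_buff *skb, int rc,
 *                     struct net_device *dev, unsigned int len) */
SEC("raw_tp/net_dev_xmit")
int BPF_PROG(on_xmit, struct sk_buff *skb, int rc,
             struct net_device *dev, unsigned int len)
{
    __u32 ifindex = BPF_CORE_READ(dev, ifindex);
    __u64 add = len, *val;

    val = bpf_map_lookup_elem(&tx_bytes, &ifindex);
    if (val)
        __sync_fetch_and_add(val, add);
    else
        bpf_map_update_elem(&tx_bytes, &ifindex, &add, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```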
So as long as you have that tracepoint, you can attach to it as a raw tracepoint. The ideal would be to go from fentry to raw tracepoints to kprobes if we have to fall backwards, but perfect. Cool, thank you. Cool, thank you, yeah.