Yeah, so hello and welcome everyone. This session is about a generic multi-prog API and the first user for it, which is TC BPF. I call it TCX, and we want to implement links for it; this has been a longer-term goal. The second part of this session will be an update on the meta device, or, for those who haven't heard about it, what it is and why we want to add it to the kernel. So let's start with the generic multi-attach API. The goal is to have a generic, reusable multi-program management API that fits long term, given we see more and more projects using BPF in the wild, where a single attachment hook is in some cases not really adequate anymore. We basically want a way to attach multiple programs, to have the same look and feel for different consumers of this API, and to be able to express dependencies between programs. Back at the Linux Plumbers Conference last year I gave a talk about Cilium's BPF kernel data path revamp, and there's also a corresponding patch set on the mailing list if you want to take a look later on. The TLDR of that patch set was that we reworked the TC BPF layer: we added a new fast path and also a management API for it to make it FD-based, so it wouldn't use netlink and the qdisc and all of that. With that, it would also be the opportunity to add BPF links for this, because this has been missing for a long time. We want to add it because, from the Cilium perspective, we've seen some of our users run into cases where Cilium's BPF programs got ripped out underneath us by others by accident. So yeah. With this rework, I basically also added the attach/detach/query/link-creation API for TC, so that you can use the prog and link file descriptors together with a priority. 
Priority was used back then because it was the same as with TC, but the feedback, rightfully so, was: can we challenge the status quo, as Alexei mentioned yesterday, and come up with something better, because priorities are hard to use? Users don't really know what to pick, so they might all pick the same one, which is what happened in our case. Yeah. The other feedback, just as a side note, is why I call this layer TCX here, to name it slightly differently. In terms of alternative directions to express dependencies, one of the areas we looked into was systemd, because unit files already have before/after dependencies that you can express, keyed on the unit file name, to get a specific ordering. It would be nice for the BPF side to have an idea like that; this was one piece of feedback from the discussions on that patch set. And from talking to different people, for example Andrii from Meta, there are multiple cases where you have management daemons for BPF, so it would be super useful to be able to express the before/after dependency for this case. The initial design we converged on from that discussion was basically to add a couple of flags: before and after. And you should be able to specify an FD or an ID. Usually when daemons query, they get IDs back, so it would also be useful to be able to pass an ID to the kernel when you attach a program, to say you want it placed next to that given ID. This should work for programs and links: to toggle between FD and ID there is a BPF_F_ID flag, and to toggle between programs and links there is a BPF_F_LINK flag. On top of this, also first and last, so that you can basically specify: I have my DDoS mitigation. 
This should be guaranteed to be the first in the whole processing chain; or, I have my monitoring to see what kind of traffic, for example, is being pushed into the node, so this needs to go last. And you should be able to combine all of these as well. On top of this, the implementation covers the prog-based attach and detach API as well as the links. And from the query, we get a revision counter that we can also pass in when we attach something, so that you can assert that the internal state must be at that specific revision when you attach. Yeah, a question. Just a question on the first and last: is this a requirement, meaning the second one that tries to say it's first is guaranteed to fail, because they can't both be first? Yeah, exactly, it's first come, first served. And, as I will show later in the examples, you can also combine those: you can say it needs to be first, but it also needs to come before that specific program, so all those flags are combinable. And if you say first and last, then it would be the only single one. And so before doesn't mean immediately before, just anywhere before, is that what you mean? Or is it immediately before that specific program? OK. And as you said, you just answered my other question: if you say first and last, that means I must be the only one. Exactly, or fail if you can't meet that requirement. And the idea would be to have this whole thing as a layer, as an API, that all the other consumers, for example TC BPF, but also in the future XDP, various others, cgroups, can just use, and it would be the same for all of them. So basically it would look like this, and just to walk through some examples, this would be the simple case. Here is the implementation for libbpf with the link creation API. 
Let me just show this here. You have the specific program as an argument. There are two new sections for TC BPF: tcx/ingress and tcx/egress. You pass in an ifindex where you want this to attach, and then you have a flags and a relative object argument. If both of them are zero, it's just the append case: you append the program to the list of programs. If you say BPF_F_BEFORE and BPF_F_ID, and you specify the ID of a program that exists in that array, it makes sure it attaches right before that one; if that ID doesn't exist, it bails out with an error. So that's where you can assert the requirement, my intent, when I add this. Or, for example, you combine them, BPF_F_FIRST and BPF_F_BEFORE, and this time a program FD, and similar: now it will make sure that the program you attach, this one from the skeleton, will always be the first. So you said if it doesn't already exist, then you'd fail. My question is: is there a race condition? What happens if the thing detaches at the same time as you're attaching; is it non-deterministic depending on which one you process first? So that is serialized under the RTNL lock, so there's no such race condition. I didn't understand that. Maybe we can take it offline, but I didn't understand that. Yeah. Now what? Let's say A says I must go immediately before B. That completes. B then says I'd like to detach. What happens to A? So the first request is serialized, right? I understand it's serialized, but when B detaches, does that put A into an invalid state? Because A has to come immediately before B. No. That is just for the point in time when you attach. But if you're saying A must come before B, and you're not going to start up A unless B already exists, then you can still get into that state. 
So it seems like you're treating "before B" as a kind of permanent condition. It's not. It's just for that particular attachment: immediately before that as of now, and if something changes after that, that's fine. That's the idea, because otherwise it's very expensive to keep maintaining this invariant, right? I understand that answer. It sounds like the model then is: if it fails, it's because something must have just changed, and I kind of repeat attaching in a loop, reread the state, and see what I'm going to go before now, right? That's your intended model, right? If there's a failure, it means something hasn't hit steady state yet, and you can, you're not forced to, but you can keep retrying to say, well, then I want to be right before the next one that would have come after that. Even better, you can actually specify the expected revision of the attachment, and if anything changes, even something you don't care about, you still get a failure and can retry. So that's the expected_revision thing. Yeah, yeah. All right, another example: you go first plus before and then you specify a link there. In the interest of time I'll skip ahead, but I think it's quite flexible in terms of what you can express here. Then we have first and last combined, in which case it will be the only one. And yeah. Then, with the revision, I added a second API to libbpf, but we can discuss whether that makes sense from your point of view. I wanted to keep the other one simple in case people really don't need the revision thing. I mean, if we follow the API design of libbpf, I would keep the program and the ifindex as arguments and then put everything else into an opts struct. Yeah, that's fine. I can do that. 
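As a way to pin down the before/after/first/last semantics walked through above, here is a small userspace simulation in C. The flag and function names (F_BEFORE, chain_attach, and so on) are made up for illustration; this is a sketch of the intended ordering rules, not the kernel implementation or the real UAPI flag values.

```c
#include <errno.h>
#include <string.h>

#define MAX_PROGS 8

/* Flag names mirror the ones discussed in the talk; values are illustrative. */
enum {
	F_BEFORE = 1 << 0,
	F_AFTER  = 1 << 1,
	F_FIRST  = 1 << 2,
	F_LAST   = 1 << 3,
};

struct chain {
	int ids[MAX_PROGS]; /* attached program IDs in execution order */
	int cnt;
	int first_id;       /* holder of the exclusive first slot, 0 = free */
	int last_id;        /* holder of the exclusive last slot, 0 = free */
};

/*
 * Attach `id` according to `flags`, relative to `rel_id`.
 * Returns 0 on success, -ENOENT if rel_id is not attached,
 * -EBUSY if an exclusive first/last slot is already taken.
 */
static int chain_attach(struct chain *c, int id, int flags, int rel_id)
{
	int pos = c->cnt; /* default: append */

	if (c->cnt == MAX_PROGS)
		return -ENOSPC;
	if ((flags & F_FIRST) && c->first_id)
		return -EBUSY;
	if ((flags & F_LAST) && c->last_id)
		return -EBUSY;

	if (flags & (F_BEFORE | F_AFTER)) {
		if (rel_id) {
			int i, found = -1;

			for (i = 0; i < c->cnt; i++)
				if (c->ids[i] == rel_id)
					found = i;
			if (found < 0)
				return -ENOENT; /* dependency must exist at attach time */
			pos = (flags & F_BEFORE) ? found : found + 1;
		} else {
			/* rel_id == 0 degrades to plain prepend/append */
			pos = (flags & F_BEFORE) ? 0 : c->cnt;
		}
	}
	/* first/last pin the position; combined before/after is still validated */
	if (flags & F_FIRST)
		pos = 0;
	else if (flags & F_LAST)
		pos = c->cnt;
	/* keep previously pinned first/last entries pinned */
	if (c->first_id && !(flags & F_FIRST) && pos == 0)
		pos = 1;
	if (c->last_id && !(flags & F_LAST) && pos == c->cnt)
		pos = c->cnt - 1;

	memmove(&c->ids[pos + 1], &c->ids[pos],
		(c->cnt - pos) * sizeof(c->ids[0]));
	c->ids[pos] = id;
	c->cnt++;
	if (flags & F_FIRST)
		c->first_id = id;
	if (flags & F_LAST)
		c->last_id = id;
	return 0;
}
```

Note how a relative attach with a nonexistent target fails with -ENOENT rather than silently falling back; that is the assert-my-intent behavior described above.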
Then you just have one API, and the optional stuff goes into the opts struct. So if I understood that right, an advantage of the revision stuff is: if you're in a particular state and two apps try to attach at the same time, and they both say before the same one, there's a race over which one goes first, and the revision would prevent that too. So maybe revision is actually the more recommended one. I don't know; if you care about strict ordering, right? Okay. So that is basically where you can program it. In terms of the UAPI extensions for the query UAPI: for networking I added the target ifindex, and then there are two arrays we add. One is link_ids and the other link_attach_flags. When you dump the internal state, you get the prog_ids; they are always filled when something is attached. Then you get link_ids in addition, which hold the IDs of the links, as the name says; they are filled where a link exists, and otherwise zero if there's no link there. And then you have prog_attach_flags and link_attach_flags, which is quite flexible if you get prog-specific flags only in the future. In this case here it's first and last, those two flags that need to be permanent. So that is how it looks in terms of the flag extension. What I mentioned earlier, I can quickly skim over the attach/detach UAPI. That's quite similar: we add the ifindex for networking as well, then just the union for the relative FD and ID, and then the expected revision. So it's not that big of a change to the existing bpf_attr. For link creation, a lot of the parameters are already there in the generic common fields that we can reuse; the only thing is that here we need to add this in a link-specific section, in this case the TCX one, because we cannot add new fields at the end of it. 
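The expected-revision handshake just described can be sketched the same way. Again a hedged userspace simulation: attach, query_revision, and the use of -ESTALE are illustrative stand-ins, not the actual kernel interface.

```c
#include <errno.h>

#define MAX_PROGS 8

/* Simulation of the expected_revision semantics; not the kernel code. */
struct state {
	unsigned long long revision; /* bumped on every attach/detach */
	int progs[MAX_PROGS];
	int cnt;
};

/* expected_revision == 0 means "don't care", like leaving the field unset */
static int attach(struct state *s, int prog,
		  unsigned long long expected_revision)
{
	if (expected_revision && expected_revision != s->revision)
		return -ESTALE; /* chain changed since the caller queried */
	if (s->cnt == MAX_PROGS)
		return -ENOSPC;
	s->progs[s->cnt++] = prog;
	s->revision++;
	return 0;
}

static unsigned long long query_revision(const struct state *s)
{
	return s->revision;
}

/* the retry model from the discussion: query, recompute intent, re-attach */
static int attach_retry(struct state *s, int prog)
{
	for (;;) {
		unsigned long long rev = query_revision(s);
		/* a real caller would recompute relative placement here */
		int err = attach(s, prog, rev);

		if (err != -ESTALE)
			return err;
	}
}
```

The retry loop mirrors what was said above: on a revision conflict, re-query the state, recompute where you want to sit in the chain, and try again.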
But that's just what we have in bpf_attr today, so you have to extend it this way. For the internals, there is this bpf_mprog concept. It's basically an array of items that you iterate through; I made it an array instead of a list so that you get better cache locality. The internal state we store here is the prog pointer, the flags that need to be persistent, and the ID, in the link case the link ID. And you always have a pair, an A and a B array; when you do updates to the internal state, you populate the opposite array of the pair and then swap it out, so that you don't need to allocate memory on detach and detach cannot fail. So that's the implementation here. This is being added with TCX as the first user. The way we execute this in the fast path is to just walk the array: whenever a program returns the "next" verdict, we go to the next program, and at the end there's always a NULL entry so that we exit out of there. And if you look into sch_handle_ingress or sch_handle_egress, that is something I refactored a bit: you have the static key, and you run the TCX programs first, as you can see here. Then, if we have a verdict that doesn't stop the pipeline, we can directly go to the individual actions. Or, if you need to be able to collocate both the new-style TCX and the old-style TC programs, you can return the TCX "next" verdict from your TCX program, and then it will fall through into the clsact qdisc. Daniel, Toke has a question on that. Toke, go for it. Hello, can you hear me? Yeah, we can hear you. Cool. So about the multi-prog API: the first and last flags are permanent, right? Yeah. Why is that useful, and how do you resolve conflicts if two programs want them? I think, well, why is it useful? 
I mean, you kind of want to ensure in some of the cases, like DDoS prevention, that you don't accidentally put, I don't know, monitoring in front of it so that you would DDoS yourself. That's probably one of the reasons I could think of. Yeah. It just seems like someone will just grab it and lock everyone else out, and then you have the same problem again, just in reverse. Yeah. We were thinking this could be extended further: if you, for example, have a pure monitoring use case where you don't actually change the state of the SKB and you just want the inspection, maybe we could add another flag for read-only, and then you would also be able to attach around this first and last, as long as you don't change the internal state. That could be something to think about. Another option that we also considered for libxdp but haven't implemented yet is some kind of admin override, where you have an API to clear the flag if something is misbehaving, or to override what the program says for these persistent flags. I didn't cover it here, but for the replace case that should be an option. Okay, right. And the second question was: as far as I can tell, it's not possible to atomically replace multiple programs at once. Yeah, that's correct; you cannot. The only other option would be to expose the whole array, but when we discussed this, I think it makes things more complicated and error-prone as well, in my opinion. I'm kind of picking up on that first/last question that Toke asked. It's interesting because it ties into what Dave was asking about what after or before means. First and last are very different from after and before. And if we removed that, if we said first and last are also just a concept that applies at attach time, not always, wouldn't that make the API much more consistent, right? 
And you're always gonna have a conflict in this thing. Saying, oh, we need the API to mediate what's gonna be the first program or the last program, I think, is basically the wrong approach. You need to do this in your user-space management layer, where you say, well, the DDoS thing is the most important, so I should start that first and then the other things come up, or however you do it. So I kind of agree with Toke in saying that first and last are maybe a bit special. Okay, so two things here. First and last, yes, they are special, but they are important. Because, for example, when you think about LSMs, whoever runs first can reject everything, and if the correctness of the overall system depends on your BPF program being run first, that's a hard requirement; if that's not satisfied, you cannot start and function correctly, right? So in that sense, first and last are indeed different from before and after. But I'm wondering if people are actually confusing first and last with prepend and append: prepend/append versus really first and last. Because if you want to add something to be the first one as of right now, but you don't care if someone later attaches before you, then you just specify before with object zero or something like this, right? And that means prepend. Similarly for append, you say after with zero; zero means no FD, and that means just after or before anything, right? So in that sense, yes, first and last I think are important for correctness, and they have to be persistent, because if you don't persist them and don't enforce them, then what do they even mean? It's basically prepend/append only, which you can already express with before and after. Yeah, that was why I asked why it's useful. 
So I think this also comes back to the discussion we've had before about the difference between a system where you control all the programs and can have this kind of policy, and a system where you are loading third-party applications that just pick flags for themselves, and some random developer has said, I wanna be first, and then they lock out everyone else. And that's gonna happen, so there has to be some kind of override, I think. If they're useful for a use case, I'm okay with keeping them, as long as you can admin-override them in some way. We also talked about having a hook that a policy daemon can hook into to reorder programs, but I guess that can be added later as a way of doing this. So I am somewhat partial to Toke's argument, because we've seen conflicts on before and conflicts on after in other things, not just BPF programs, right? Firewalls or things that say, oh, I need to be first, something that pre-dates BPF, certainly on the Windows side and I think on the Linux side too. So you're going to have collisions; what are you gonna do? And the answer right now is first come, first served, and everybody else loses, right? Separately, I think there's an interesting point in what Lorenz said, and Andrii, you mentioned it too: will people accidentally use first when they just mean I wanna be the first one on the list right now? Maybe there's a case for having a flag that means currently-first, separate from a flag that means always-first, right? Because right now that is before with zero, okay. It's before with zero, where zero is the magic number that just means I'm first on the list. Okay, that wasn't on the slide, thank you. Okay, so we are talking about an end goal of some production system where multiple systems try to coexist, right? 
So yes, definitely there should be some sort of gentlemen's agreement that if you don't really need to be first, you shouldn't specify first, right? And if you do because you didn't know better, that's a bug you report and someone fixes. But the force detachment has to be kind of a human, a CAP_SYS_ADMIN enforcement: detachment and all this stuff. And we have link detach, right? It's slightly different, because link detach leaves the link in place but detaches the underlying program, so maybe we can reuse that, or have a similar approach here, where you can forcefully detach even if it says first or last. But that's more about forceful detachment than about the ordering. You should be able to force-detach anything if you are admin, right? I think that's just what we're talking about. Okay, then a final point: maybe first and last, as we said, people interpret differently; maybe it needs to be always-first and always-last, or force-first, I don't know, whatever. Yeah, Henry said that, I don't know. Maybe that's the hardest, yeah. Okay, always-first versus currently-first is what Dave Thaler says. Okay, just add a bunch of syllables. I have to move on a bit to not overshoot too much in terms of time. Sorry, I have some additional questions, I'm sorry. For this API, you said you take both IDs and file descriptors, and you take both programs and links. Let's start with the IDs: why does it take IDs instead of file descriptors? Because all the other APIs we have, as far as I know, take file descriptors, right? So when we discussed this, the rationale was that the management daemon that would consume this API would query, and every time you query, you get back IDs, right? 
So then you have this additional hurdle that you need to get the file descriptor from the ID. Is that a hurdle? So, yeah. I don't know, is that really a hurdle? But that daemon is not necessarily going to be privileged. So, Andrii says, getting the FD from the ID requires CAP_SYS_ADMIN, so you would need CAP_SYS_ADMIN to be able to specify the other thing; isn't that what you would want? Well, the idea is that it's not always one centralized daemon, right? It could be multiple independent applications that potentially know about each other, and we just don't have a defined order in which they start up. But when they start up, they can go and somehow discover that some other program is already running, and I want to run after it, right? And then they just query the ID; maybe the ID is stored somewhere in a file. So they don't need CAP_SYS_ADMIN if they can specify this ID; they can get these IDs through some other means, right? Okay, then the follow-up is: do we need to add IDs to all new APIs? It kind of begs the question, right? Like, if we do this for all APIs. Yeah, I mean, in the future, the idea was for this API to support this across different attachment points, so in that sense, yes. Okay. All right, moving on. So, the meta device for BPF. The idea, overall, for the Cilium use case is that we really want the same performance for applications inside Kubernetes pods, i.e. network namespaces, as for applications residing in the host. Just because you moved them into a network namespace shouldn't incur a performance penalty, but it currently does. So we did some performance measurements. The first case, what you can see here in turquoise, is basically veth with upper-stack forwarding. 
The next case is some of the improvements we did in the past, what we call BPF host routing, which really means using bpf_redirect_peer and bpf_redirect_neigh for traffic going in and out. And the last one in this comparison is the baseline, the host performance. Why does this suck? When you go up the stack, the SKB is basically orphaned from the socket, and that breaks TCP back pressure. There were multiple attempts on the mailing list to remove the orphaning, but it all boils down to it seeming to break netfilter's TPROXY, so it needs to stay there. That is, I think, the main cause of why the performance drops so much. In the host case, by contrast, the socket is retained all the way to the physical device, the SKB stays associated with it in the qdisc if it has to queue, and then TCP inside the pod really gets the notification. So the question is: can we get to this point? So we added a new device driver as a veth replacement. We call it meta, but I'm still open to other names if you have good, short suggestions; naming is always the hardest. I chose it because it can mean a lot of things depending on what the BPF business logic in that driver does. The core idea is, for traffic going out of the pod, to move the BPF TC programs that are currently attached on TC ingress on the veth side into the driver itself, so that the BPF is executed in the xmit routine of the device. And then, what we can do there: we switch the network namespace immediately, then we can do the FIB lookup, and when we see that it's going to the physical device, we can directly redirect it from there without going through a per-CPU backlog queue. That is the core idea behind it. 
The question is, what about XDP support, because veth has XDP: do we also need to add it here now? And I'm saying I really don't want this, because if you look today into the veth code, XDP takes up three quarters of it. It's really super complex. If you want XDP in the pod, just use veth. For the Kubernetes use case, the idea is that XDP on the physical device definitely makes sense, and after that you get all the GRO batching for SKBs, and that is absolutely fine. In terms of program management, we would reuse the same API that I talked about earlier. It's still a main and a peer device, so you have two devices, and the idea is that only the main device, which resides in the host namespace, can control the program management for both devices. So you can only update the peer device's BPF program from the main device, so that no other entity inside the pod can somehow detach it. The whole device would be an L3 device, so you have it as a NOARP device. L2 mode could be configurable, but it wouldn't be the default; it would still be useful, though: when we talked to some of our folks that do BGP with Cilium, I think it's still useful for testing. The other feature of the device driver is that if no BPF is attached, you black-hole all traffic, so that nothing gets leaked between network namespaces, but that again is also configurable. And the idea is to have compatibility for TC BPF, so that you can then move the programs from TC ingress on veth into this device type. If you look here into the xmit routine: here we switch the network namespace and execute the BPF program, and if the BPF program says redirect, we redirect directly to the physical device without going through the per-CPU backlog queue. 
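The xmit path just described can be summarized with a toy model. Everything here (the verdict names, meta_xmit, the counters) is invented for illustration; it only shows the control flow: run the BPF program in the device's transmit routine after the netns switch, let a redirect verdict bypass the backlog queue, and black-hole when no program is attached.

```c
/* Toy model of the xmit idea: names and verdict values are illustrative,
 * not the actual kernel enums. */
enum verdict { V_PASS, V_DROP, V_REDIRECT };

struct counters {
	int backlog;   /* queued to the per-CPU backlog, i.e. the host stack path */
	int phys_xmit; /* sent straight to the physical device */
	int dropped;
};

/* BPF program stand-in: a callback returning a verdict */
typedef enum verdict (*prog_fn)(void);

/* Runs in the sender's process context right in the device's xmit routine;
 * a redirect verdict short-circuits the backlog queue entirely. */
static void meta_xmit(struct counters *c, prog_fn prog)
{
	enum verdict v = prog ? prog() : V_DROP; /* no prog attached: black hole */

	switch (v) {
	case V_REDIRECT:
		c->phys_xmit++; /* direct to the physical device */
		break;
	case V_PASS:
		c->backlog++;   /* up the host stack, same path as veth takes */
		break;
	default:
		c->dropped++;
	}
}

/* sample programs for exercising the model */
static enum verdict prog_redirect(void) { return V_REDIRECT; }
static enum verdict prog_pass(void)     { return V_PASS; }
```

The default black-hole when no program is attached matches the leak-prevention behavior described above, and V_PASS models the packet continuing up to the host stack.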
And if you look into the flame graphs, what you can see here on the right side is that it's all done in process context, so there's no rescheduling point as in the veth case when you hit the worst-case scenario. Looking at performance, this is where we get to: we're really on par with the host, for the TCP stream case the same 500 gigabit as the host, and for TCP_RR we get the lower latency, so it's really on par. For Cilium, this is basically one of the last missing building blocks we wanted to get to. So here we have the bandwidth manager with FQ and EDT, where we would then also be able to support BBR from the pods; we have BPF host routing, where for inbound traffic we do bpf_redirect_peer and for outbound traffic bpf_redirect_neigh; then we have the veth device replacement, so that we really get to the same performance as I showed earlier in the graphs; and the other thing is BIG TCP, which we enabled, but even with BIG TCP we get lower latency, which is what we measured here as well. 
So yeah, that is the proposal. One small open question I wanted to discuss, given that this is coming as a module: I think there are two options. As far as I've seen, netfilter BPF basically has a Kconfig boolean, and when that's enabled, the syscall is able to create links for it. Or we could have some kind of registration API for this device, so that we keep all the logic around the bpf_mprog handling inside the device driver, and when it loads, it just registers itself, and then we do an NDO call to delegate. I'm leaning towards the latter and would try that out as the next step when implementing it. But, as far as I recall, network namespaces are not something you build as a module; they're built into the kernel with networking support, and without netns this device is a completely useless configuration. Yeah, that's true. So try to do the same: just make it depend on netns and be part of the core; don't make it a driver. Why does it have to be a driver? netns didn't have to be a driver. All right, that's fair, yeah, okay. It's a crucial part of netns support; it's core, it's not a driver. Yeah, that's a good point. That sounds good to me. Last slide: the generic multi-attach API is pretty much all implemented and working, so I'm still adding a bunch more test cases to make it ready for upstream submission. Once that lands, that's also the prerequisite for the meta device, so that's the next step after that. And then I'm thinking of looking into adding this multi-attach support for XDP as well, so that we have this for both TC and XDP. Yeah. 
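Coming back to the bpf_mprog internals from earlier, the A/B pair swap can be sketched as a userspace simulation. Structure and function names here are illustrative, not the kernel's actual bpf_mprog code; the point is that every update builds the inactive copy of a preallocated pair and flips a pointer, so detach never allocates and therefore cannot fail with -ENOMEM.

```c
#include <string.h>

#define MAX_ENTRIES 8

/* One slot per attachment, mirroring the described state:
 * prog pointer, persistent flags, and the (link) ID. */
struct entry {
	void *prog;
	unsigned int flags;
	unsigned int id;
};

/* Two preallocated copies; `active` is what the fast path walks.
 * A NULL prog pointer terminates the walk, as described in the talk. */
struct mprog {
	struct entry a[MAX_ENTRIES + 1]; /* +1 for the NULL terminator */
	struct entry b[MAX_ENTRIES + 1];
	struct entry *active;
};

static void mprog_init(struct mprog *mp)
{
	memset(mp, 0, sizeof(*mp));
	mp->active = mp->a;
}

static struct entry *mprog_inactive(struct mprog *mp)
{
	return mp->active == mp->a ? mp->b : mp->a;
}

/* capacity checks omitted for brevity */
static void mprog_append(struct mprog *mp, void *prog, unsigned int id)
{
	struct entry *src = mp->active, *dst = mprog_inactive(mp);
	int i = 0;

	for (; src[i].prog; i++)
		dst[i] = src[i];
	dst[i].prog = prog;
	dst[i].flags = 0;
	dst[i].id = id;
	memset(&dst[i + 1], 0, sizeof(dst[0]));
	mp->active = dst; /* swap: the fast path now sees the new copy */
}

/* rebuild the chain without `id` into the inactive copy, then swap */
static void mprog_detach(struct mprog *mp, unsigned int id)
{
	struct entry *src = mp->active, *dst = mprog_inactive(mp);
	int i, j = 0;

	for (i = 0; src[i].prog; i++)
		if (src[i].id != id)
			dst[j++] = src[i];
	memset(&dst[j], 0, sizeof(dst[0]));
	mp->active = dst;
}
```

This also illustrates Andrii's point from the Q&A: the flags and IDs live next to the prog pointer purely for management convenience; a layout optimized for the runtime walk could split them.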
We want this for cgroups as well, and maybe for LSMs, pretty much for everything: even where we have multi-attach right now, we need more control over the order. One more point, sorry, going back to the first part, about the first and last: maybe we should add another flag, like force, and then if you have CAP_SYS_ADMIN you can go and basically detach the existing first or last, or something like that. And another question I have, going back to the very beginning, with bpf_mprog: you're using this mprog at runtime to iterate over the programs. If you want to optimize for cache, why would you collocate the program pointer with the flags? You don't need the flags at runtime; I would split them. Just for management, yeah. And also, I don't know if it's good that we just allocate the maximum length of the array; we'll potentially be wasting a lot of memory there. Yeah. You probably wanted to simplify, like, not fail on realloc and stuff like this, but maybe there is some middle ground. Yeah, I think that could still be done. I mean, yeah, we can still change it; it's not UAPI where it's baked in forever. I agree. Cool. A question on the meta device: you mentioned that if it goes to the physical device, then you directly do the redirect without going through the backlog queue. What about the packet going up to the host? So that is exactly the packet going up to the host, yeah: that would go to the backlog queue, and it would take the same path as with veth. Cool. So, I don't want to overshoot too much. Thank you very much.