All right, are we good? All right, so good morning, everyone. I have two topics to discuss. One topic is the TC BPF rework — we discussed this on Monday, so I'm not going into details here — and the other one I wanted to bring up is a proposal for a micro-veth driver.

What is the overall goal? The overall goal that I would like Cilium to get to, and with that BPF as well, is to basically have Kubernetes pod networking with the same efficiency as if the application were running in the host. Just to provide some context, there have already been steps in the past that went in this direction. One thing that was quite long ago is, for example, retaining the skb socket association across the network namespace switch. Another thing that we built in Cilium is to do all the forwarding inside the host namespace out of the TC layer, so it doesn't even go into the upper stack, thanks to a couple of helpers. And the latest thing is the skb timestamp preservation.

If you look at all of those, they have a common theme. In my opinion, one part is to hold the skb socket association all the way from the pod's network stack to the physical driver when you send the packet out. Only then, once you send the packet out, is the TCP stack in the pod's network namespace signaled that the packet is actually on the wire, which gives better feedback, for example, for TCP TSQ. The other part is to retain important skb metadata that shouldn't be scrubbed, for example the timestamp. And the third is an efficient network namespace switch: you don't need to go through the backlog queue when you switch network namespaces.

Just to provide quick context: when you have a BPF program attached to the physical NIC, it can do bpf_redirect_peer() for a fast network namespace switch, by just resetting the device and then going into another loop in the main receive handler. And that's pretty much it — it's really zero cost to go into the network namespace. On the way out, the setup that we have right now sits on the veth device inside the host namespace: on TC ingress, the program there does a FIB lookup, and when the neighbor entry is in the table, it directly does a bpf_redirect(). If it's not in the table, it uses the bpf_redirect_neigh() helper, which pushes the packet into the neighboring subsystem to resolve the ARP, for example, and then forwards it to the physical device (see the sketch below). The good thing with such a setup is that, as I mentioned, the socket association of the skb is preserved all the way to the NIC, where there's, for example, MQ and FQ. FQ can then do its job, because it also looks at whether there's a socket associated with the skb, and queues it properly. And with the recent work that was merged into the kernel, the delivery time is preserved as well. So I think that's also an important step in this direction of making networking from the pod itself more efficient.

Looking further, one thing that is still not resolved is traffic that is leaving the network namespace. Take the veth device as the typical example: it sets the device to the veth peer in the host namespace, and then it queues the packet to a per-CPU backlog queue.
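To make that TC ingress setup concrete, here is a minimal sketch of what such a program on the host-side veth can look like. This is modeled on the kernel's own bpf_redirect_neigh() selftests rather than Cilium's actual datapath; the program name is made up, it handles IPv4 only, and TTL/checksum updates are left out:

```c
/* Hypothetical sketch: FIB lookup on TC ingress of the host-side veth,
 * bpf_redirect() when the neighbor entry exists, bpf_redirect_neigh()
 * otherwise. Not Cilium's real datapath code.
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#ifndef AF_INET
#define AF_INET 2
#endif

SEC("tc")
int host_veth_ingress(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct bpf_fib_lookup fib = {};
	struct ethhdr *eth = data;
	struct iphdr *ip4;
	long ret;

	if (data + sizeof(*eth) + sizeof(*ip4) > data_end)
		return TC_ACT_OK;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return TC_ACT_OK;
	ip4 = data + sizeof(*eth);

	fib.family	= AF_INET;
	fib.tos		= ip4->tos;
	fib.l4_protocol	= ip4->protocol;
	fib.tot_len	= bpf_ntohs(ip4->tot_len);
	fib.ipv4_src	= ip4->saddr;
	fib.ipv4_dst	= ip4->daddr;
	fib.ifindex	= skb->ingress_ifindex;

	ret = bpf_fib_lookup(skb, &fib, sizeof(fib), 0);
	switch (ret) {
	case BPF_FIB_LKUP_RET_SUCCESS:
		/* Neighbor resolved: rewrite MACs and redirect straight
		 * to the physical device. */
		__builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
		__builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);
		return bpf_redirect(fib.ifindex, 0);
	case BPF_FIB_LKUP_RET_NO_NEIGH:
		/* Push the skb into the neighboring subsystem so it
		 * resolves ARP/ND and forwards for us. */
		return bpf_redirect_neigh(fib.ifindex, NULL, 0, 0);
	default:
		return TC_ACT_OK;
	}
}

char _license[] SEC("license") = "GPL";
```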
If needed, it schedules another NAPI instance, and then net_rx_action() will pick it up — there's this process_backlog() thing, which is one of the polling callbacks — and it picks the packet back up from the per-CPU backlog queue and pushes it up into the upper stack. I think back then this was probably done just to avoid kernel stack overflow; that would be my assumption, from when the kernel stack was not as big as it is today, so that you get a fresh stack. One thing that I'm planning to experiment with is to just directly call into the receive routine from the device driver's xmit, so you don't even go through this backlog queue (see the sketch below). That would be my proposal. Whether this warrants a new, let's say, micro-veth driver, or whether it should just go into the veth driver, can also be up for discussion. I have my opinions; we can elaborate on that. But just to gather some feedback: what do people think, and has there been some experimentation in this regard from some of you already?

The point being that you would have the TC hook inside netif_receive_skb(), right? That's the point, right? You want to get to the BPF program.

I basically want to avoid that the packets have to go through the backlog queue, so that they go directly to the receive path.

I mean, could we just put an XDP program that does the redirect in there? Like, do we even need netif_receive_skb() at all? When the veth does the skb forward to the other veth, can we just call an XDP program running on top of that veth — like, on the peer?

So why would you do that?

Because it's like XDP receive. At that point you get to your BPF program and you don't have netif_receive_skb(), you don't have anything else. It's just your XDP program there, and instead of doing the netif_receive, you just put it on the transmit queue of wherever you want to go. Just do a redirect. This is an XDP redirect "in quotes", because it's an skb, it's not really XDP. But you could redirect it directly; you wouldn't have to get into this netif_receive_skb() thing at all.

Oh, okay, I see what you're saying. I was just wondering, regarding XDP — at this point you do have an skb, right, so you'd have to transform it to XDP?

Yeah, I wouldn't actually turn it into an xdp_buff, but the hook point would be equivalent to XDP in the veth. Think about XDP on a real NIC: it's in the receive path, before the packet is ever given to the stack. So this would be XDP in the sense that the hook would be inside the veth driver on receive, when you do the skb forward — you know, when you switch the dev, and then in theory you scrub it. We could just have an XDP hook there, and maybe it handles it as an skb; you might need some sugar to make that work.

And then you're saying that from this hook, which is inside the dev_queue_xmit path of the veth driver, you would redirect it out of there directly?

Yeah, I would just say redirect to my peer, and then you would just put it on the TX queue of the other veth. You would never touch the stack. The whole flow would be inside the veth driver. Or you would go from the veth driver to a NIC driver.
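As a rough illustration of the first variant — calling straight into the receive path instead of queuing to the backlog — the transmit routine of such a micro-veth could short-circuit like this. This is hypothetical code, not the current veth.c; all names are made up, and locking, stats and multi-queue details are trimmed:

```c
/* "Micro-veth" xmit sketch: hand the skb straight to the peer's receive
 * path instead of queuing it to the per-CPU backlog.
 */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct microveth_priv {
	struct net_device __rcu *peer;	/* the other end of the pair */
};

static netdev_tx_t microveth_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct microveth_priv *priv = netdev_priv(dev);
	struct net_device *peer;

	rcu_read_lock();
	peer = rcu_dereference(priv->peer);
	if (unlikely(!peer)) {
		rcu_read_unlock();
		dev_kfree_skb_any(skb);
		return NETDEV_TX_OK;
	}
	/* __dev_forward_skb() switches skb->dev to the peer and scrubs
	 * the skb the way veth does today (whether that scrubbing is
	 * needed at all is questioned below); it frees the skb itself
	 * on failure.
	 */
	if (likely(__dev_forward_skb(peer, skb) == NET_RX_SUCCESS))
		/* Direct call instead of netif_rx(): no backlog queue,
		 * no extra NAPI scheduling; relies on today's larger
		 * kernel stacks to make the deeper call chain safe.
		 */
		netif_receive_skb(skb);
	rcu_read_unlock();
	return NETDEV_TX_OK;
}
```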
You would never have to leave the, sort of, ecosystem of drivers.

I think that makes sense. And it would still be compatible with the existing programs, right? Because they are at the TC layer, and whether the peer is processed first or not, you don't care, right? Not much more should change. Maybe you would potentially do the scrubbing before that, but you could even do that in BPF.

I mean, why would you care? Yeah, why would you even scrub it? You wouldn't normally scrub something a NIC driver receives. So I would say: don't even bother to scrub it.

Yeah, I think that makes sense. One thing that would also be useful with that is that you can drop packets earlier. You don't even need to go through process_backlog() just to reach the veth in the host namespace.

Yeah, and there's no "am I in polling mode or not in polling mode" or all this other stuff, right? It's not a question of "when do I turn that veth off?" — it's just: drop it.

Right. And the application inside that network namespace is not allowed to modify the program in any way, so it's always controlled by the orchestration, from inside the host namespace.

Yeah, because it's logically part of the veth, right? Regardless of how the code actually calls it, logically it's part of that veth on the outside of the pod; it's receiving the packet there. That's the only reason I say XDP — because it looks like the XDP hook at the source location, even if the data structure is not the same.

So I guess then the question is, would you have one BPF program that is associated with a driver, or two? I think you would probably have one for each side of the veth pair, right? Like, when you come from the host and go into the pod, you would need something.

Yeah, yeah. Because this case is different compared to this one, right? I mean, here you would just skip the program that is...

But that should be okay, right?

That's okay, because you have your program here, you can do policy enforcement and so on. But for applications that are in the host, they would still have to go through this. And there you would probably have a BPF program that is executed in dev_queue_xmit on this guy, and then one for leaving.

I mean, maybe we finally bite the bullet and do TX XDP programs. Right? A TX XDP program. We only have RX now, but we could have TX; it just hasn't been very compelling so far. But we could do it, right?

I'm a bit skeptical about that. What would be the advantage? That you would get the call basically there. But you could just do TC there; it's asymmetric and maybe slightly awkward for the controller, but it's probably fine. I mean, that's been the argument against doing XDP TX all along, right? Just put the program on the TC side.

So you'd have an XDP program on RX, and then a TC program on TX. One would work on xdp_buffs "in quotes", and one would work on skbs. I don't think the Cilium code would care. It's fine, right?

Yeah, I know. It's all abstracted away anyway.

Yeah. And then you would never touch the host stack.

Exactly. I still like doing the FIB lookup for the host namespace — I think that's important — but the other things are not needed.

And then all the SR-IOV folks would be like: there's no point.
Yeah. Right? This would basically be SR-IOV at that point — you've just plumbed it directly in.

I was also thinking about where to put this logic. Let's say we do a BPF program that is part of the driver, of such a veth driver, like what we mentioned earlier. In XDP, right? If I think of the flow, there's the skb forward, and then I would call the XDP run immediately after that.

I wouldn't even wait for the skb forward — just call it before the skb forward, because if you're going to drop it anyway, don't do the work. Just have a dev_queue_xmit that calls the BPF program sitting on the driver, and if there's nothing sitting on the driver, it just kfrees the skb — it's all just part of the driver. But logically that's the receive side; we can play software optimizations in the driver, but logically, as a model, that XDP program should be on the outside, with the veth.

Logically it would be on the outside, but it's actually inside, yeah. Doesn't matter, right? Because I think we're saying the same thing. I think that makes sense.

I was also thinking of maybe putting this into its own device driver. Right now it seems to me like half of the veth driver code is XDP related, which I think very few people use, and it's also partially sitting in the fast path. None of that is needed for this. And the other thing related to this: people have now made the veth driver multi-queue, which is actually broken, I think. Say you have, for example, 16 queue pairs for your veth device, and your actual physical NIC has more than that. The network stack now sets the skb queue_mapping for the veth because it's multi-queue, and that value is retained all the way to the physical driver. Some of the drivers, like the one for AWS, will just take whatever is there, so you cannot use the full range of queues.

I lost track of this. When I was working on it for XDP, it wasn't multi-queue yet. What is the point of having multiple queues? Why don't you just use the thread that you're on? There's some context, you submit it, and dev_queue_xmit is called from somewhere. Why do we need any queuing at all in veth? It's completely software.

Yes, exactly. But my understanding is that this was done because of XDP. When you have an XDP program on your physical NIC that is processing all the RX stuff, and you want to forward that to one of the veth devices, you would need it there, but...

But why? Fundamentally, you just want to call dev_queue_xmit, or put it on the receive queue. There's nothing there. The reason you have queues in hardware is because there's DMA — you need to fetch the descriptor. There's no descriptor here. It's just an skb.

Yeah, that's why I would like to make it minimal again. I think we've been stuck on this slide for too long.

Yes. Okay. Next topic. The second topic is around the socket hooks, for UDP and TCP in particular. In Cilium, we use those socket hooks — connect, sendmsg, recvmsg, getpeername and bind — for the east-west load balancer. The way it works is that on connect or sendmsg, if the destination in the socket address structure is one of the service VIPs and ports, the hook selects a backend (sketched below).
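For reference, the connect-time translation in such a hook works roughly like this. This is a toy sketch with one hard-coded VIP-to-backend pair standing in for Cilium's service maps and backend selection; the program name and addresses are made up:

```c
/* Sketch of a cgroup connect hook rewriting a service VIP to a backend. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Illustrative constants; the real thing uses BPF maps. */
#define SERVICE_VIP	0x0a600001	/* 10.96.0.1 */
#define SERVICE_PORT	53
#define BACKEND_IP	0x0a000105	/* 10.0.1.5 */
#define BACKEND_PORT	53

SEC("cgroup/connect4")
int lb_connect4(struct bpf_sock_addr *ctx)
{
	/* user_ip4/user_port are in network byte order. */
	if (ctx->user_ip4 == bpf_htonl(SERVICE_VIP) &&
	    ctx->user_port == bpf_htons(SERVICE_PORT)) {
		/* Rewrite the destination before connect() proceeds; the
		 * application keeps seeing the VIP, since recvmsg and
		 * getpeername hooks do the reverse translation.
		 */
		ctx->user_ip4 = bpf_htonl(BACKEND_IP);
		ctx->user_port = bpf_htons(BACKEND_PORT);
	}
	return 1;	/* allow the syscall */
}

char _license[] SEC("license") = "GPL";
```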
And for traffic coming back in, on either recvmsg or when the application does a getpeername, we do the reverse translation. So we just lie to the application, telling it that it's connected to the service VIP instead of some concrete backend. We also use bind to block some of the bind requests, but that's an irrelevant detail here.

The connect hook can be used for both UDP and TCP. UDP is maybe a bit more special: with UDP you can use connected UDP, but also unconnected UDP — and even both at the same time on the same socket. One thing we have seen is that some DNS resolvers in the wild use connected UDP. And there's an actual problem that we ran into: given the resolver does a connect, it picks one of the backends for resolving DNS. But at some later point in time, when that backend goes away, the application doesn't notice. It just keeps trying to send traffic there, because there's no signal that would be propagated back to the application — the sendmsg hook that we have, even for UDP, is not called in the connected case. It's simply not invoked. For TCP this is usually not a problem. It's not particularly nice, but the application would get a reset if it still tries to talk to a backend that is going away. And this is basically how it looks: in the connected case, we don't call into the sendmsg hook. And one proposal, yeah...

So in the UDP case, you should at least get something like a port-unreachable or address-unreachable ICMP. Is there something you can do with that?

Yeah, I mean, right — the network is getting a notification that the backend went away in that case, and it's just a matter of making sure the application can get it somehow. You would have to make sure the application can get it somehow, out of band. But what you actually want is — I mean, there are still other backends where you could do the DNS resolution, for example. So that's what I'm trying to propose here: a new hook for the sendmsg case when you have a connected socket. It would definitely need to cover connected UDP, and connected TCP could be an option as well. I was thinking of an input context similar to the one for bind, where we just put in the socket. With that, you can look up whether the socket is actually associated with one of the backends and is talking to an actual service, and whether that backend still exists. If not, what we could potentially do is have a connect helper call, and then you would reconnect the socket to a new backend for UDP, so that in the kernel, for that socket, a new dst entry gets cached — you would basically tell the kernel to use the new backend instead of the one that went away. It's a bit ugly in the sense that you would then have to call this program and do the check every time connected UDP does a sendmsg, but...

Yeah, that's what I was going to say. Don't you want this out of the sendmsg path? Don't you want to do almost a callback and then change the socket? Like, we should have some event where the delete comes out of the map, right? Because that's where you really want to update the socket. Otherwise you're in the data path, which, I mean...
But then you would have to track all the sockets, in all the different namespaces, that could potentially be talking to that backend.

Meaning connection tracking, basically, right? Like, which sockets map to this backend?

Yeah, but I don't know if you want to be in every send, right? If I get the problem correctly, you will know that a server on the other side is going away. So at that point, on your local host, just iterate all of the connected UDP sockets that would potentially not get any notification. Iterate all of them, see which ones are connected to something that's dead, and do something with them — not reset, but reconnect them underneath or tear them down somehow.

I mean, in the case of UDP, you would have to select a new backend, but then you have to somehow keep track — you would need some mechanism to track all the sockets that are related to a given backend, right?

Don't keep track. When you know a backend is going away, just look at what this host is connected to — connected UDP is the only problem, so you know where the sockets are talking to. Iterate all the sockets, look at the destination IP and port, and if they are indeed pointing to the service that's going away, do something with them, instead of adding another hook. Because this is the slow path, right? It rarely happens. And here you would be sacrificing the fast path for it — that just feels odd.

I agree. So the proposal is to have a socket iterator that goes through all the namespaces, sees whether a socket is connected to that backend, and then selects a new one (see the sketch below). And for TCP, you could probably have something like — with the ss tool, I think they added a socket-kill mechanism. It would make sense to just kill the socket so that it doesn't even emit packets on the wire.

A helper to kill a socket would be useful in general. I think we would use it for other things, like security things, right? Right now we just drop all the packets when we decide we don't like these sockets anymore, but it would be nice. Sometimes we can force a reset on TCP, but it would be nice in UDP to just...

I agree. Yeah, that would be useful in general. So, okay, I will look into the socket iterator. We would still need this BPF connect helper so that we can select a new backend from there, from the iterator, but that would solve it, right?

That's good. Would the new hook be specific to UDP? Or is there any case you can think of where you'd want to use it for TCP?

For TCP, like... I mean, if it's UDP only, call it bpf_udp_connect or something then, because otherwise people will be misled into thinking it works with any socket.

Oh, okay. Just like the BSD connect API, right? Connect works with UDP and TCP, like you said. So either make it generic enough that it could work for other things, or put UDP in the name.

Yeah, I mean, we could allow it only for UDP sockets specifically and not for TCP, and later on, if somebody cares, it could be added.

Well, if we do more bikeshedding: it's not really connect. Connected UDP is not really connected to anything.

No, no. That's why the whole name is like...

It's just caching the route in the socket. At least there's some advantage in matching the classic BSD sockets name.
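A sketch of that iterator direction, using the existing UDP socket iterator (SEC("iter/udp")): the reconnect-or-kill step is exactly the missing helper discussed above, so it is stubbed out with a printout here, and namespace handling is glossed over. The program name and the global variables are made up:

```c
/* Walk UDP sockets and flag the ones still connected to a dead backend. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_endian.h>

/* vmlinux.h does not carry the kernel's inet_* accessor macros, so
 * mirror the two we need (see include/net/inet_sock.h). */
#define inet_daddr sk.__sk_common.skc_daddr
#define inet_dport sk.__sk_common.skc_dport

/* Backend that went away; the agent would set these before running the
 * iterator. Purely illustrative. */
const volatile __be32 dead_backend_ip;
const volatile __be16 dead_backend_port;

SEC("iter/udp")
int stale_udp_socks(struct bpf_iter__udp *ctx)
{
	struct udp_sock *udp_sk = ctx->udp_sk;
	struct inet_sock *inet;

	if (!udp_sk)
		return 0;
	inet = &udp_sk->inet;

	/* Only connected UDP sockets cache a destination. */
	if (inet->inet_daddr != dead_backend_ip ||
	    inet->inet_dport != dead_backend_port)
		return 0;

	/* This is where the proposed connect helper would re-point the
	 * socket at a healthy backend (or a kill helper would tear it
	 * down); neither exists in this sketch, so just report it.
	 */
	BPF_SEQ_PRINTF(ctx->meta->seq, "stale connected UDP sock -> %x:%d\n",
		       bpf_ntohl(inet->inet_daddr),
		       bpf_ntohs(inet->inet_dport));
	return 0;
}

char _license[] SEC("license") = "GPL";
```

The iterator would be pinned and run by the agent whenever it deletes a backend from the service map, which keeps all of this off the per-packet fast path, as argued above.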
Yes, it's a misnomer, but it's a misnomer everyone already understands.

Yeah, I agree. Cool, yeah. That's all I had.