All right, ready to go. I want to talk about two things on the networking side really quickly. I can't point at my picture, so I'll make do like this.

I was going to do a quick sockmap update on some of the stuff we're working on, but before I do that I wanted to hijack it slightly to talk about another problem that we recently hit.

If we look at how you grab packets off the device at the lowest level, where you want a per-packet view, typically we do this with a TC filter right now: we attach a qdisc, and then we attach a BPF filter to the qdisc. This works if you're able to attach the TC filter to the qdisc. The problem that we've recently hit is twofold.

One is that you have to know when interfaces are being created so that you can attach a filter to them. So you have to go through this step where you get notified that a new device was created, then you attach your filter, which may be some ACL policies or some observability piece, and then you have to tell whatever is out there that it can now continue to bring up whatever is using the interface, like a pod in Kubernetes speak, or a container in other languages. This is problematic because now you're trying to interleave yourself into this control flow, and you need to get that correct. If you get it wrong, you basically have a gap where you don't see any traffic that the container was sending. So that's the first problem.

The second problem is in the Kubernetes space. If you look over here, we have two pods in my sort of stick diagram, and those pods can actually have other devices in them that we're not even told about by Kubernetes. So if an application creates another loopback device, we don't have a filter on it.
So we are just blind to that traffic. Or there's the SR-IOV case, which we've actually seen a few use cases of in production, where they have other devices outside of Kubernetes that they pull into the pod for high-performance, low-latency type things. We don't have any visibility there either, because there's no signal from the control plane that this happened, or at least that control plane is a different control plane. Now we have to go instrument that secondary control plane to tell us this was added, and we're back to this whole control flow problem, or we need to make sure they don't send traffic until after we've attached. Even worse, Cilium, for example, is running in the host, so somehow it has to reach into the container and say "attach my filter" and then somehow tell that container not to detach it, which is not really plausible.

There are some workarounds for this. You can use a kprobe that just attaches to dev_queue_xmit and the receive side, and you will see every packet regardless of when the device was created. You don't need a signal: your kprobe gets called, you have a pointer to the device, so you know what device it is, and then you can do whatever you need, filter-wise or otherwise. The only problem with the kprobe is that it's a kprobe.
You don't know that it's enough; you're in this kprobe space and you don't have all your normal skb helpers. So I thought about this, and there are two ways that I'd propose fixing it. One is to put a dedicated hook right where we have the egress hook now, but abstract it away from the device so that you don't have to attach per device. You attach maybe at the cgroup level, like a cgroup program, which you could attach to the root cgroup, and then it would be called on every packet. You would get a device pointer and the skb; well, you don't even need the device pointer, because it's embedded in the skb at that point. So you just need the skb, and then you could run all your skb helpers over the top of it. You would also always be guaranteed that your program is run, and you wouldn't need to worry about this ordering problem.

Like cgroup egress? I thought that's not device specific; it's scoped by cgroup.

Yeah, that's at the IP layer, right? I want to go one lower, to the L2 layer.

Wait, I thought you only want this to find out which device was created. You could kprobe netdev create, know the device name, and then attach to it. Or a netlink notify; I thought netlink has the whole notification mechanism for netdevs coming and going.

It has the notifier, but it's racy: you get the event, but the device might have already been created. But really, maybe I didn't get the point across: what we want is to see every packet coming out of, say, lo inside the pod, or eth0 if they spun up an eth device inside the pod, and we want to see it like a packet hook. So we see every L2 packet coming out of the device.

So you want a cgroup-scoped hook, but down at that layer?

Yeah, just one layer lower.
So we have the IP layer now, right, and we have all the TCP cgroup hooks for everything in the TCP stack, and you have the IP ones for UDP and everything else. But we want to be down at the packet level, so we know what device it is.

Is the main problem with cgroup scoping at layer 2 that there's no cgroup, right, there's no sk? Or if there is one, it could be the kernel just doing, I don't know, MTU probing.

Yeah, that's cgroup specific. But I actually don't care about the cgroup scoping, because even if I had a hook that ran on every packet that went through dev_queue_xmit, which is the last call before the driver, that would be enough. I don't actually have a use case for the scoping. It would be where the TC hook is, but without the TC context. I don't care about the qdisc; I'm not trying to implement a queuing discipline, I'm just trying to implement a filter or an observability point.

And I think you also want the case where the application in the pod itself cannot just remove it.

Right, exactly. I don't even want the pod to see it, really. Ideally the pod has no way to even know if it's hooked or not hooked; it just does its normal thing. This becomes even more problematic with SR-IOV, where they bind an SR-IOV device in there. Now they have access to the network, but Cilium has no way to do filtering; it's outside Cilium's scope. You could in theory hook into this SR-IOV control plane, figure out that the device was added, and add your program somehow inside; you could do an nsenter and add it. But then they could delete it, right? This is really ugly too.
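As an editorial aside, the ordering gap being described can be illustrated with a small userspace toy model. This is only a sketch under stated assumptions: `Host`, `Device`, `global_hooks` and `demo` are hypothetical names invented for illustration, none of this is kernel or BPF code, and the "global hook" stands in for the device-independent hook being proposed, which does not exist as such.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Toy stand-in for a network device with per-device (TC-style) filters."""
    name: str
    filters: list = field(default_factory=list)


class Host:
    """Toy host: per-device filters plus a hypothetical device-independent hook."""

    def __init__(self):
        self.devices = {}
        self.global_hooks = []  # assumption: models the proposed "run on every packet" hook

    def create_device(self, name):
        self.devices[name] = Device(name)

    def transmit(self, dev_name, packet):
        dev = self.devices[dev_name]
        # The global hook runs no matter when or how the device was created.
        for hook in self.global_hooks:
            hook(dev.name, packet)
        # Per-device filters only run if someone attached them in time.
        for flt in dev.filters:
            flt(dev.name, packet)


def demo():
    seen_per_device, seen_global = [], []
    host = Host()
    host.global_hooks.append(lambda d, p: seen_global.append((d, p)))

    host.create_device("eth0")
    host.transmit("eth0", "pkt-1")  # sent before the agent attached its filter
    host.devices["eth0"].filters.append(lambda d, p: seen_per_device.append((d, p)))
    host.transmit("eth0", "pkt-2")

    host.create_device("lo2")       # device the agent was never told about
    host.transmit("lo2", "pkt-3")
    return seen_per_device, seen_global
```

In this model the per-device filter misses both the early packet on `eth0` and everything on the unannounced device, while the device-independent hook sees all three packets.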
So I think you should just get away from trying to race with the control flow and just say: here's my hook, I want it to run on every packet that leaves my system.

What about having a sleepable hook inside the netdev create path?

You mean in the sense that that hook would be able to attach things to the device, like a filter or a qdisc?

Yeah.

But then we'd have to attach the filter and not allow that filter to be deleted by the container.

Yes. But that's separate; to me it feels orthogonal, not being able to delete it or see it. We can solve that separately. Whereas a hook that's not gated by anything, well, it just feels like that will be a harder sell.

Yeah, so two things. I think if that BPF program could then say: here's my program, either in a program map or something, and here's the dev, and then a BPF helper just attaches it, that would probably be sufficient. You don't even need to call out to user space; you could just do a BPF helper attach. That would probably work.

One other question there: you said before that you don't care about the cgroup, but you probably do want to know which cgroup the device is inside, so that you can do some different policy or different logic.

Yeah, I assume this would have a cgroup ID or something that I could grab and associate with some policy, right?

Yes. I mean, for egress it would probably work; you can go from the skb to a socket and then to the cgroup.

Two things, right?
So one is the create hook, which I think is what we're talking about here, and the other is the actual data path side. On the data path, on egress, you do have the skb, which has the dev, and the dev has a net namespace. I can probe all that if I want to; I'd probably cache it. But you could go skb to dev, dev to namespace, and now you have the network namespace and you're off and running. You should have a map somewhere that says this network namespace belongs to this policy group. On receive it's slightly trickier, but you still have the dev, and the dev has a namespace. In the Kubernetes world a pod is sort of a set of namespaces, one of them being the network namespace. It gets a bit tricky with containers, because containers can have different network namespaces.

So you want it per netns and not per cgroup, right?

From my side, I'm not so interested in the per-cgroup piece. What I'm interested in is ensuring that the pod has no way to delete my hook and has no way to get running before my hook is in place. Whether that's a generic hook that runs on everything and doesn't have any context, or we somehow hook create and have a BPF helper load it, I'm not sure I care too much.

So then going back to the deletion part: we've talked forever about BPF link for TC. One of the reasons is exactly so that nobody can come along and delete it.

Sorry, say that again? I think I may have missed it. BPF link for TC?

Well, if you're holding that FD for an attachment, no one will delete it.

And could we do a link from BPF code? Build the link and put the file descriptor in a map, and then the map is in the host context, so it can never be removed.
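As an editorial aside, the data path lookup described above (skb to dev, dev to network namespace, namespace to policy) can be sketched as a toy userspace model. The class names, the namespace numbers and the policy names below are all hypothetical and chosen only for illustration; in the kernel the equivalent walk would go through the real skb and net_device structures, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetNamespace:
    inum: int  # assumption: an integer identifier for the namespace


@dataclass
class NetDevice:
    name: str
    netns: NetNamespace


@dataclass
class Skb:
    dev: NetDevice
    payload: bytes


# Hypothetical map from network namespace to policy group.
POLICY_BY_NETNS = {
    4026532001: "pod-a-policy",
    4026532002: "pod-b-policy",
}


def policy_for(skb, policy_map=POLICY_BY_NETNS, default="host-default"):
    """Walk skb -> dev -> netns, then look the namespace up in the policy map."""
    return policy_map.get(skb.dev.netns.inum, default)
```

For example, a packet leaving a device that lives in namespace 4026532001 resolves to "pod-a-policy", and a packet from a namespace that is not in the map falls back to the default.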
Of course, currently there is no BPF link for TC. I think last it was Kumar who had patches to do it, and that never landed.

Same stuff; maybe more fundamentally, it's the same as why you would have BPF link for XDP. But if I take a half step back, it's like: why do I want to be in the TC infrastructure at all? I'm not doing a qdisc. The kprobe hook that I have now is almost all I need; it just doesn't have a few of the helpers that would be nice to have.

So that's the TC hook, but without TC, the way we were trying to talk it through.

Yes, it's TC, but there's no TC. No need to even create a qdisc, what we call the ingress and egress qdiscs, none of that, because it's just a hook.

Yeah. So the idea was that from an API point of view it looks like TC, but it doesn't create any qdisc business. There's no iproute2 involved in any of this. It's just this layer where currently all the hooks are, but without creating all the qdisc stuff, and with the guarantee that the attachment is preserved by a BPF link.

And then could we take where that egress hook is right now and just say there's a BPF hook here, here's its link, and then we attach it with this BPF helper?

So a link is an attachment itself: you have a program, you attach it there, you get an FD, and it stays attached. Currently we cannot persist links anywhere but BPF FS.

I think that would be sufficient, plus the hook at create time, which we could probably just do with a kprobe, right? If we had this helper exposed in sleepable context.

I think there might even be an LSM hook somewhere for netlink. Maybe not for netdev. Well, if not, potentially that should be there, at netdev creation. There was some work.
Yeah, I would say.

Yeah, or we could just put a kprobe there with the helper, right? You probably don't even need to sleep to attach a program, to get the link. If the program is already created and loaded, do you need to sleep to link it?

Potentially not. But we're talking about changing the kernel, so for backward compatibility you would still struggle through whatever kprobe and jump to user space. But upstream, why not do the proper sleepable hook that is clean and can do all sorts of other stuff?

And from the TC data path side, I would love to have a dedicated TC ingress and egress hook where we don't need to go through the qdisc. It's just annoying overhead that is not necessary.

Right. For this particular use case today we use kprobe hooks and, like I said, it's clumsy and a little bit racy. So we want to get rid of that piece, and otherwise it works great. So there's no reason to try to get hooked into a queuing discipline.

Where I want to get to, from a Cilium perspective at least, is where you don't need too many changes. You can just take the existing programs as is and attach them through this new API for the bpf system call. It is still TC, but the underlying code doesn't care whether it's running on an old kernel or a new kernel; it will just work. That would be good and more flexible, and we don't need to go through all of this. The TC side has a lot of offloading stuff being added, like the skb extensions and so on, and it's just annoying; it keeps growing. This way we can make it even more efficient.

So I think the takeaway then would be to look at a create-netdev hook with a helper to attach a link to a hook in the egress and ingress paths.

Mm-hmm.
I would say revive Kumar's patches. I think they were pretty close. The kernel side, I think, was non-controversial; it was more on the libbpf side, which API it would be, and there were some comments. But if I recall, I think it was pretty close.

Yeah. I mean, from a dependency perspective, I would love to have this new lightweight TC thing first, where you work based on a file descriptor, and then it's just the natural fit for the link. Then we don't even need to deal with the legacy crap.

And then we don't have to support all the TC stuff. I remember that was some of the problem: what about all the other things that TC does? We don't care about those, and we can stop caring about them at that point. That would be great.

Yeah, so I would say the goal, if I understood you correctly, should be to allow what they call the cls_act programs, only those. And libbpf can actually use it: it will just probe the underlying kernel and then use the new one or fall back to the old one. That's also easy.

Yeah, we should just do it. I think we've been talking about it for some time, but we have an urgent need for it now. I do, anyway.

That's good. Cool, then that makes sense.

And then the next one would be: can I add probe read to the cls_act programs? Do we have it there now? Is it there? I think, did I add it already? Okay, good. Perfect.

Just use it.

What's that? Get current task, and then probe read. If probe read is not there, we should add it just so we have it.
I think it's also there.

All right, and if you have the skb we should add the dev pointer in there, so we can get the dev pointer out. That would be the next thing, because once you have the dev pointer you're off and running; that's the main hook to everything else. Okay, cool. I'm happy. That's good. Anybody else?

All right, I'll talk about sockmap for a second too. If you don't know what sockmap is, this is the blog post on it. I told you I didn't do slides.

Basically, how our system works is: the three-way handshake happens, and we have a hook in the cgroup, the established hook, that says okay, the connection has been established. We then add the socket to this sockmap, which is a map in the kernel, and when that happens the sk is extended with all this context for running a parser on the sendmsg side and the receive side.

I'll go over it quickly. There are a couple of bugs that we're fixing now that we see over long-running tests, so we'll just fix those. Then the bigger change that we want to make on this side would be to get rid of this sockmap and just have a way to say: add the psock context to this sk. The main reason is that right now we have to right-size that sockmap. In our original use cases we were using this for load balancing. I kind of come from the switch world, and I was thinking about load balancing the way you would do it in a switch, which means you build a huge table, populate all the entries in the table, and then run some kind of hashing algorithm over that. That's hand-waving a lot.
That's kind of how switches work. So that was the same idea here: you build the sockmap table of some power-of-whatever size that fits, and then you do a load-balancing function over the top of it, and you load balance into the network through these sockets. That was great for load balancing. But the problem is, if you want every socket in the system to run through this, or every socket matching some criteria, now you have to know how many sockets are going to be in the kernel. So what we were looking at doing is adding different hooks so that we can basically get rid of that map altogether, and you just say "sk add". Whenever you have a socket context, you should be able to add this additional context to it and say: now please run these BPF programs on any send and receive. In theory there's no reason even to limit it to just established; that's our use case, but in theory we could generalize it to other things. I probably won't do that unless I find some reason to. So that would become the next change.

And then, if you've looked at this API at all, we have this problem where the transmit side and the receive side are not symmetric. On the transmit side you work with sk_msg, which is this scatter-gather list, and on the receive side we still deal with skbs. The problem with that is that it just means you have a TX program and an RX program. It's not very friendly.
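As an editorial aside, the switch-style load balancing mentioned earlier, and the sizing problem it runs into once every socket has to be tracked, can be sketched as a toy userspace model. This is not the sockmap API: the function and class names are hypothetical, the power-of-two table size and the CRC32 hash are assumptions made for illustration, and real sockets are replaced by strings.

```python
import zlib


def build_lb_table(backends, size):
    """Populate every slot of a fixed-size table, cycling through the backends."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")  # assumption of this sketch
    if not backends:
        raise ValueError("need at least one backend socket")
    return [backends[i % len(backends)] for i in range(size)]


def pick_socket(table, flow_key):
    """Hash the flow key and mask it into the table, as a switch would."""
    return table[zlib.crc32(flow_key) & (len(table) - 1)]


class FixedSizeSockTable:
    """Toy model of a map sized up front that must hold every tracked socket."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = {}

    def add(self, key, sock):
        # Once the table is full, new sockets cannot be tracked at all.
        if key not in self.entries and len(self.entries) >= self.max_entries:
            raise OverflowError("table full: was sized for fewer sockets")
        self.entries[key] = sock
```

The first two functions work well when the table is a fixed set of load-balancing targets. The class shows why the same structure is awkward for tracking every socket: the size has to be guessed in advance, which is the motivation given for attaching the context to the socket directly.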
It's not very friendly The reasons for that are sort of historical It originally used the stream parser, which was like this other bit of code per KCM But now we've gotten rid of that so the next stage is to get rid of the SKB side and just use the directly work on the scattergather list directly in here So that'll I think we'll probably start working on that here pretty quick If anybody cares and they're using this stuff, you know, I guess the main things are we want to get rid of the sock map So we don't have to try to fix that size and then make these two symmetrical And then the other third thing that comes out of once you've done all this is You can just send on any socket that exists in the system You don't need to even have it in a map because the send context doesn't use any Additional metadata just does the normal send So you should be able to take any socket and just send over it from inside the kernel From the socket layer. We don't care In theory unless I figure out some reason there's a race there, but I don't think there is and Then I would say the the last piece that we've been hitting a lot of this But I but I think we got a solution for it over lunch Was we've been hitting the tail call recursion limit and some of our programs that run here because they're doing all like parsing and so I think what we'll do if we have the UN the pointers and Then with the sort of iterative loop stuff. We should be able to I think you move over the stuff that we have Or tail calls over to that and should resolve it So I'll be I'll be hacking on that seeing what happens so First problem of like getting rid of sock map that sounded like you want another Notifier in the socket create not only at net def create and Then you will attach whatever sock map the old sock map style parsers to send and receive there Let the socket create time. 
Yeah, we already have a sock create hook.

Well, have the same hook and just enable attaching BPF programs of this sockmap type. First take these two program types out of sockmap, abstract them so they're not related to sockmap, just keep them loaded somewhere, and then at socket create time attach them to that socket. That sounds like what you want.

Yeah, exactly. I think as a first step we would probably just do it inside the current hooks that we have, like the established hook. That's where we already have all the APIs for this, and at that point moving it to sock create should just be moving from one sock_ops call to a different sock_ops call.

Sure. Can you actually use it on a listen socket, on a non-established socket?

I don't know if it works. It's allowed. I should say it's tested in the CI, but I have not actually used it at any scale, so I don't know how well it works, to be honest. It was added for the UDP stuff.

Okay, but it's always a full socket, right? For the UDP stuff maybe, for TCP not, but yeah, details.

Yeah, I suspect it's probably not working correctly there. There's another question about what you should do when a socket accepts. If you attach a program to the listen socket and then it accepts, what should you do? Should you transfer the program that's on the listen socket down to this new socket, or should the new socket have no program? Right now, when it does that accept and does the sock clone,
I believe, or something like this, it doesn't propagate the programs down. So there's this gap where you went from listen to a socket that has no program. In theory it could have done a TCP Fast Open and sent some data before the secondary hook got added, so it's racy. We would almost like it, on listen, to propagate down to that new socket, and then you wouldn't have that race, because you could run the program on the fast open data. But that fast open hook doesn't exist, right? Right now that's not the same hook; that's not a sendmsg. It's actually a secondary flow of data that you need to somehow hook directly. You need that code in that path, and we don't have anything there. Right now we just audit it and go: okay, there was a fast open and we missed some data. We know we missed it because we saw the fast open happen. Or we just block the fast open: if you try to do a fast open we just, you know, return an error.

Let's say we get rid of the map. It should still work for UDP and other sockets, right?

Sorry, say that again?

You say you plan to get rid of the map, and it should still work for UDP and other sockets, right?

Hopefully, yeah. I mean, the plan is not to get rid of the map entirely. We'd still allow users to have the map, because the map is still useful as a fundamental concept for load balancing. I think Cloudflare was using it this way, as a load balancer on receive.
But allow these other use cases where you don't want to create the map. Ideally at sock create, when the socket is created, you would assign a program at that point, because really all we need is for the additional metadata to be allocated, so that when the sendmsg happens we have this buffer of space to deal with, and all the metadata about where we've parsed to, and we know what BPF program to run.

Okay, yeah. Any other questions or feedback? All right, thanks. Cool, thanks.