Hello, I'm going to talk about XDP metadata, mostly from the transmit side, but I'll start with a generic XDP intro, in case not everyone is familiar with it. XDP is this old-school style hook, probably one of the first ones for BPF, which sits in the driver, low in the stack. This BPF hook can receive the packet, do something with it, and maybe send it out. At some point we added AF_XDP, a new protocol family which, using this XDP hook, can route some packets into user-space rings, and user space can also produce packets to send out later.

It's all nice, but up until recently the XDP hook was missing a bunch of metadata. Now that we have it, the question is whether we should have something similar for TX. I have these two blue things here: one might be a transmit-side hook, one might be a transmit-completion-side hook. So I'm going to try to argue why we need that, though I don't know in which form yet.

Starting with receive-side metadata: when you receive a packet, along with the data you can get some metadata from the NIC, because at 100G or 200G it's basically too expensive for the CPU to do everything. So we have some offloads, and the NIC says: here's the flow hash for this packet; here's maybe the checksum, or maybe the checksum is already verified and you don't need to do anything; here is the hardware timestamp for the packet. We've recently added this for XDP, so from the XDP context you can read this metadata and do something with it.

The question now is what we should do for TX, because for TX we would also like, maybe from the AF_XDP context, to signal something to the NIC and say: you need to do some offloading. For example, you can offload checksums, or segmentation so the NIC can split the packet, and you can also receive transmit timestamps.
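The receive-side idea described above (the NIC attaches a flow hash and a hardware timestamp to each packet, and the program reads them out) can be pictured with a small, self-contained C sketch. Everything here is an assumption made for illustration: the descriptor layout, the `demo_*` names, and the callback table are hypothetical and are not the kernel's actual structures or kfuncs. The only point is the shape: a per-device callback looks at the receive descriptor and either returns the value or reports that the hardware did not provide it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical RX descriptor; every real NIC defines its own layout. */
struct demo_rx_desc {
    uint32_t flags;        /* validity bits, see DEMO_RX_* below */
    uint32_t rss_hash;     /* flow hash computed by the NIC */
    uint64_t hw_timestamp; /* receive time in nanoseconds */
};

#define DEMO_RX_HASH_VALID (1u << 0)
#define DEMO_RX_TS_VALID   (1u << 1)

/* Illustrative per-device callback table: one instance per driver. */
struct demo_rx_metadata_ops {
    int (*rx_hash)(const struct demo_rx_desc *desc, uint32_t *hash);
    int (*rx_timestamp)(const struct demo_rx_desc *desc, uint64_t *ts);
};

/* Return the flow hash, or -ENODATA if the NIC did not fill it in. */
static int demo_rx_hash(const struct demo_rx_desc *desc, uint32_t *hash)
{
    assert(desc && hash);
    if (!(desc->flags & DEMO_RX_HASH_VALID))
        return -ENODATA;
    *hash = desc->rss_hash;
    return 0;
}

/* Return the hardware timestamp, or -ENODATA if none was recorded. */
static int demo_rx_timestamp(const struct demo_rx_desc *desc, uint64_t *ts)
{
    assert(desc && ts);
    if (!(desc->flags & DEMO_RX_TS_VALID))
        return -ENODATA;
    *ts = desc->hw_timestamp;
    return 0;
}

/* The "driver" publishes its callbacks through the table. */
static const struct demo_rx_metadata_ops demo_rx_ops = {
    .rx_hash = demo_rx_hash,
    .rx_timestamp = demo_rx_timestamp,
};
```

A caller would go through `demo_rx_ops.rx_hash(&desc, &hash)` and treat a negative return as "this packet has no such metadata", which keeps the program itself independent of the descriptor layout.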
There are a lot of things we can do. I put Toke's queueing idea on here as well; I'm only mentioning it, because it's out of scope for me. But there's a bunch of things we can do on transmit as well, and we either need a hook or something else to grab all the data. Again, the argument for why we need it is mostly that the rest of the stack is already using it: the SKB path uses all this metadata one way or another. To reach feature parity on the XDP side and to fully support efficient AF_XDP use cases, we might need to expose it to the XDP context.

Currently we have the receive side implemented. There is this framework: when you load the program, you can say that it is bound to a specific netdev, and that netdev implements a bunch of callbacks that provide this metadata. The callbacks usually look at the NIC receive descriptors, parse something out, and return it, and that's it. We have a couple of examples of receive metadata: receive hash and receive timestamp. Over time this can grow, but I think we have the foundation, which is those two things: the netdev-bound programs and the kfuncs.

On the receive side, as I mentioned, there may be more metadata, but it will grow organically: when people need it, we will add it. For the transmit side, I think we're missing a lot of things. We don't even have the hooks, and we don't have the metadata.

What I need specifically is a super small subset. For my use case I need receive hardware timestamps and transmit hardware timestamps. But solving only this particular problem is too small and probably won't fly, so I'm trying to present it as a larger effort, where we have generic receive metadata and generic transmit metadata, and I'm solving the timestamps part for myself.

Fundamentally, here is what we need, and it doesn't have to be hooks.
It might be fentry or whatever, but we need two points. We need one point where we run BPF at transmit, before the packet goes out, and we need a second point after the packet has been transmitted, when there is an interrupt from the NIC or some other completion signal, and there we also want to be able to access the descriptors to read out the timestamps.

I'm specifying XDP here, which I know might be confusing. Initially I was thinking about doing full-blown XDP at transmit; I know there's been an effort like that. But we still need to decide how much of XDP we really want on TX.

One other thing, which I probably don't need: if there is an AF_XDP producer/consumer, do we need access to the AF_XDP packet ring at the transmit side or the transmit-completion side, in case you want to put something into the ring to signal to the user? For my use case I'm not considering it; I'm just raising it here for completeness' sake.

As for the way to receive the transmit metadata, for me the natural thing would be to apply the same idea as on receive: have a bunch of kfuncs that the drivers implement, so it's all nicely abstracted.

I've considered a bunch of things. The first one was doing something AF_XDP specific, because that's what we use, but it seems too narrow, because why should it be AF_XDP specific?
What if I just run an XDP program and I want to get the same completion signals and so on? Another thing I was toying with was doing XDP-like hooks at transmit, but I guess it's too complicated, or maybe I'm not smart enough. A lot of baggage has been added to XDP. As I said, it's this old-school, attach-point-based thing with too many helpers, and the helpers have a bunch of assumptions about everything. So what I ended up doing is a HID-BPF-like lightweight thing with a bunch of attach places.

With the approach I'm still toying with, we would have device-bound tracing programs, like we have device-bound XDP programs. You would be able to attach those tracing programs only to specific places in the drivers. We would do the same checks at attach time to guarantee that you're not attaching the program just anywhere. But we would get the same nice things: you can use kfuncs, those kfuncs are resolved to the particular netdev, and they are efficient, so we don't waste cycles on indirect calls.

So my proposal, which I'll probably send upstream soonish, again looks like HID-BPF. You have BPF syscall programs, and we have a bunch of kfuncs to say: attach this program FD to the TX hook, attach this program FD to the TX completion hook. When you're loading the tracing program, you can say that it is device bound, you get the metadata kfuncs, and you can use them.

I have a question here. You're hooking into almost the same spot that XDP hooks on ingress, sort of the symmetric opposite.

Yeah.

But you're not calling it XDP, you're calling it tracing hooks. To me it just sounds like you named it something different. Why don't we just call it XDP TX?

XDP TX is confusing. There's already XDP_TX, in the sense that you can send out the packet, right?
So I'm calling it XDP egress.

Sure, XDP egress is fine. If you go back one slide, the thing that caught my ear was that you said there might be multiple hooks. What's wrong with just one hook, an XDP egress hook? So you want one for completion as well, because you want a callback on the completion.

Yeah.

Do you need both? I mean, I need the completion one, but I think you don't need the other one.

I actually probably need the other one, because for a TX timestamp what you usually do is prepare the packet and say: I need it. You request it and then you get it.

And you want to insert a timestamp into the TX descriptor before it's submitted to the driver?

I want to ask the NIC to put the timestamp into the TX completion.

Oh, okay. So you're not inserting a timestamp for the NIC to submit in the future. You just want to set the bit on the descriptor, so that when the callback comes back the NIC has put the timestamp in the descriptor, and then you want to be able to read the descriptor.

Yeah.

Okay. Personally, I think having access to the descriptor is super useful. We have a lot of cases with hardware where, if we could look at the descriptors on the TX callback, and on ingress too, to be honest, and in poll, there's useful stuff in those descriptors: any error cases, the number of bytes submitted. There's just lots of useful information that we would collect and then report back as statistics. It's going to be per device, and that doesn't bother me. I don't mind writing per-device programs; I generally know what the devices are. I know some people were upset about that at one point, but if you have a Mellanox card, you've got to have a Mellanox program, right?
Or Intel, you've got to have an Intel program.

What I would like here, from my point of view, is to preserve the same things we have on RX, where you have kfuncs that give you nice abstractions, but if you know what you're doing you can get access to the raw descriptors and go wild.

Yeah. The only trouble we had with kfuncs is that they require sort of non-hardware-specific things, right? Sometimes it's hard to abstract. From my side it's generally easier to say: just give me the descriptor, and I'll read the manual and figure out what to read. But I think the kfuncs are also useful for portability, I suppose, if that's valuable.

So there are two cases of xmit. One is XDP_TX, the actual XDP xmit, which is different from the one I think you want to hook into. Your hook is in the xmit, right before and after completion, when the packet comes from the normal networking stack, where you have an SKB.

That's right.

That's why you want to look lower, right?

Yeah, it's lower.

You want to be at the driver, you want to look at the descriptor. When user space does whatever over TCP, the SKB goes through everything, you have a nice big packet, and the NIC was probably asked to do TSO, but you just want to look at the descriptor. So this is not really XDP. I agree with you that this is more like tracing hooks, because there is no xdp_md here. There's none of the abstraction we have on the RX side with XDP and xdp_md; here you have an SKB, right?
So this is more like the new netfilter hook, where netfilter is still a stable program type, well, stable-ish, with the hooks placed in the networking stack, but access to everything is tracing style: netfilter has access to the netfilter state and the SKB. Here you probably want access to the SKB as well, just in case you want to decide, based on packet contents, on bits in the SKB or bits inside the packet, whether you want this timestamp or not. So it's more tracing-like, and XDP, even the XDP names, I think don't fit here. XDP egress also doesn't quite fit. So I would proceed with this HID-BPF-style tracing hooks approach. Looking at the TX descriptor makes sense, right before the transmit and after completion. All of that makes sense.

But I would also think about what your answer is to XDP-style transmit, whether you want anything to be there or not. If you don't, that's also fine, but you need to articulate that this is a separate xmit. Typically in the driver it's also done differently: there are completely different queues for XDP_TX, the one that XDP takes directly, and I think XDP redirect also uses them, and separate transmit queues for the stack. So as long as it's all described and documented, it looks fine.

Yeah. You're talking about the SKB, but one more case is AF_XDP: do we want this there as well at some point? For AF_XDP we need a way to say, I want this packet to be chunked and split up, or I want a tunneling offload, right? We need some vehicle for this to trigger some kernel code. This might also be it.

Right, depending on, yeah, it may be a third thing, right?
Mainly because there are different queues. XDP_TX is one set of transmit queues, the stack is another. I forgot how AF_XDP does it, whether it's using one or the other. I think it's using the standard stack queues, or maybe it's a third set of queues; it also depends on how the drivers do it. In theory, yeah, AF_XDP also makes sense to cover, but I wouldn't try to fit all of them into one narrow bottleneck, just because they're so different.

Yeah, I would actually keep the SKB path out of this and make it an XDP redirect / AF_XDP hook only, and hook it so that you have an XDP frame. Right now, when you do an AF_XDP TX, it builds an SKB, but it really doesn't need to; it could just build an XDP frame, so that you operate on the XDP side only. If I remember correctly, including the SKB path was also what stranded the XDP TX effort the last time, because it was simply too complex, with TSO and all the different paths that an SKB can take. If we have an XDP frame, that becomes a lot simpler. And I think this is also useful for XDP redirect, for [inaudible] and so on.

I was just going to say that my experience is that the SKB path with the TX descriptors is likely more useful for my use cases, sort of the opposite of what you're saying.

Right, okay. Well, that implies we need both, I guess.

Yeah, it's just different use cases.

One problem with doing it as a tracing-type hook is that you don't attach to a net device, you attach to a driver, and then you have to disambiguate between different netdevs yourself.

Right, and that doesn't bother me in my use case, because what I really want is some info from that TX descriptor, and actually almost more useful is the callback descriptor, which has more information about what the NIC did. But this is more of an observability, security, and debugging kind of thing than a traditional XDP-like fast-path redirect, right?
I'm not actually going to redirect anything. I'm not trying to insert myself into the routing stack in any way, right? I'm just trying to collect statistics about the system so I can understand things about the NIC: buffers are full, we did TSO or we didn't do TSO, all that stuff that's in the TX callback descriptor, and create useful metrics from that.

I would agree that the completion hook is the most useful one in the short term, and I wouldn't mind doing that one first either.

Okay, yeah. I guess I'll send both and we can argue about which one makes sense. We're using a lot of proprietary stuff. I think this will most likely be used with gVNIC in the cloud. I'll try to get a ConnectX-5.

Another related thing I wanted to mention, now that we're talking about queues, is the fact that we don't have access to any kind of TX queue object from AF_XDP or XDP redirect, and I've been thinking about ways of doing that. Have you looked at that at all while you were looking at this?

Okay. If you want, this would be useful for us, I think most near-term, on a Mellanox NIC. So if you want to send the RFC or something, I could probably look at implementing something on Mellanox as well.

Okay. Anything else? I have a bunch of slides, but I guess they mostly cover the things we've already discussed.

Yeah, I did not get why you needed the syscall program type.

Look at HID-BPF; that's the way they attach. But you don't have any kind of attach API; you write a syscall program that calls a bunch of kfuncs that do the attach.

Okay, that's it for me. Thank you.
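The two-point flow that came up in the discussion (set a request bit in the TX descriptor before submission, then read the timestamp out of the completion descriptor) can be sketched in plain C. This is a minimal illustration under stated assumptions, not the proposed kernel interface: the descriptor layouts, the flag bits, and all `demo_*` names are hypothetical, and `demo_nic_complete` merely stands in for what the hardware would do.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical TX and TX-completion descriptors. */
struct demo_tx_desc {
    uint32_t flags;    /* request bits, see DEMO_TX_REQ_TSTAMP */
    uint32_t len;      /* payload length in bytes */
    uint64_t buf_addr; /* DMA address of the payload */
};

struct demo_tx_cmpl_desc {
    uint32_t flags;        /* status bits, see DEMO_CMPL_TS_VALID */
    uint64_t hw_timestamp; /* transmit time in nanoseconds */
};

#define DEMO_TX_REQ_TSTAMP (1u << 0)
#define DEMO_CMPL_TS_VALID (1u << 0)

/* Point 1, before submission: ask the NIC to timestamp this packet. */
static void demo_tx_request_timestamp(struct demo_tx_desc *desc)
{
    assert(desc);
    desc->flags |= DEMO_TX_REQ_TSTAMP;
}

/* Stand-in for the hardware: fill the completion descriptor, recording
 * the timestamp only when the TX descriptor requested one. */
static void demo_nic_complete(const struct demo_tx_desc *tx,
                              struct demo_tx_cmpl_desc *cmpl,
                              uint64_t now_ns)
{
    cmpl->flags = 0;
    cmpl->hw_timestamp = 0;
    if (tx->flags & DEMO_TX_REQ_TSTAMP) {
        cmpl->flags |= DEMO_CMPL_TS_VALID;
        cmpl->hw_timestamp = now_ns;
    }
}

/* Point 2, on completion: read the timestamp back, or report -ENODATA
 * if the NIC did not record one for this packet. */
static int demo_tx_read_timestamp(const struct demo_tx_cmpl_desc *cmpl,
                                  uint64_t *ts)
{
    assert(cmpl && ts);
    if (!(cmpl->flags & DEMO_CMPL_TS_VALID))
        return -ENODATA;
    *ts = cmpl->hw_timestamp;
    return 0;
}
```

The sketch shows why both points matter for timestamps: without the first step the completion descriptor carries nothing to read, while a pure statistics use case could rely on the completion side alone.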