Magnus is still in the other room, so I'll take the first part of the presentation. So what is this XDP? It's a new programmable layer in the network stack, sort of in front of the network stack, and we are seeing similar speeds to DPDK. We'll get into more details on that, and we actually have performance comparisons. XDP also ensures that the network stack stays relevant: it operates at layer 2 to layer 3, while the network stack operates at layer 4 to 7. I'll get into a bit more detail.

I want to admit that we're not the first mover; there are other solutions. But we believe XDP is different and better, because our killer feature is that we are integrated with the Linux kernel, and we also have flexible sharing of the NIC resources.

A little more detail on what XDP is: it is an in-kernel fast path, this programmable layer in front of the traditional network stack. It is already part of upstream kernels, and actually also RHEL 8. It operates at the same speed and the same level as DPDK does, and we are seeing these 10x performance improvements. One of the points of being in-kernel is that we can accelerate certain use cases inside the kernel, for example forwarding; I'll get into more details about that. In the second part of the presentation we're going to talk about AF_XDP, which is an address family for XDP sockets. It is what I categorize as a hybrid kernel-bypass facility, because we allow you to choose which packets should bypass the kernel, all the way down in the driver layer, and deliver them into a socket accessible from user space. We have the flexibility of the BPF program choosing what we want to redirect out of the kernel and what not.

So why is XDP needed? Basically because the network stack has been optimized for layers 4 to 7, and we pay for that: once a packet reaches the network stack it gets a socket buffer, the SKB. It is named the socket buffer because, back when it was written some 25 years ago, the assumption was that everything has to go into a socket. But there are certain layer 2 and layer 3 use cases where we can do something faster and avoid that overhead, and that is the layer XDP operates at.

Again, I want to admit we are not the first mover here, but we believe it's different and better. There are actually a lot of kernel-bypass solutions out there: netmap; DPDK, which I think is the most prominent one at the moment; PF_RING; I think there are also people in the room who made some of these, long before we had something called XDP. Google has Maglev, where they suddenly published a paper after DPDK came out and claimed they had been doing this for many years. There's OpenOnload, there's Snabb Switch, which is also represented here, and there's actually a solution very similar to XDP, a commercial one from HAProxy called NDIV. For all of these kernel-bypass solutions, we hope we can find some way to integrate with them, and maybe they can use AF_XDP inside their solutions, so people who have been writing software for those systems can keep using it on top of AF_XDP.

So why is it different and better? Well, it's not a bypass, it's an in-kernel fast path, and the really key feature is that it's integrated into the Linux kernel. We are leveraging the existing ecosystem, which, as everybody here knows, is fairly strong.
We also get sandboxing from the eBPF code: you get a lot of flexibility, you don't have to recompile your kernel, you can put in these snippets of code that do just what you need without adding too many instructions, and you still keep the flexibility of doing this. It was also an important design criterion that we have very flexible sharing of the NIC, so we can pick and choose whether a packet travels into the network stack or we do something else with it. We have cooperation with the network stack through helpers, we can do fallback handling, and by running in the kernel we get access to kernel objects: we can do the route-table lookup I'm going to talk about a little later. In recent kernels you can even do a lookup from XDP to check whether there's a socket that matches this packet. Right now we still don't allow you to directly manipulate the socket object you get back, but that's something we'll add later; you can manipulate socket objects in the TC hook, where you can do the same lookup. People use this when they want to bypass the kernel, but not when the kernel owns the socket: the only thing the XDP program does is a socket lookup, and if the socket is already in use by the kernel you send the packet to the kernel, otherwise you use the bypass facility. That leads on to AF_XDP, which is a flexible kernel bypass: we can deliver raw frames to user space while leveraging the existing in-kernel NIC drivers and the ecosystem that maintains them.

I put in a red slide to shock you here, so you sort of wake up, because it is fundamental to understand that I see this as a building block that you should use. What do I mean by that? I see XDP as a component; it is a core facility we provide from the kernel, but I need you to actually pick it up, use it, and put it together to solve a specific task. By putting fully programmable pieces in there, I'm not dictating what you can and cannot do; go and invent something I couldn't imagine. It's not a product in itself, and I really want existing open source solutions to adopt it, and there will also be new ones that use these components. XDP is a hot topic right now because we can do all these millions of packets per second, but the real potential comes when we combine it with the other BPF hooks that exist in the kernel. We can construct network pipelines by using these different BPF hooks, and that is actually what the Cilium project is doing: it's primarily for containers, and they combine all these different components.

Now I'm going to talk about some of the use cases where XDP has already been used, and for each of them I'll talk about the new potential and opportunities, relating it to VMs and containers. How can this work out? (I'm speaking pretty fast, I think.) One of the most obvious use cases is DDoS protection; that was the first thing we implemented. We can drop packets really fast, because we haven't spent a lot of CPU cycles on them yet: XDP happens down in the driver layer, where we are allowed to run our BPF program, which is our XDP program.
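For a feel of what such a program looks like, here is a minimal sketch of an XDP drop filter (the blocked UDP port is a made-up example; a real DDoS filter would consult BPF maps populated from user space):

```c
/* Minimal XDP sketch: drop UDP packets to one hypothetical blocked port,
 * pass everything else to the normal network stack. Compiled with
 * clang -O2 -target bpf and attached at the driver level. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_drop_port(struct xdp_md *ctx)
{
	void *data     = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;

	struct ethhdr *eth = data;
	if ((void *)(eth + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	struct iphdr *iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;
	if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
		return XDP_PASS;        /* skip packets with IP options */

	struct udphdr *udp = (void *)(iph + 1);
	if ((void *)(udp + 1) > data_end)
		return XDP_PASS;

	if (udp->dest == bpf_htons(9999))   /* hypothetical attack port */
		return XDP_DROP;            /* dropped before any SKB exists */

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```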
Facebook have been the front runners here. They've also contributed a lot upstream, and they hired a BPF maintainer, but for about one and a half years now, every packet going to Facebook has gone through XDP in some way. Cloudflare recently switched to XDP too, and they changed their NIC vendor in the process; before that they were running something proprietary with OpenOnload from Solarflare.

Some of the new potential in DDoS protection for containers and VMs: a container cluster, like a Kubernetes or OpenShift cluster, is not something you would expose directly to the internet, because that would be fairly dangerous, but you could load an XDP program on the host to do denial-of-service protection. You don't need another hardened box in front of the cluster to do the filtering, because we can handle wire-speed packet rates at the XDP layer; we just load it on the host that the containers run on, and it can protect the containers. For VMs it's the same story: on the host you can load an XDP program to protect the guests, because it's fairly expensive to transfer a packet all the way into the guest only to discover it should be dropped, and that's an easy way to overload the system. The last one here is work in progress: Michael Tsirkin, every time I meet up with him, wants the guest to be able to ask the host operating system, via the driver, to load an XDP program on its behalf. There are some security concerns there, which is why we haven't allowed it yet, but it's a really interesting possibility that a guest could ask the host to load a filter, a denial-of-service filter.

We also have the use case of layer 4 load balancing. This is what Facebook is doing. They used to use something called IPVS, a load balancer in the kernel; I'm even the maintainer of the user-space software for it, and even I recommend not using it. What they did was switch to XDP, and they reported a 10 times performance improvement. Not only that, they could remove some of the machines dedicated to load balancing, because they do the load balancing on the target machines themselves and shoot packets over to the others, so there's no central point of failure. They even open sourced it; it's on GitHub and called Katran, if I'm pronouncing it correctly; I think it has something to do with a fish.

The new potential here is that we could also load balance into VMs or containers. For VMs, we can use the redirect action at the physical NIC layer and redirect into a guest NIC. This way we avoid allocating the SKB in the host operating system, which is only used to push the packet into the guest; that's actually a really big performance improvement. Not a lot of solutions are using this directly yet, but I'll talk more about how it could be materialized. It's actually in the kernel now: in the tun/tap driver we can redirect raw frames, and the performance is quite excellent. What I'm seeing in my performance tests is that we're now limited by how the guest gets scheduled; it has become a scheduling problem, not a question of how many packets per second we can push in. Containers are a little more difficult for XDP, because containers really need the SKB structure allocated.
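To make the redirect-into-a-guest idea concrete, the XDP side is essentially one map lookup. A rough sketch, using the older bpf_map_def map-definition style of that era and assuming user space has written the target ifindex into slot 0 of a devmap:

```c
/* Sketch: redirect frames arriving on the physical NIC straight into
 * another netdev (e.g. the tun/tap device backing a guest), skipping the
 * host SKB allocation. Slot 0 of the devmap is assumed to hold the target
 * ifindex, written there by a user-space loader. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct bpf_map_def SEC("maps") tx_port = {
	.type        = BPF_MAP_TYPE_DEVMAP,
	.key_size    = sizeof(int),
	.value_size  = sizeof(int),
	.max_entries = 1,
};

SEC("xdp")
int xdp_redirect_to_guest(struct xdp_md *ctx)
{
	/* If the map entry is missing, the redirect fails and the packet is
	 * dropped; a real program would filter first and XDP_PASS the rest. */
	return bpf_redirect_map(&tx_port, 0, 0);
}

char _license[] SEC("license") = "GPL";
```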
One funny, interesting thing you can do is redirect from the physical NIC into a veth device, which is what containers have. You need a fairly recent kernel version, but veth got what you'd call native XDP support, so it can bypass the host network stack, and it is the veth driver that allocates and builds the SKB. There's a small performance optimization from skipping some code, but I find it more interesting that you can run another XDP program on the veth and redirect into another container. That way you could build interesting proxy solutions that work at this L3 layer: you could install a container that only does proxying for the other containers, instead of having to install that on the physical host. Interesting use cases; let's see if anybody does this, it's not something that has been done today.

[Audience: could this approach be used to accelerate a VPN, for example OpenVPN?] For TLS I would recommend something else, kernel TLS, where we can accelerate that; for OpenVPN I don't know exactly what kind of crypto it uses. To repeat the question: could you use this for accelerating something like OpenVPN? Some of that I wouldn't use XDP for; I would use some of the other TC hooks to accelerate it. Actually, I think I'll move on.

It's actually fairly easy to misuse XDP in the same way you can misuse these bypass solutions. Instead, I want people to be smart about integrating XDP into existing open source solutions and leveraging the existing ecosystem for the control-plane setup. The trick you need for that is implementing some BPF helpers. A BPF helper is something you add into the kernel: with BPF you load your program into the kernel and you're completely flexible there, but once we add a helper it becomes a stable API that the kernel provides. A good example of what you can use helpers for is on this slide. The general way I want people to think is to see XDP as a software offloading layer that can accelerate part of the network stack. By doing that you could, for example, take the route lookup, which we have already done: we have the FIB lookup helper, exported as a helper function you can call from XDP. What happens is that you do your normal routing setup exactly as you have it today; you install your routing daemons, your BGP daemons, whatever, and you let the kernel handle all of that, including the neighbour table, the ARP lookups. In XDP, when the first packet comes in you do the lookup, but the entry is not in the ARP table yet, so the lookup fails, and you return XDP_PASS. The packet goes to the normal network stack, which figures this out, does the ARP resolution, sets up a timer for when it has to refresh the ARP entry, and so on. For the next packet that comes in, the ARP resolution has been done, so the lookup succeeds and gives back the next hop, the MAC address and the egress ifindex, and from XDP we can then shoot the packet out directly from this level.
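A sketch of what that accelerated forwarding path can look like in code, loosely modelled on the xdp_fwd sample in the kernel tree (IPv6, the IP checksum update and most error handling are left out):

```c
/* Sketch of XDP forwarding via the kernel FIB: if the route and neighbour
 * are already resolved, rewrite the MACs and redirect; otherwise XDP_PASS
 * so the stack does the ARP resolution for us. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#ifndef AF_INET
#define AF_INET 2		/* matches the kernel's AF_INET value */
#endif

SEC("xdp")
int xdp_fwd_fib(struct xdp_md *ctx)
{
	void *data     = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct bpf_fib_lookup fib = {};
	struct iphdr *iph;
	int rc;

	if ((void *)(eth + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;
	if (iph->ttl <= 1)
		return XDP_PASS;

	fib.family      = AF_INET;
	fib.tos         = iph->tos;
	fib.l4_protocol = iph->protocol;
	fib.tot_len     = bpf_ntohs(iph->tot_len);
	fib.ipv4_src    = iph->saddr;
	fib.ipv4_dst    = iph->daddr;
	fib.ifindex     = ctx->ingress_ifindex;

	rc = bpf_fib_lookup(ctx, &fib, sizeof(fib), 0);
	if (rc != BPF_FIB_LKUP_RET_SUCCESS)
		return XDP_PASS;	/* let the stack resolve ARP etc. */

	/* Next hop is known: rewrite MACs, decrement TTL, send it out.
	 * A real program must also update the IP checksum here. */
	__builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
	__builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);
	iph->ttl--;

	return bpf_redirect(fib.ifindex, 0);
}

char _license[] SEC("license") = "GPL";
```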
So that's the way of accelerating the existing network stack routing, IP routing. It works for both IPv4 and IPv6, and we're going to add some more, for example MPLS, but right now it's IPv4 and IPv6 you can route this way. It's a very, very simple program you have to load at the XDP hook. The next obvious target is bridge lookups: you could have a helper that looks up in the bridge FDB, the forwarding database, and then you can accelerate bridge forwarding. That ties into my other point about accelerating into VMs: if your VMs are connected to a bridge, usually the normal Linux bridge, which is actually not very fast, we could accelerate that directly from XDP without much other user-space code around it.

When people start to play with this, I also want to mention how you actually transfer information between XDP and the network stack. One trick is that you can modify the headers before you hand the packet to the network stack: even though you return XDP_PASS, you can modify, push and pop headers, and that way influence which receive handler the network stack will use. That means you can, in principle, have the kernel handle a protocol encapsulation the kernel doesn't know about; you still have to do some work on the transmit side too, of course. Another trick, which Cloudflare uses, is to take the source MAC address, which isn't used in the lookups any more, and overwrite it with a special value; they use it to sample dropped packets. They want to sample some of the packets their denial-of-service system is dropping, so they tag them in XDP and later catch them with an iptables rule to log them. Then we have something called metadata, which is placed in front of the payload: XDP can write just in front of the payload (it can also extend the headers, of course), but if you don't want the network stack to see it, and it's just some metadata you created at the XDP level, you can use these 32 bytes. The other hooks, like the TC eBPF hook, can use it to update fields in the SKB, but you can also store other information there, and for AF_XDP, the raw frames we deliver to user space also carry it in front of the payload, so you can get this information there as well. We have some interesting ideas about getting the hardware to fill in this metadata, for example a unique ID for every flow, which is something the hardware can provide.

A very interesting case is OVS, Open vSwitch. William from VMware actually implemented three different ways of integrating it with eBPF and presented them at Plumbers. He did a full re-implementation of OVS in eBPF, which I thought was a bit problematic: he basically re-implemented the whole thing and needed several tail calls to fit in all the code handling the different cases. That was putting too much code on the BPF side, which I think was a mistake in itself; you should be a bit smarter and use the second approach, offloading a subset to XDP, keeping the corner cases out of the fast path but falling back to the stack for them.
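The metadata area is driven by a single helper, bpf_xdp_adjust_meta(). A minimal sketch of the XDP side (the mark value is made up; a TC eBPF program or an AF_XDP consumer would read it back from the area in front of the payload):

```c
/* Sketch: reserve 4 bytes of metadata in front of the packet in XDP and
 * stash a flow mark there, then pass the packet on as usual. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_store_meta(struct xdp_md *ctx)
{
	__u32 *meta;

	/* Grow the metadata area by 4 bytes; a negative delta moves
	 * data_meta down into the packet headroom. */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
		return XDP_PASS;

	meta = (void *)(long)ctx->data_meta;
	if ((void *)(meta + 1) > (void *)(long)ctx->data)
		return XDP_PASS;

	*meta = 0x42;		/* hypothetical flow mark */
	return XDP_PASS;	/* the TC hook / stack sees the packet as usual */
}

char _license[] SEC("license") = "GPL";
```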
William didn't succeed with that offload approach, because he was missing some of the helpers I talked about before; he should have argued for adding helpers to do a lookup in the OVS kernel flow table. What he also did was implement AF_XDP integration with OVS, which showed huge performance gains, and I think that's the direction they're going in now. We're going to hear a bit about AF_XDP in just a minute. I'll hand over.

[Magnus:] Thank you. You gave me a couple of minutes at least. Okay, so I was actually up here a year ago, together with Björn, presenting AF_XDP for the first time, and at that point it was an RFC of dubious quality; we used it to scare children. But a lot has happened since then. It actually got into the kernel in 4.18 in August, and the first two zero-copy drivers got in in 4.20 in December. So tons of stuff has happened.

What is this part of the talk going to be about? Three things. I'm going to show you where we are performance-wise, I'm going to show you some of the use cases and tie them into what Jesper talked about, and then tell you what we're going to focus on this year and try to get into the kernel. It's not going to be about how AF_XDP works; if you want to know that, you should have attended last year, sorry. You can go back to the Linux Plumbers Conference paper that was published in November and read it there, or just talk to Björn or me, or to Jesper, he knows how it works. There are other people in the room too; Ilias, you know how it works as well. Just talk to people and we can explain it.

I'll go through a little bit of the basics; Jesper already covered some of this. What AF_XDP really is, is a way of getting packets from XDP out to user space very, very quickly, completely unmodified: whatever XDP does with the packet is what you see in user space. And it is an option: when you write the XDP program, you can direct your packet into the kernel stack with XDP_PASS, you can redirect it to another NIC driver and send it out there, and we added an option to redirect it into user space. You can target exactly which socket you want to redirect to, so you can actually write a load balancer in XDP that balances packets across AF_XDP sockets. Really nice.

But what we're really going to target here is performance. So where are we now? I'm going to start by showing you where we are with the code that's in kernel.org at the moment, so 4.20. The methodology is that we have a regular Broadwell server at 2.7 GHz, running Linux kernel 4.20 with all the Spectre and Meltdown mitigations on; we have not turned any of that off. We use two NICs, two Intel 40-gigabit NICs (i40e), because we actually need two to show these numbers, and two AF_XDP sockets per NIC, so four queues in total in these benchmarks. A commercial load generator is blasting at these NICs at 40 gigabits per second per NIC, full blast. We'll start with where we are with the current code base: these are the zero-copy drivers you find in 4.20, and they were not optimized for performance.
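Coming back to the redirect-into-a-socket point for a moment: the XDP program that feeds AF_XDP sockets is tiny. A sketch, again in the older map-definition style, keyed by the RX queue the packet arrived on:

```c
/* Sketch of the XDP side of AF_XDP: redirect each packet to the AF_XDP
 * socket registered for its RX queue. User space puts its sockets into
 * the XSKMAP (the libbpf helpers discussed later do this for you). */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct bpf_map_def SEC("maps") xsks_map = {
	.type        = BPF_MAP_TYPE_XSKMAP,
	.key_size    = sizeof(int),
	.value_size  = sizeof(int),
	.max_entries = 64,
};

SEC("xdp")
int xdp_sock_redirect(struct xdp_md *ctx)
{
	/* Key by RX queue index; a load balancer could hash the headers
	 * instead and spread flows across several sockets. */
	return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
}

char _license[] SEC("license") = "GPL";
```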
We were just so happy to get them in at all: just get it to work and get it in there. But I'm going to show you how we compare against AF_PACKET, because today, if you use Linux and you want raw packets in user space, you use AF_PACKET. AF_PACKET is the purple stuff at the bottom there, which you can barely see. [Audience: where is AF_PACKET used, tcpdump?] Yes, tcpdump, anything using libpcap, and lots of other applications, but that's the usual case. The green one is AF_XDP in zero-copy mode. On the Y axis you have millions of packets per second, with 64-byte packets, and along the bottom you have three micro-benchmarks, and they're really simple. The first one, rxdrop, on the left, just tries to receive packets as fast as possible: it doesn't touch the data, just receives and drops. Anything you actually do with RX is going to be slower than this, because it does nothing but receive. txpush is the same thing for TX: it uses pre-generated packets and just tries to send them as quickly as possible, without touching the data, so anything doing real TX will be slower than that; it's the fastest you can go. And then we have another toy application that actually touches the data: an L2 forwarder that receives the packet, does a MAC swap, so it touches the header, and sends it out on the interface again. The first two do not touch the data, the last one does.

As you can see, we're somewhere between 15 and 25 times as fast as AF_PACKET, so you can actually do packet sniffing on a 40 gig interface. Well, not if you're being sent 64-byte packets, obviously, but with larger packets you can actually do it now with AF_XDP zero copy. [Audience: is that with one core?] We'll get to that; it's actually two cores, but in two slides we go from two cores to one core, I'll show you. It's basically one core of work, because the application is hardly doing anything, but two cores are used: the second one just runs the softirq. We'll get to that in two slides.

So what we did during the fall was scratch our heads and try some performance optimizations on this code to see where we could get; we presented that at the Linux Plumbers Conference. The previous results are the purple-bluish bars on the left, and the green bars are the patches we have in-house and are now trying to upstream. As you can see, with the green ones we can get up to around a 150% increase in performance in some of these benchmarks with the optimizations we have now. Some of them are really simple, and we've already upstreamed some; others are more complicated. But this is what you can get now: we're talking about receiving 39 million packets per second with one application core and one softirq core, and on TX we can get all the way to 68 million packets per second, which is pretty impressive, I think. When you start to touch the data, of course, it drops, because you have to bring headers and such into your cache, and we're down to 22.
So the improvement here ranges from roughly 80-95% up to 150% compared to what we have now in the 4.20 kernel. If you look at the green numbers, it's starting to look pretty good. But we'll get to the question you had there: how are we actually running this? There's a link down here to the paper, and the paper will tell you all about the performance optimizations we did to get to this; I'm not going to go through them today, but please click on the link, download the paper and take a look.

So how do we actually run this? That gets to your question. Currently, in the benchmarks I showed you, we run in what I'll call run-to-completion mode. We have two cores: one core just runs the softirq, receiving packets in NAPI mode, and the other core runs the application. In these puny little micro-benchmarks the application core basically does nothing; it just receives a packet and drops it, or just sends them, so it's very lightly loaded, while the softirq core is 100% loaded. That's the bottleneck. But really, this is not how you would like to run it, because you waste a whole application core spinning: it's busy polling at 100%. It seems most people don't want to do that; they want something where you can actually sleep, call a syscall, go into the kernel, and if there's nothing to do you can schedule in something else, or power save, or whatever you want to do.

And we can do that. On the right there's what I'll call the poll() syscall mode. If you take the file descriptor of the socket and hand it to poll(), there's something called busy poll; it's really confusing, it's the busy-poll mode of poll(), but I'm just going to call it the poll() syscall. What it does is that when you call poll(), the code itself drives the NAPI context in the driver: the application calls poll(), you go down into the kernel, it runs the driver in the same context, gets some packets, returns to the application, and the application reads them. What you get is more of a DPDK way of doing it: one core driving the whole thing, the same core running the application, the RX/TX polling, the NAPI and so on. So it uses only a single core. The difference from DPDK, of course, is that in DPDK all of this runs in user space with no mode switches between user space and kernel; here we have to pay for those, and for the syscall. But you're not calling poll() for every single packet, you call it for a batch of packets; in our examples it's 64, but you can tune it. The basic poll() support is already in the kernel, but the part where the call executes all the way down to NAPI is not in the 4.20 kernel; I'm working on that and will send an RFC out in a few weeks. It seems like that's what people want to use.

How does that look performance-wise? Of course it drops: we're using just one core and we have to do mode switches and syscalls. What you see is that rxdrop drops from 39 down to 30, txpush from 68 to 51, and L2 forward from 22 to 16. So we have a drop of 20-30%, but we're only using one core now instead of two. If you look at it on a per-core basis, we're actually performing better.
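To illustrate the poll()-driven mode, a receive loop over the AF_XDP RX ring might look roughly like this, using the libbpf xsk helpers that were being upstreamed around this time (xsk and rx are assumed to be an already-created socket and its RX ring; refilling the fill ring is omitted):

```c
/* Sketch of a poll()-driven AF_XDP receive loop. */
#include <poll.h>
#include <bpf/xsk.h>

static void rx_loop(struct xsk_socket *xsk, struct xsk_ring_cons *rx)
{
	struct pollfd fds = {
		.fd     = xsk_socket__fd(xsk),
		.events = POLLIN,
	};
	__u32 idx;

	for (;;) {
		/* Block until descriptors are ready; with the in-progress
		 * busy-poll support this same call drives the NAPI context
		 * from the application core. */
		if (poll(&fds, 1, -1) <= 0)
			continue;

		unsigned int rcvd = xsk_ring_cons__peek(rx, 64, &idx);
		for (unsigned int i = 0; i < rcvd; i++) {
			const struct xdp_desc *desc =
				xsk_ring_cons__rx_desc(rx, idx + i);
			/* desc->addr / desc->len point into the umem;
			 * process the frame here. */
			(void)desc;
		}
		xsk_ring_cons__release(rx, rcvd);
		/* Consumed frames must be handed back via the fill ring;
		 * omitted here for brevity. */
	}
}
```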
And because it's a single core, you can now use two cores to run two of these, and the numbers roughly double: 60 million packets per second for rxdrop, 102 million packets per second for txpush, and so on. Looked at like that, this mode actually performs better.

But the key question now is: how do we compare to DPDK? Because DPDK is the benchmark for how fast you can go. In this comparison there are four bars for each benchmark. Furthest to the left you have the run-to-completion mode of AF_XDP, where we use two cores. The green one is the poll() syscall mode: a single core, with a syscall into the kernel. Then you have DPDK with a scalar driver, the same kind of driver we use in Linux. And furthest to the right, the yellow or orange one, is DPDK with the vectorized drivers; we don't have any vectorized drivers in the kernel, at least not yet. If we ignore the vectorized driver for a moment and just look at the others, we can see we're still lagging somewhat behind on the RX path: maybe a 40% drop if I compare the poll() syscall with DPDK, and DPDK here is only using a single core, so it should be compared against the poll() syscall. On TX it looks better: run-to-completion mode is actually faster than the DPDK scalar driver, and the poll() syscall, which uses as many cores as DPDK, is within maybe 20%, so around 80% of DPDK's speed there. The results are similar for L2 forward, a drop of 10-20%. You can also see that the vectorized driver, at least in these very simple benchmarks, doesn't pay off that much for txpush or the L2 forwarder, but it does give a significant performance improvement on the RX side; I don't know why it's more efficient on RX, Bruce can comment on that, but it's something like a 30% or more performance gain there. So you can argue whether we should start implementing these things in a Linux driver. I don't know; they're complicated to write and hard to maintain, but if they give a significant performance improvement, maybe it's worth it. I think it's a good question.

At least on the TX side it's starting to look good; this is within the bounds of what I'd call reasonable. We're never ever going to be as fast as DPDK, because we do syscalls and we have the user-space/kernel-space boundary and those things, but it's getting there now, I think. On the RX side we need to do some work, and Jesper has some ideas; he and Björn are looking into performance improvements here. So we have other ideas to introduce that are not in these numbers; it's still possible to improve this.

Okay, let's look at some example use cases for AF_XDP. The first, obvious one is to write an AF_XDP PMD for DPDK. AF_XDP is about getting raw packets to user space quickly, and if you have something up there, for example a user-space stack, you're probably going to use DPDK, so this is the most obvious one. We actually have an RFC out on the DPDK mailing list for an AF_XDP PMD driver, and it has about 1% overhead compared to just running AF_XDP directly; actually less than 1%, which I think is good.
The advantage here is that if you have a stream of packets coming into your system, and some of it needs to go to the kernel stack while some of it should go to user space for processing, XDP is a great way of solving that: it can divert some traffic to the kernel stack and some traffic up to DPDK in user space, already in the driver. It also gives you a hardware-independent application binary: if you only use the AF_XDP PMD, it will work on all drivers supporting AF_XDP, whereas a user-space DPDK driver of course won't work on the next generation of hardware, because there's no driver for it. It also provides isolation and robustness: you can restart the application, and you can use all the security features of Linux, because it doesn't rely on physically contiguous memory and things like that; AF_XDP is just another socket. And we think it's going to be a good fit for the cloud-native space too: because AF_XDP is just a socket, an OS abstraction, it works really well with processes and containers, since they're basically the same thing, and you get fewer setup restrictions. So we think there's a strong use case for an AF_XDP PMD in DPDK.

The next one is VPP, which is a very popular stack from Cisco in the FD.io project. Yes, VPP supports DPDK, so you could take this AF_XDP PMD and just run VPP on top. But you could also write a native AF_XDP driver in VPP; there's an AVF driver in VPP, there's an AF_PACKET driver in VPP, so you could do the same, because VPP doesn't actually use that much of DPDK, basically just the drivers, and implements the rest itself. That would be more efficient; I don't know how much, probably not that much, but it would be a lot simpler: much less code and an easier setup if you used AF_XDP directly. But nobody has tried this. Is anyone here working with VPP? It should be easy to hack this up, so please do, it would be fun to see, and to find out if we're missing anything, some functionality we don't have in AF_XDP.

And the last one: AF_XDP integration with Snabb Switch. This is an idea; it's Luke's, and he's here, no? So he can't comment on it. You could maybe use AF_XDP in Snabb, but the question is what functionality is missing. The nice thing about building something in Snabb on top of AF_XDP is that it would work on all the drivers supporting AF_XDP, instead of having to write a driver for every single NIC as today; it kind of becomes a hardware abstraction interface in this case. But there are some things Snabb uses that we don't have at the moment, so the question is what we have to add to AF_XDP to support Snabb. I don't think Luke can answer that now; maybe he can later.

Okay, some ongoing work: what are we working on right now? Of course we're upstreaming these performance optimizations one by one; some of the simpler ones are already upstream, and we're working on the more complicated ones now, like the poll() syscall support. And something Björn and Jesper are working on is XDP programs per queue. Right now, when you install an XDP program, it's per netdev, so it covers all queues; what they're working on is letting you install an XDP program on a single queue.
That's going to be a big performance boost on the RX path for us, and it's also going to make a lot of things simpler. I don't know how long it will take; it's not trivial, I guess, but we'll see.

Another thing we noticed when we got this out was that AF_XDP was too hard to use, because it required you to write an XDP program, compile it with the clang compiler, load it into the kernel; lots of stuff you had to do, a big headache just to get going. Sockets should be easy: you create a socket, you bind it, off you go. So what I did, and there's a patch for it, it's at V3 on the list and will probably become a V4, is to include helpers in libbpf that make it really, really simple to set up these sockets. Basically you only call two things: you create a umem and you create a socket, and off you go, and there are helper functions for everything. What people used to do was take our sample application, which was a pretty rough sample, and cut and paste from it into their programs. Now we have this library, libbpf, with well-optimized functions for these things instead, and it makes the sample program much cleaner and nicer; now it's only the application. [Jesper: you also include the BPF program itself, right?] Yes, so you don't have to go and compile that; you only need GCC. There's a small array of BPF instructions that is loaded for you, so everything is loaded under the hood and you don't have to care about it. Of course you can still load your own XDP program if you want to, and I hinted at that earlier; the library facilitates that option.

Another thing: when I showed the AF_PACKET performance numbers, note that AF_XDP and AF_PACKET don't have the same functionality level; AF_PACKET has more. One thing we're missing in AF_XDP is packet clone: with AF_PACKET, the original packet goes to the kernel and you get a copy of the packet in user space. We don't have that functionality in XDP at this point, so what we would like is to add it to XDP so you can clone a packet and send the copy up. (Time's up? Oh, five minutes left, great.) This is something you could then use with libpcap, with Wireshark and tcpdump, which is nice, because suddenly you can sniff a 40 gigabit per second interface with nearly a single core; not really a single core, but with two cores you could do it. And something else we want to start on, together with other people, is adding metadata support to AF_XDP; you need that to carry things like the timestamps that AF_PACKET has.

Okay, to summarize: XDP, the eXpress Data Path, is the new Linux kernel fast path, and AF_XDP is just getting packets to user space from XDP, unmodified. We're trying to hit DPDK-like speeds; we're never going to be as fast, but if we're at 80% of that, great. And really, both of these are building blocks for solutions; they're not ready-made solutions in themselves, you have to build stuff on top of them or around them. And there are many interesting upcoming use cases, like the OVS and bridge ones we talked about.
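As a footnote on those libbpf helpers, a rough sketch of how small the setup becomes (API names as in the libbpf patches of the time, so details may differ; error handling is mostly omitted):

```c
/* Sketch of AF_XDP socket setup via the libbpf xsk helpers: one call to
 * create the umem (the shared packet buffer area) and one to create the
 * socket bound to a NIC queue. libbpf loads a built-in XDP program and
 * populates the XSKMAP under the hood. */
#include <stdlib.h>
#include <unistd.h>
#include <bpf/xsk.h>

#define NUM_FRAMES 4096

struct xsk_state {
	struct xsk_umem   *umem;
	struct xsk_socket *xsk;
	struct xsk_ring_prod fq, tx;	/* fill + TX rings */
	struct xsk_ring_cons cq, rx;	/* completion + RX rings */
	void *bufs;
};

static int setup_xsk(struct xsk_state *s, const char *ifname, __u32 queue_id)
{
	size_t size = NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE;
	int err;

	s->bufs = aligned_alloc(getpagesize(), size);
	if (!s->bufs)
		return -1;

	/* 1. Create the umem backed by the buffer area above. */
	err = xsk_umem__create(&s->umem, s->bufs, size, &s->fq, &s->cq, NULL);
	if (err)
		return err;

	/* 2. Create the socket and bind it to one queue of the interface. */
	return xsk_socket__create(&s->xsk, ifname, queue_id, s->umem,
				  &s->rx, &s->tx, NULL);
}
```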
And note that if you have OVS in XDP, AF_XDP comes after that, which is great: you get the traffic after OVS, or after the bridge, or after your IP processing or whatever. So it's exciting; with the help of XDP you can get more "cooked" traffic into AF_XDP. So yeah, come join the fun. Questions?

[Audience: you talked about integration with DPDK and VPP; what about integration with languages?] In terms of? [Audience: languages, like Rust.] Oh yeah. I saw that somebody added Go language support for AF_XDP; I haven't seen Rust, and we're not doing anything there ourselves, but it seems people are starting to add support. To repeat the question: language support for AF_XDP. We saw somebody from Google adding AF_XDP support to Go, and the question was about Rust; I haven't seen that, but hopefully somebody's working on it.

[Audience: have you considered any other model, like a doorbell from user space?] So the question was about poll(): it's an overhead, yes, because it's a syscall, but it's a simple model and people know how it works, so that's why we started with it. We want the AF_XDP sockets to look like sockets and feel like sockets, because that's simple. But you're right, you could have something like a doorbell function, especially if you have hardware underneath that understands doorbells, and that would be more efficient. We have thought about that direction, but there's nothing concrete. Sorry. Other questions?

[Audience question about which NICs support this.] So in AF_XDP there are three different modes you can run in. You can run in what's called XDP SKB mode, which works on any NIC, any virtual NIC too; we call it generic XDP. It works by actually allocating the SKB and converting the packet into something that is contiguous memory, so it works everywhere, but of course you lose performance; even then, the performance of that mode is still about three times that of AF_PACKET. Then you have what we call XDP driver mode with copy, which you get if XDP support has been added to your driver, as a lot of people in this room have done; that speeds up RX to about 10-12 million packets per second on our hardware, but TX doesn't have any driver support in that mode. And the third one is zero-copy support, which you have to add to the driver, and we're trying to make that simpler; I don't know, was it 700 lines of code, 800? Let's say 1,000 lines of code for our driver, so it's not something you just do in a day, unfortunately. The important part is that the user-space side looks the same in all modes; that's abstracted away from you. Today you have to choose Intel, actually, to get zero-copy support; or Mellanox? Mellanox too? Yes. And for XDP driver support you can pick a lot of vendors, Netronome and others; Netronome also works, right? Yeah. But for the zero-copy support, that's only in the Intel drivers so far. I want more people to support it; I hope it's not only us in the end. And actually, yes, there's an implementation on ARM as well, for zero copy. So, there's another question.
[Audience question about XDP support for Wi-Fi.] Oh, there's actually a person in the audience who tried to implement that; he's right there. He tried to implement it, and it got rejected upstream on the grounds that XDP right now works on the raw Ethernet frame, and with Wi-Fi the frame could start in different ways. So it is sort of possible, but we would have to introduce different hooks in the Wi-Fi path to determine what type of packet is coming in, and we'd need some if-statements in there, and we are counting nanoseconds here, so we didn't want to introduce anything that slows down performance just to support Wi-Fi. I think if we want to support Wi-Fi, it will be in another XDP mode, called Wi-Fi XDP or something.

That's all we have time for. Thank you very much. Thank you.