Hello, do you hear me? Okay, good. I'm a bit early, but let's get going. I'm surprised to see so many people; you wanted to see this sort of niche topic. And I see there are some real-time people here as well, so they'll probably bash me for the title. We'll see.

I figured I'd start right away with a teaser, you know, everybody loves benchmarking. Depending on what kind of person you are, you might be scared off, so now is the time to leave. So here's a test where I'm blasting a host with 50 million packets per second, and we have two different mechanisms for receiving the packets, in three different tests. In the first test we receive the packet and we drop it. In the middle one we actually generate packets from the host and try to send as many as possible out. And in the third we do a simple MAC swap: we receive the packet, we swap the MAC addresses, and send it back out. The light blue here is AF_PACKET, which has been around in the kernel since forever; I guess Jonathan can probably tell us when it was introduced, but it's been around forever. It's typically what's used for tcpdump, for example. It's the mechanism for receiving raw frames in the Linux kernel. As of 4.18 a new mechanism came up, which is called the AF_XDP socket, and that's what I'll be addressing here. As you can see, it's a nice benchmark, at least if you're coming from XDP land: AF_PACKET roughly maxes out at about one million packets per second, whereas AF_XDP sockets have much better performance.

All right, so before we can actually start we need some words from our sponsors, the legal department. You're not supposed to read all of it, but I really like the second bullet there: no computer system can be absolutely secure. And you know, being from Intel, that ties back to what Greg said this morning. All right, so a little bit about me.
So my name is Björn and I'm working for this big chip vendor, mostly on Linux networking things on the Intel side. I maintain the AF_XDP socket code together with my good friend and colleague Magnus. Magnus, can you please wave? And on top of that I also maintain the RISC-V BPF JIT.

All right, so AF_XDP sockets: what is it and why should you care? Before I actually get into what an AF_XDP socket is, we need to do some TLAs, that's three-letter acronyms. So, a show of hands: how many people don't know what BPF is? Okay, that's good. I'll just do the whirlwind tour. BPF stands for Berkeley Packet Filter, and it's a mechanism that lets a user insert snippets of code into various hooks in the kernel. You can use it for tracing, and you can use it for extending the kernel in a safe way. The biggest selling feature of BPF is that the snippet of code you insert cannot crash the kernel; that's sort of the killer feature.

Way back at the Netdev conference a couple of years back, Tom Herbert and Alexei Starovoitov had an idea: what if we add a BPF hook at the earliest place possible in the receive path? And where's that? Well, that's right after the DMA has been completed. They named this hook XDP, which stands for the eXpress Data Path. So there's that. Okay, we have this receive hook where we insert a BPF program. So what can this BPF program do? Well, first, after you receive a packet you can modify the packet, okay?
You can drop the packet. You can take the packet and send it back out on the same interface again. You can do a redirect, which means you receive a packet, take it, and send it back out on a different network interface card. You can take the packet and pass it to the kernel on a different core, like "hey, please process this packet on a different core". And finally, you can actually take the packet and pass it out to a socket, and that's the AF_XDP socket. So it's a way of getting packets really fast out of the XDP program.

All right, so this is the picture of how it all fits together. At the bottom you have the network interface card, and you see the XDP hook there; everything in dark blue is related to BPF. I don't have a pointer, but you see the BPF maps. A BPF map is a structure that can be shared between userland and the kernel, to pass information back and forth between BPF and the user. If you take the regular flow there: the NIC receives the packets, they go to the XDP program, and you can make various decisions. Again, you can drop a packet, or you can redirect to the AF_XDP socket, which you can see right on top of the XDP program. You can pass it to the regular networking stack, which is done by something called XDP_PASS: you receive a packet and pass it directly to the stack. When a packet actually enters the regular stack, the kernel creates a structure called the sk_buff, or socket buffer, or just SKB for short. That's the kernel representation of a packet. So when you have a networking packet in the kernel, it's an SKB, unless it's in the XDP world, because there we don't have an SKB yet. Makes sense? And as you can see, this is part of where the performance comes from: there's a whole lot more code when you go up through AF_PACKET compared to AF_XDP. Clear? Good.

Right, so how do you actually use these things?
On the left you have a regular socket, in this case for example a UDP socket. You start off by creating a socket, you bind it to some address, and then you start receiving and sending packets. Typically each receive and each send is a system call, and on top of that you need to copy data from the kernel side to userland. And again, as Greg pointed out, system calls have become much more expensive nowadays with all these mitigations around.

On the right hand side we have the AF_XDP socket program, which as you can see is much more code, so it's much more complicated to use. This is true in general: AF_PACKET is typically easy to use but not as fast, which is a shame, but we'll get back to that later. You start off by creating a socket of type AF_XDP. Then you allocate something that we call the packet buffer; that's just an area where you want the packets to be placed. This packet area is divided into fixed-size chunks, so for example when you register the packet area later, you state that this packet area is divided into chunks of, say, 4K. That means you can't receive packets larger than 4K. Then you register it with the kernel, and this is the first optimization: you don't have to do copies anymore, because you let the kernel know that this is my area.
That's where the kernel will actually place the data. Third, you create a couple of rings. I'll get back to how these rings look and how you use them, but the idea is that instead of using explicit system calls, you use rings to pass events back and forth, or rather to pass ownership of buffers. In these rings, and this is also a difference between regular sockets and AF_PACKET, you actually pass descriptors to the data; you don't pass the actual data. I'll get into the details of what these descriptors look like later. These rings are shared between the kernel and the user, so they're a shared structure.

Finally, you do a bind call, and the bind call is kind of special: instead of an address you have a device, and then a queue. Most, I would say all, contemporary NICs have multiple receive and send queues, and typically on Linux the default setup is one receive queue and one send queue per core. For AF_XDP you have to select a certain hardware queue, so for example if your NIC has, or is configured to have, four receive queues, you will need to create four AF_XDP sockets. And then finally you start receiving events from the rings and sending events to the rings. So the question is: do we need any system calls at all? Well, the truth is we need some, but I'll get back to that later.

And again, this is a lot of boilerplate code, so what we've done is try to make it easier to use by hiding it in a library, libbpf. libbpf is part of the kernel tree as well, so in my opinion: don't write this raw code, use the library instead.

All right, so this is what the XDP descriptor looks like. It's really simple.
It's an address, a length, and a set of options. Let's start with the options: they're not used yet. We have some ideas to use them for chaining packets, because as I said earlier the chunk size is fixed, so you can only receive packets smaller than the chunk size. The address is not a userland pointer; it's actually the offset into the packet buffer. So for example, the first chunk in your buffer array is at zero, and zero is not a null pointer, it's a valid offset. The length, again, is the length of the actual packet. If the data within a chunk is less than a full chunk, say 64 bytes, the length will be 64. As you can see at the top there, that's where you find the UAPI as well: it's in linux/if_xdp.h.

Okay, so here are the rings. For each AF_XDP socket there are at least two rings, but typically there are four of them. First there's the receive ring, the RX ring, and the transmit ring, the TX ring. Corresponding to each of those rings there are a couple more, which we'll get back to later: there's a fill ring and a completion ring.

So let's start with receiving a packet on the RX ring. In this case the userland process is the consumer and the kernel is the producer. The user will read off, or pick, XDP descriptors from the ring and then bump the consumer pointer. And what are we using the fill ring for? Well, remember that we registered the memory with the kernel, and the fill ring is the mechanism used to pass ownership of chunks from userland to the kernel. As you can see, that's not a complete descriptor; it's just the address field, because we don't need to pass the length when we hand a chunk to the kernel. We just need to say: use this chunk. Makes sense? Okay, good.

The completion ring works the same way; it's sort of the upside-down version of the RX side. Instead of the kernel being the producer,
it's the userspace application that is the producer. So we pass descriptors into the ring, we bump the producer pointer, the head pointer, and they will be consumed by the kernel. When the kernel has sent out the frame, it returns the chunk back to the userland application via the completion ring. So one way of looking at these rings is as passing ownership back and forth with the kernel without doing system calls.

All right, so this all sounds good, but do we need system calls? Well, yes, is the short answer. Since we're using shared rings between the kernel and userland, the kernel needs to poll these rings to get the events. One could imagine a kernel thread running, trying to pick up events as fast as possible, but in reality that's wasteful of resources. So instead we have a flag in the rings that the user needs to check; it's called need_wakeup. You check this flag, and the ring sort of signals to the user application: hey, you need to wake the kernel up, and you wake it up with a system call, for example sendmsg or the poll system call. When you're receiving packets, the kernel will be woken up asynchronously, so you won't need any system calls. But when you're actually sending, and not receiving, you need to explicitly tell the kernel: wake up and empty the ring. Also, and I'll get back to this later, if the user for example forgets to add entries to the fill ring, the receive path will be starved at some point, and if it's starved, what's the point of polling? So instead the kernel sets the need_wakeup flag, and the kernel thread doing the reception goes to sleep. And this sort of ties into another big difference between these kinds of sockets and a usual socket.
With AF_XDP sockets, the kernel allocates frames from the userland-provided buffers, whereas with a regular socket the kernel would use the page allocator, allocate a frame, put it on the NIC hardware ring, and receive into that buffer. Here instead we actually allocate from userland. So it's really important that, compared to other applications, the userland application is well-behaved. For example, on the receive path you need to fill the fill ring in order to receive packets, and you also need to empty the RX ring: if that ring is full, you won't get any packets in. It's similar on the transmit side: if you place a lot of packets on the send ring but you don't empty the completion ring, well, you won't be sending any packets.

All right, so some pointers to where you can find more information. First there's the documentation, the top one. Then there's a sample application in samples/bpf, and also the library I talked about, which hides most of the messy details; it's called libbpf, and you can find it there. I think there's a standalone version of libbpf that Facebook is maintaining, which is good to have: they typically pick stable releases of libbpf and package them, so if you don't want to deal with the whole kernel tree, that's one way of doing it.

All right, so say that you want to implement support for this in your own driver. How many here, show of hands, are driver developers? Okay, so that's like one third. So what do you need, say that you have a driver that supports regular Linux networking? Because as you remember from the first diagram, you need explicit support for XDP, so you need to make modifications to the driver; that's sort of the cost of the extra performance. Before we get into that: what is needed from a driver perspective?
So again, let's say that you have a driver with basic Linux support, a driver in the kernel tree. The first thing is that these rings I talked about are lockless, single-producer/single-consumer rings, so when we're producing to these rings we need to make sure there's only one producer; they need to be synchronized. What most NICs that implement XDP do is rely on something called NAPI. NAPI is an abstraction that the networking device drivers use to implement the bottom half after an interrupt: you receive an interrupt, and the deferred work is built on top of softirqs. Softirqs are guaranteed to run only once per core. So if you have a really old driver that's not using NAPI, you need to make sure yourself that you only have one writer to the rings. I think most drivers in Linux now are based on NAPI, but there might be some holdouts around.

So that's the first thing. Now, XDP is, again, a hook in the driver, and this was a grandiose idea, but in reality adding support for XDP has taken a really long time. It's like four years old, and still the number of drivers that actually support XDP is very small. So what the community decided on was adding something called a generic, or fallback, mode for XDP. This adds an XDP hook right after the SKB has been allocated. It's a slow version of XDP, but it was added to make sure that people can start playing with XDP, which would in turn drive the driver support.

So again, for the best performance, first you need to add proper XDP support to your driver. If you do that, you get better performance, but you still won't get the best performance, because you're not doing zero-copy.
So regular XP support still allocates frames from the page allocator and in order to Get zero cup support that is the the frame that would pass from use of land It's placed directly on the ring hardware buffer Then we need additional supported drivers. So there's like a Two-phase first add the XP support Secondly add Zero cup support all right, so this only addresses the zero cup support so When you start off by adding is the support you have the The first two NDOs so that's called backs into the Driver implementation the first one called NDO BPF. That's the one that sort of registers the program or the XP program to your driver Right the XP transmission transmit function. That's used for redirect redirect to other interfaces So that's sort of if you implement the XP you have support for that if you want to add zero cup support you need the one additional Callback and that's to wake up the kernel which I talked about earlier with a system call So you we need a mechanism to wake up the kernel What else? Yeah, so we add the serial cops support by adding a Adding a structure to the driver called the UMEM and the UMEM is the abstraction from the AF XP layer that Contains the packet buffer and the fill and completion ring So that is that use this UMEM structure and instead of using the page allocator you allocate from the UMEM And also it exposes the rings that you can use to Complete the transmission Descent packets and also to Allocate from the field ring. 
All right, so here, as you can see, there are a lot of Intel drivers. The most recent one is the ice driver, which is the 100 Gig driver; the first one was the 40 Gig driver, and ixgbe is the old 10 Gig one. Mellanox also has support for their newer NICs, and I think Broadcom is working on it, or they are, but we'll see when it actually arrives.

I thought I'd finish up with some benchmarks. The setup is as follows. I use a fairly new kernel from the bpf-next tree, which is a pre-5.5 kernel. I have a fairly new Skylake machine that's pinned to 3 GHz, and a 40 Gig NIC. I have a packet generator that just blasts packets at the host, and I'm just using one receive queue and one send queue from the hardware. I annotated the scenarios with "two cores" and "one core": two cores means that the software queue, the kernel-side processing, is running on a different core than the userland application, whereas with one core everything, kernel and userland, runs on the same core. And the latency measurement is kind of naive, actually, to be honest: what I've been doing is just adding instrumentation, sending the packet, having it looped back to the packet generator, and measuring the end-to-end latency.

Right, so this is the first one that you saw at the beginning. As you can see, AF_PACKET really maxes out at roughly one million packets. Just a note: when we started this work, all the Meltdown and Spectre stuff wasn't around, but we're really getting hurt by the Spectre v2 mitigations.
I think on the receive side we were close to 10 million packets before, so we need to work our way back there again. So this is two cores, and here is one core, for the drop case. Interestingly, the AF_PACKET MAC-swap completely dies for some reason, just, you know, 9K packets there. But still, if you compare two cores to one core, the performance per core is still better with one core. Also, interestingly, you can see that TX-only maxes out at 21 million packets, and that's because of hardware limitations: we can't get more than 21 million packets out there. If we add an additional socket, we can probably get more packets out.

All right, so this is the end-to-end latency. Again, what I did is just instrument a packet and loop it back with a MAC swap. The latency is in microseconds, and it's really good here, but then again, this is with really few system calls. In terms of latency we're really good compared to AF_PACKET, and if we actually dial up the packet rate to the firehose scenario, AF_PACKET goes from microseconds to milliseconds, whereas the worst case for AF_XDP sockets is still single-digit milliseconds, which I think is still pretty good.

That's about it, actually. As usual with these kinds of things there are a lot of people involved, so thanks to all these people helping out with writing code and testing stuff. I figured it's a bit early, but if you have questions, I'll be happy to answer them. There's a mic up here. None? Okay, thanks for listening.