Hi, we're going to present XDP and the page_pool allocator, and how you can easily add XDP support to a new driver or convert an existing Linux driver by using an internal API. A word on why we're the ones talking: I'm Ilias, the technical lead for the Linaro Edge and Fog Networking department, I'm serving as a co-maintainer of the page_pool API at the moment, and I've added XDP support to a kernel driver. And Lorenzo is a software engineer at Red Hat; he maintains a wireless driver, mt76, and he's added XDP support on the ESPRESSObin board, which uses the Marvell mvneta driver.

So, does anyone know what XDP is? All right, good, let's go a bit faster on this one. It's a software offload path inside the kernel. At the driver level we add some hooks on the RX path, and by using the page_pool API for memory allocation we don't have to keep allocating and freeing memory while we process packets; we just have to sync the buffers in the correct DMA direction for the CPU and the network interface to pick them up. XDP was initially designed to operate on layers 2 and 3, while the Linux kernel stack operates on layers 2 through 7 but is mostly optimized around layer 4.

There are two reasons we get better performance with XDP. The first is that, in most cases, we manage to recycle the memory we're using. The second is that we skip all the kernel paths we don't really want, like iptables, the tc ingress hook, route lookups and so on. It's important to keep in mind that this is not a kernel bypass. One of its functionalities can act as a bypass, since you can dump packets directly to user space, but XDP itself is an internal fast path, and we'll elaborate on this a bit more. It uses existing kernel APIs and existing kernel functionality, and you can program, through BPF, which packets and what kind of packets you want to process. It's currently being used by Facebook and Cloudflare for load balancing and DDoS protection.

So this is pretty much what the architecture looks like. The XDP box you see in the driver is actually the BPF program doing the decision making. If it returns XDP_PASS, which is one of the actions you have in XDP, you send the packet on to the Linux networking stack. If it's XDP_TX, you send the packet back out of the interface it came from, after changing the header, the source MAC address, the IP, or anything else you want to change in the packet. XDP_REDIRECT sends the packet over to user space, to a remote CPU, or wherever else you decide. And there's ndo_xdp_xmit, with which you can pick up the packet the moment it arrives on your network interface and hand it off to a different network card without having to go through the kernel network stack.
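To make those verdicts a bit more concrete, here is a minimal sketch of the kind of eBPF program the driver hook runs. It is not from the talk: the drop-UDP policy and all the names in it are invented for illustration, only the verdict values themselves are the real ones. It would be compiled with clang for the BPF target and attached to the interface with the usual tooling.

	// Minimal XDP program sketch: drop UDP, pass everything else.
	// Hypothetical example, not taken from the talk or from mvneta.
	#include <linux/bpf.h>
	#include <linux/if_ether.h>
	#include <linux/ip.h>
	#include <linux/in.h>
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_endian.h>

	SEC("xdp")
	int xdp_drop_udp(struct xdp_md *ctx)
	{
		void *data     = (void *)(long)ctx->data;
		void *data_end = (void *)(long)ctx->data_end;
		struct ethhdr *eth = data;
		struct iphdr *iph;

		/* BPF direct packet access: the verifier insists on explicit
		 * bounds checks against data_end before every read. */
		if ((void *)(eth + 1) > data_end)
			return XDP_PASS;
		if (eth->h_proto != bpf_htons(ETH_P_IP))
			return XDP_PASS;        /* hand it to the normal stack */

		iph = (void *)(eth + 1);
		if ((void *)(iph + 1) > data_end)
			return XDP_PASS;
		if (iph->protocol == IPPROTO_UDP)
			return XDP_DROP;        /* drop as early as possible */

		/* Other possible verdicts:
		 *   XDP_TX       - bounce the frame back out of the same interface
		 *   XDP_REDIRECT - send it to another device, CPU or AF_XDP socket
		 */
		return XDP_PASS;
	}

	char _license[] SEC("license") = "GPL";

The bounds checks are what the talk means by BPF direct access and correctness: the verifier rejects the program if any packet read is not provably inside the [data, data_end) window, which is also why the packet has to sit in one contiguous buffer.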
Now, the reason we created the page_pool API is that the memory model for this whole approach is a bit unusual. We require packets to be in contiguous physical memory. That is not a requirement coming from us; it comes from BPF direct packet access and the verifier checking correctness, and at the moment you can't have one packet split across multiple physical pages. So you're limited to non-jumbo frames, and by that we don't mean 1500-byte packets: anything up to a page size can be accommodated in an XDP frame. The problem is that you cannot allocate the memory with the helpers we already have in the kernel, like napi_alloc_frag(), which allocates fragments for your data and is faster because it works out of a cache. You really have to allocate a whole page, and you have to account for the headroom and tailroom needed for BPF, and for whatever you need for the SKB.

Now, as we discussed, the buffers must be recycled in order to get the speed. So the page_pool allocator is optimized for one packet per page. There are use cases where people split the page and fit multiple packets into it, but then you can't recycle with the page_pool recycling functions; you have to do your own recycling in that case. Native packet recycling mostly happens in NAPI context, which is really fast because there is no extra locking overhead: you're already protected by the NAPI context. The API also offers DMA management capabilities, meaning it can map your buffers and sync them in the right direction, and there are improvements from Lorenzo that speed this up even more.

Now, this is not all perfect. If you switch a driver from the NAPI fragment allocators the kernel normally uses for SKBs over to page_pool for XDP, the normal network stack path is going to slow down, because allocating a page is substantially slower than allocating a fragment. But if you use it for XDP, you get all the performance improvements that come with recycling packets. The memory footprint is bigger, because instead of allocating just the size you need for the packet you allocate a whole page and fit the packet wherever you want inside it. We do have some out-of-tree patches to win back some of that penalty: with those patches we manage to recycle buffers even if they are headed for the normal network stack, so if a buffer ends up in an SKB, we eventually recycle it as well.

So, Ilias has gone through the XDP requirements and some more general information about XDP, and I will give some more details about how to implement XDP in an Ethernet driver. I used the mvneta Marvell 1-gigabit driver as a reference since, for example, the Intel or the Mellanox implementations are much more complex. We need to take into account that, in order to be accepted into the Linux kernel, our driver needs to implement all the possible XDP verdicts: XDP_DROP, XDP_TX, XDP_PASS, and XDP_REDIRECT. Here I've reported some hardware specifications of the Marvell ESPRESSObin, the development board I used to add XDP support to the mvneta driver. The ESPRESSObin runs a Cortex-A53 and, for networking, has two gigabit Ethernet LAN ports and one WAN port, all of them connected together through an Ethernet DSA hardware switch.

This diagram outlines the lifecycle of a buffer with the page_pool allocator. The page_pool is usually created when an interface is opened, since it is associated with a given RX queue in order to avoid locking penalties. From there, it's possible to rely on the page_pool API for the DMA mapping and the DMA syncing, using these flags. What is important to notice on this slide is that when the NAPI poll runs, it runs an eBPF program attached to our network interface. The eBPF program returns an XDP verdict, a result, and the buffer is then recycled according to that result. The page_pool allocator keeps two caches: an in-irq cache, used when the driver is running in interrupt (NAPI) context and we hold a single reference to the buffer, and a pointer-ring cache, used when recycling happens outside that context, again with a single reference to the buffer.
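As a rough illustration of the setup and refill path just described, a driver might wire up a page_pool per RX queue roughly as below. This is a sketch, not the mvneta code: struct my_rx_queue, the pool size and the helper names are made up, and the page_pool header layout has changed in newer kernels; only page_pool_create(), page_pool_dev_alloc_pages(), page_pool_get_dma_addr() and the PP_FLAG_DMA_MAP flag are the upstream API as it looked around the time of this talk.

	/* Hypothetical per-RX-queue state; a real driver keeps much more here. */
	#include <linux/bpf.h>           /* XDP_PACKET_HEADROOM */
	#include <linux/dma-mapping.h>
	#include <linux/err.h>
	#include <linux/numa.h>
	#include <net/page_pool.h>       /* split into page_pool/*.h in newer kernels */
	#include <net/xdp.h>

	struct my_rx_queue {
		struct page_pool *page_pool;
		struct xdp_rxq_info xdp_rxq;
		struct net_device *dev;
	};

	static int my_rxq_create_page_pool(struct my_rx_queue *rxq, struct device *dev)
	{
		struct page_pool_params pp_params = {
			.flags     = PP_FLAG_DMA_MAP,   /* let the API do the DMA mapping */
			.order     = 0,                 /* one page per packet */
			.pool_size = 256,               /* roughly the RX ring size */
			.nid       = NUMA_NO_NODE,
			.dev       = dev,
			.dma_dir   = DMA_FROM_DEVICE,
		};

		rxq->page_pool = page_pool_create(&pp_params);
		return IS_ERR(rxq->page_pool) ? PTR_ERR(rxq->page_pool) : 0;
	}

	/* Refill one RX descriptor: hits the in-irq or pointer-ring cache when
	 * recycled pages are available, and falls back to the page allocator
	 * only when both caches are empty. */
	static dma_addr_t my_rxq_refill_one(struct my_rx_queue *rxq, struct page **pagep)
	{
		struct page *page = page_pool_dev_alloc_pages(rxq->page_pool);

		if (!page)
			return 0;
		*pagep = page;
		/* Packet data starts after the XDP headroom; the tail must also
		 * leave room for skb_shared_info so XDP_PASS can use build_skb(). */
		return page_pool_get_dma_addr(page) + XDP_PACKET_HEADROOM;
	}

The xdp_rxq member would be registered with xdp_rxq_info_reg() and xdp_rxq_info_reg_mem_model() when the queue is set up; that registration is what ties the RX queue to this page_pool for recycling purposes.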
Whenever our driver needs to refill the DMA engine with new buffers, running mvneta_rx_refill() for example, we can hit these caches instead of going through the slower page allocator. Here I've reported the mvneta XDP architecture. Whenever the mvneta poll routine runs, it sets up an xdp_buff, which is the XDP counterpart of an SKB, and mvneta_run_xdp() runs the eBPF program, in the eBPF sandbox, on our xdp_buff and returns one of the XDP verdicts: XDP_PASS, XDP_DROP, XDP_TX or XDP_REDIRECT. The buffer is then managed accordingly. It's important to notice here that the struct xdp_buff is allocated on the stack and not through a kmem_cache, as is done for a classic SKB.

Now let's go through each possible XDP verdict, starting with XDP_DROP. XDP_DROP is returned by our eBPF program when it wants to drop the packet as fast as it can; the typical use case is an anti-DDoS application. Whenever the program returns XDP_DROP, the packet is recycled into the in-irq cache using page_pool_recycle_direct(). Here I've also reported a comparison between a simple program that just returns XDP_DROP and the same functionality implemented with a tc filter and action: with XDP we can drop almost 600 kpps, while with tc we can only drop roughly 180 kpps.

Next is XDP_TX. How does XDP_TX work in the mvneta driver? XDP_TX is used to transmit the packet back out of the interface it was received on; a typical application here is a load balancer. Running mvneta_xdp_xmit_back() re-inserts the packet into the hardware TX DMA ring. It is not necessary, in this case, to DMA map the buffer, since it has already been mapped by the page_pool API; we just need to flush the CPU caches, because the device is not cache-coherent.

Then we have XDP_REDIRECT. XDP_REDIRECT is used to transmit the packet to, for example, a remote interface, a remote CPU, or even to a socket using AF_XDP. The typical use case is, for example, layer 2 forwarding. It's important to notice that, in order to redirect to a remote interface, the remote device must implement the ndo_xdp_xmit function pointer, and here we have the implementation done for mvneta. Notice that in this case it is necessary to DMA map the buffer, since it has actually been received from a remote device.

The last verdict is XDP_PASS. XDP_PASS is used to send the packet to the standard Linux networking stack. In the mvneta implementation we can rely on build_skb(), so there is no need to reallocate the buffer for the packet payload, but when we allocate the buffer through the page_pool API we need to account for the size of the skb_shared_info as well. What we also notice on this slide is that in this particular case we are not able to recycle the buffer yet: whenever we need to refill the DMA engine with new buffers, we have to go through the standard page allocator. But, as Ilias said, this feature is under development.
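Putting the verdicts together, the per-packet RX path described above looks roughly like the sketch below, reusing the hypothetical my_rx_queue from the previous sketch. The mydrv_* helpers are placeholders, the exact way the xdp_buff is initialised differs between kernel versions, and this is not the actual mvneta code; it only mirrors its structure.

	/* Hypothetical RX-path handling of the four XDP verdicts. */
	#include <linux/bpf.h>
	#include <linux/mm.h>            /* page_address() */
	#include <net/page_pool.h>
	#include <net/xdp.h>

	/* Placeholder helpers standing in for the driver-specific paths
	 * discussed above: build_skb() for XDP_PASS, TX ring insertion for
	 * XDP_TX. */
	int mydrv_build_skb(struct my_rx_queue *rxq, struct xdp_buff *xdp);
	int mydrv_xdp_xmit_back(struct my_rx_queue *rxq, struct xdp_buff *xdp);

	static int mydrv_run_xdp(struct my_rx_queue *rxq, struct bpf_prog *prog,
				 struct page *page, int len)
	{
		struct xdp_buff xdp;    /* lives on the stack, not in a kmem_cache */
		u32 act;

		xdp.data_hard_start = page_address(page);
		xdp.data = xdp.data_hard_start + XDP_PACKET_HEADROOM;
		xdp.data_end = xdp.data + len;
		xdp_set_data_meta_invalid(&xdp);
		xdp.rxq = &rxq->xdp_rxq;

		act = bpf_prog_run_xdp(prog, &xdp);
		switch (act) {
		case XDP_PASS:
			return mydrv_build_skb(rxq, &xdp);      /* to the normal stack */
		case XDP_TX:
			return mydrv_xdp_xmit_back(rxq, &xdp);  /* back out the same port */
		case XDP_REDIRECT:
			if (!xdp_do_redirect(rxq->dev, &xdp, prog))
				return 0;   /* flushed at the end of the NAPI poll */
			/* fall through and drop on redirect failure */
		default:
		case XDP_ABORTED:
		case XDP_DROP:
			/* Recycle straight into the in-irq cache, no page allocator. */
			page_pool_recycle_direct(rxq->page_pool, page);
			return 0;
		}
	}

The drop path is where the recycling benefit shows most directly; the XDP_PASS case goes through build_skb() and, as mentioned above, could not be recycled yet at the time of the talk.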
So, in conclusion, we saw some XDP requirements and some basic information about XDP, like the XDP memory model. We saw some basics about the page_pool allocator and how to implement each XDP verdict using this API, and we saw the mvneta implementation as a reference. Future work definitely includes adding support for SKB recycling for the XDP_PASS verdict. And, regarding mvneta specifically, we need to add XDP support for the hardware buffer manager that is available on some devices, like the SolidRun ClearFog, native support for AF_XDP, and some interesting bits that are currently on the XDP roadmap.

Questions? For me or for him? [Audience question.] No, that's Magnus, I don't know if he's in the room. Yeah, he's back there. I can repeat the question: one of the restrictions with AF_XDP is that you couldn't use huge pages when you needed to map memory from user space, right? The answer is that you can do it, but it's not internally optimized for AF_XDP at the moment. [Audience question.] Which interfaces, the virtio interfaces? No, the vhost ones? That depends on the card you're on. No, I think it's software. A software implementation, yes. I've never tried it, actually, and we don't have any intention of working on it at the moment. Thank you.