Hello, good afternoon. I'm Dave, a software engineer at Red Hat. I'm joined by my colleague Mariam, and yes, we're going to be talking to you about, there we go, migrating from DPDK to AF_XDP for high-performance networking in Kubernetes with the help of bpfman. The title was so big it barely fit on the slide. I am a very minor part of the story, so I'm the bit on the end, the "bpfman". Mariam has all of the cool content about the actual AF_XDP/DPDK migration side, but we're going to run through it in reverse order: I'm going to tell you a little bit about bpfman, what it is, why it exists and how it's helpful in this case, and then I'm going to hand over to Mariam to take you the rest of the way.

So, bpfman. It was not always called bpfman. It's a project I started just over two years ago, and it was called bpfd when it started. Some of you may have heard me talk about it before. Anybody? Maybe? Yeah, a few people. Fantastic. The whole idea behind the project came from me starting to play around with eBPF, deploying some programs on plain Linux hosts and then in Kubernetes, and realising it was a real pain. There are a lot of annoying things in the BPF API, and some programs behave differently from others: some will stay attached once the program that loaded them exits, some will disappear. I felt we could do a better job of abstracting some of that complexity away, and initially I thought, hey, we're going to write a daemon; it's going to be like systemd, but for BPF, and it will be amazing. As time passed, and after some conversations with folks in the community, we realised that a daemon might not be the best choice, and also that there are lots of other people loading BPF and we might not always be able to police or look after what they were doing. So in the last six months we've completely rewritten bpfman from the ground up. The name is a contraction of "BPF Manager", with a slight fedora-hat tip to Podman, which trailblazed the way in the container space. We've now decomposed BPF Manager into more of a toolset, a suite of tools that you can use for managing your eBPF in cloud-native environments. We're paving the way for cloud-native eBPF.

What that means in terms of bpfman and the way we're doing this is that we have custom resource definitions (CRDs) for BPF programs inside Kubernetes, and all of those CRDs are served by controllers. We have a CSI plugin that does BPF filesystem (bpffs) provisioning, and everything is packaged up as Kubernetes YAML: you can install it as an operator, or you can install it using Kustomize. There are a myriad of ways of getting this installed on your cluster to help with your BPF program management needs.

Now, where some of the cool stuff comes in, at least as far as I think it's cool: to make this work in Kubernetes, the first problem we had to solve was how to take the cool BPF program I wrote on my machine and get it onto all of the nodes in my cluster. We figured we'd already solved that problem for containers, so why not reuse that same pipeline? So we're able to package BPF bytecode into container images, push that through a registry, and then bpfman is able to pull it down, load it into the kernel, and do all of that good stuff for you.
The neat thing about that is that all of the tooling for a software supply chain now applies equally to BPF programs. We can use tools like Cosign to sign our containers, which gives us signatures for the things you're loading into the kernel, which is kind of neat, as support for signing in BPF land just hasn't matured yet; we're able to do signing today. With a little bit more work, it also allows us to verify where these programs come from. A little piece of work I'm doing on bpfman at the moment is integrating more policy-based verification, so you could say: I would like to trust BPF programs that have been published and signed by somebody with a redhat.com email address, or an isovalent.com email address, or a datadog.com email address, but no other programs. We'll have support for that very, very soon.

Finally, I said that bpfman is now a suite of tools. Aside from loading BPF programs, we're doing some things around the observability of BPF programs as well. We have two other little tools, one of which does BPF metrics export into OpenTelemetry: it grabs interesting metrics from the BPF subsystem about loaded programs, maps and memory usage and exposes them in a way that you can use through your existing observability tools. We also just recently merged a log exporter, which grabs logs from auditd, so it will tell you every time a BPF program has been loaded on the system and give you some context about who loaded that program. Again, you can ingest that into your telemetry and use it in any way you see fit.

In terms of how bpfman handles program loading, you give a very high-level declaration of what you want to do, which is your CRD. In this case it's: hey, bpfman, please deploy this TC BPF program to the network interface eth0. That is the state that gets pushed into Kubernetes, and bpfman's controller will pick it up and figure out that it has work to do. The only part of this that runs privileged is bpfman, so bpfman is the only thing in this environment that needs permission to load BPF programs. It's able to fetch your kernel code from an OCI image, pull it down, verify the signature and load it into the kernel. Then, to make the user-space side unprivileged, we use the BPF filesystem, a virtual filesystem available in Linux: we pin all of your maps onto it so that the user-space code can read those maps without any additional privileges. So if you have a read-only workflow, you can be completely unprivileged inside of Kubernetes, which is something that we really, really wanted to enable.

But this workflow is not for everybody. If you're doing things like compiling programs on demand, or your eBPF use case is a little more advanced, it's not going to work, because you can't always push your bytecode over to a registry. So what we have coming in kernel 6.9 or thereabouts is a new feature called BPF tokens, and what that will allow bpfman to do is act as a token issuer, so we can keep the same architecture. In this case, you want to deploy a pod that uses BPF programs: bpfman will go and provision a BPF filesystem for you, and it can then issue the program with a token that says, hey, you're allowed to make these bpf() syscalls, it's all good, you don't need additional privileges to do that.
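To make the unprivileged read-only workflow a bit more concrete, here is a minimal sketch of a user-space reader for a map that bpfman has pinned on the BPF filesystem. The pin path and the key/value types are hypothetical assumptions for illustration; they depend entirely on how the program and map were actually defined and where bpfman pins them in your deployment.

```c
/* Minimal sketch: unprivileged read of a BPF map pinned on bpffs.
 * The pin path and key/value layout are hypothetical examples. */
#include <stdio.h>
#include <stdint.h>
#include <bpf/bpf.h>

int main(void)
{
    /* Open the pinned map by path; this only needs read access to the
     * bpffs mount, not CAP_BPF or root. */
    int map_fd = bpf_obj_get("/run/bpfman/fs/maps/example/packet_stats");
    if (map_fd < 0) {
        perror("bpf_obj_get");
        return 1;
    }

    uint32_t key = 0;      /* assume a simple array-style counter map */
    uint64_t value = 0;
    if (bpf_map_lookup_elem(map_fd, &key, &value) == 0)
        printf("counter[%u] = %llu\n", key, (unsigned long long)value);
    else
        perror("bpf_map_lookup_elem");

    return 0;
}
```

The point of the sketch is simply that the reading side never calls a privileged load operation; everything it needs is a file on the pinned filesystem.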
Your token is your privilege, and then you can equally run in unprivileged mode, making the bpf() syscalls to load your programs. That's still work in progress; we're figuring out exactly how to make it work, but it will land as soon as we can get our teeth into the kernel patch and figure out how it behaves. So with that, I will hand you over to Mariam to talk about AF_XDP.

Okay, so I guess let's start with: what is AF_XDP? It's an address family that's optimised for high-performance packet processing applications. It promises DPDK-like performance, but with a standard Linux networking interface. It literally attaches a BPF program directly onto the interface; if we want to be really specific, it's an XDP program that's capable of redirecting packets to an AF_XDP socket that sits in user space, and then your application consumes packets from that AF_XDP socket itself.

So you might be wondering: is it supported in DPDK, and what's the story there? AF_XDP has actually been supported in DPDK since its inception; it's available via the AF_XDP PMD. What I really want to highlight is that this PMD uses your standard Linux i40e driver. If you're familiar with DPDK, it's not using vfio-pci or igb_uio; in essence, it's the actual Linux netdev that's being consumed by the AF_XDP PMD. If you want to use it as part of your DPDK application, you simply pass a command-line argument, which is the --vdev argument we see here on the slide. What's really cool, or what I think is cool, is that it gets consumed as part of the core EAL arguments at every single DPDK application's startup. So if you want to transition your DPDK app from its existing PMD to AF_XDP, all you actually need to do is pass this --vdev command-line argument and boom, it's done. No other application changes should be needed from the DPDK point of view, but obviously if you're deploying with Kubernetes there are a few things to take into consideration, and we'll talk about that in a moment.

But I guess you might be wondering: why? Why would I want to do this? Out-of-the-box benefits include observability and portability, and even sustainability: if you want to move from a polling model to something that's interrupt driven, this is something you could leverage. The other thing you get out of the box is that you can use any standard Linux tool for configuring, managing or observing that network interface. And then the higher-investment benefit is that you can use a CUPS, or control and user plane separation, type of architecture for your application, where you essentially split the networking stack: the kernel is your control plane and your DPDK app is your data plane. You could accomplish this by writing something as simple as a netlink agent, for example, that relays routes or other relevant information up into user space for your app. So you get this hybrid networking stack, as I like to call it, and packets get automatically diverted to the right place for processing. If it's control plane traffic, it goes straight to the kernel; if it's data plane traffic, it goes up into user space into your app. Your app only has to worry about processing the traffic that's actually relevant to it, and you avoid the previous model of having to re-inject packets into the kernel from user space.
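To make the --vdev idea from earlier a little more concrete, here is a minimal sketch of the kind of EAL setup involved. The interface name, the extra flags, and the idea of hard-coding the arguments rather than passing them on the command line are illustrative assumptions; they are not the exact string from the slide, and a real deployment would normally take these from the pod's command line or environment.

```c
/* Minimal sketch of pointing a DPDK app at AF_XDP via the EAL --vdev
 * argument. Interface name and vdev parameters are hypothetical. */
#include <stdio.h>
#include <rte_eal.h>
#include <rte_ethdev.h>

int main(void)
{
    /* Equivalent to: ./app --no-pci --vdev=net_af_xdp,iface=eth0 */
    char *eal_args[] = {
        "app",
        "--no-pci",
        "--vdev=net_af_xdp,iface=eth0",   /* hypothetical interface */
    };
    int eal_argc = sizeof(eal_args) / sizeof(eal_args[0]);

    if (rte_eal_init(eal_argc, eal_args) < 0) {
        fprintf(stderr, "EAL init failed\n");
        return 1;
    }

    /* From here on, the AF_XDP-backed netdev shows up as an ethdev port
     * and the application's existing rte_eth_* datapath code runs
     * unchanged. */
    printf("ports available: %u\n", rte_eth_dev_count_avail());

    rte_eal_cleanup();
    return 0;
}
```

This is the whole migration story from the application's point of view: swap the device argument, keep the datapath code.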
Okay, so: provisioning. If you provision a DPDK application or service today, you're probably using the SR-IOV device plugin and CNI, and you could continue to use them. The issue is that your application pod would have to be privileged to use AF_XDP in that context, because it would have to load and unload the BPF programs for the networking interfaces. So if we take a moment and set two provisioning and deployment goals, first, that we want to be able to provision, advertise and manage a set of secondary networking interfaces for pods that want to use AF_XDP, and second, that those pods should be able to run without escalated privileges, then you need to use the AF_XDP device plugin and CNI. Essentially these two components complement each other to achieve the goals listed above.

Just to walk you through a snapshot of a pod being created: this pod is requesting an AF_XDP resource, and we're going to zoom in on the netdev allocation part of it. Kubelet invokes the AF_XDP device plugin with the Allocate device function call. Before the bpfman integration, the AF_XDP device plugin did the loading of the eBPF program, which is the privileged operation. It would then reply back to kubelet saying, yes, add this device to the device spec, but it would also request that a UNIX domain socket (UDS) be mounted into the pod; we'll talk about that in a moment. Then at some stage the AF_XDP CNI gets invoked, and this CNI has a number of things it needs to do, but one of them, alongside your BPF program of course, is programming hardware filters on the NIC itself so that packets are delivered to the right queues. And then at some stage your AF_XDP pod is up and running and your AF_XDP application starts.

Okay, so now we want to create an AF_XDP socket. What do we need? Well, we need a reference to the BPF map that was loaded onto the netdev by the device plugin. That reference is essentially a simple file descriptor, and this is where the UDS comes into play. The application previously had to connect back to the device plugin over the UDS, do a handshake, get the file descriptor, and then create the AF_XDP socket and carry on with the packet processing it was intended to do from the beginning. This is how we were able to achieve the unprivileged model: we exported the privileged operations out into the device plugin. And I just wanted to highlight one other thing about the pod deletion path: the CNI is what's invoked on pod deletion, the device plugin has no awareness of deletion time, and so the CNI had the responsibility of unloading the BPF program.

So if we consider those two privileged operations, the loading and unloading of the BPF program, they actually created quite a number of requirements for the AF_XDP device plugin and the CNI. Firstly, we had to store BPF programs in our codebase, which created a lot of churn, especially when not many of our engineers wrote the BPF programs. It also meant that we had a very limited offering of programs for the networking interface.
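The handshake described above boils down to receiving a file descriptor over a UNIX domain socket. Here is a minimal sketch of what the receiving side of such an exchange can look like, assuming a hypothetical socket path and a trivial one-byte request; the device plugin's actual protocol (request/response framing, validation, and so on) is more involved than this.

```c
/* Minimal sketch: client side of an fd-passing handshake over a UNIX
 * domain socket, receiving the XSK map file descriptor via SCM_RIGHTS.
 * The socket path and request format are hypothetical. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>

static int recv_map_fd(const char *path)
{
    int sock = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    if (sock < 0 || connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;

    /* Receive one data byte plus ancillary data carrying the fd. */
    char data;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };
    int fd = -1;
    if (recvmsg(sock, &msg, 0) > 0) {
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        if (cmsg && cmsg->cmsg_type == SCM_RIGHTS)
            memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
    }
    close(sock);
    return fd;   /* file descriptor for the XSK map, or -1 on error */
}

int main(void)
{
    int map_fd = recv_map_fd("/tmp/afxdp.sock");   /* hypothetical path */
    printf("received map fd: %d\n", map_fd);
    return map_fd < 0;
}
```

Once the application holds that file descriptor, it can register its AF_XDP socket in the map and start receiving packets; the important point is that only the device plugin, not the pod, ever needed the privilege to create that map.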
Secondly, the CNI runs as a binary on the host, right, so you have to statically link all of your dependencies into the CNI, which makes it bulky. You couldn't assume that libbpf or libxdp were available on the host, and we also had BPF code integrated into the CNI itself. And lastly, we had to do all of this loading and unloading of programs and pinning of maps, and every time we upgraded the base container image for the device plugin there would be a different version of libbpf supported, and we'd end up having to do little bits of rewrite here and there, which again created quite a bit of churn for us in the project.

So we decided to integrate with bpfman, and the benefits of that integration were that we were able to remove all of the BPF code from our codebase, which significantly reduced the churn for us. It also means we no longer need to statically link any dependencies into the CNI, simplifying it and making it smaller, which was another advantage. Through the bytecode OCI image specification supported by bpfman, we can actually support a much more diverse set of programs for the pod: it can request anything, as long as it has an OCI image somewhere and bpfman has access to that image. As a next-level integration, which is ongoing at the moment, we're also integrating with bpfman to take advantage of its container storage interface (CSI) driver, which would mean that our bpffs is pinned completely inside the context of the pod, so we don't have to pin something on the host and share it to the pod in a bidirectional fashion for the pod to access it.

So, if we review that snapshot again with a couple of updates. Without the CSI plugin: the Allocate function call comes into the AF_XDP device plugin, which uses the bpfman client APIs to request that it load the BPF program on the interface, and then the device plugin replies back to kubelet, this time with the device to add to the device spec and, before the CSI integration, also the pinned map mount point to mount into the pod. So eventually, when your pod comes up, you might notice there's no handshake anymore: your pod can simply do a BPF object get and start running and executing straight away. Getting rid of the handshake, to be honest with you, is a benefit for us, because it placed a requirement on the application that I don't think it actually needed, now that we can use BPF object get.

With the container storage interface integration, there's just one other addition to the picture. Let's start at the Allocate call coming into the device plugin: bpfman loads the program on the interface, and the device plugin now just sets a few annotations on the pod, which are the XDP program name, the map name, and where to pin them inside the context of the pod. Kubelet will ask bpfman to actually mount that volume inside the pod context, and when the pod is up and running it has a completely cordoned-off bpffs with the map it needs to itself; it does a BPF object get and off it goes.

So, you know, theory is obviously much stronger when it's supported by evidence, and what we did was take the Open Mobile Evolved Core (OMEC), an open-source 5G core; it had a UPF (data plane) implementation specifically that was based on DPDK and SR-IOV.
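For a sense of what "the pod can simply do BPF object get" means in practice, here is a minimal sketch of the post-integration flow inside the pod, with no UDS handshake at all. The pin path is a hypothetical example standing in for whatever the pod annotations and the CSI-mounted bpffs actually provide, and the socket-setup steps are only referenced in comments.

```c
/* Minimal sketch: inside the pod, open the XSK map that bpfman pinned
 * into the pod's own bpffs mount. The pin path is hypothetical. */
#include <stdio.h>
#include <bpf/bpf.h>

int main(void)
{
    int xsks_map_fd = bpf_obj_get("/tmp/afxdp_dp/eth0/xsks_map");
    if (xsks_map_fd < 0) {
        perror("bpf_obj_get");
        return 1;
    }

    /*
     * From here the application sets up its UMEM and AF_XDP socket as
     * usual (with XDP program loading inhibited, since bpfman already
     * attached the program to the interface) and registers the socket
     * in the map it just opened, e.g. with libxdp's
     * xsk_socket__update_xskmap(xsk, xsks_map_fd).
     */
    printf("got xsks_map fd %d\n", xsks_map_fd);
    return 0;
}
```

Compared with the earlier handshake sketch, the only thing the application has to know is a path, which is exactly the requirement reduction described above.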
We migrated it to AF_XDP using that --vdev argument we saw earlier in the slides, without making any changes to the app. We actually have a demo recorded, but I won't show it today; I will leave people to watch it in their own time. It shows an unprivileged OMEC UPF pod coming up, deployed by the AF_XDP device plugin, the CNI, and bpfman. In summary, we found that there was a real development and maintenance cost to independently trying to manage BPF inside our own infrastructure, and essentially I think the device plugin and the CNI are a real use case that shows the benefit of leveraging a centralised manager in Kubernetes. Thank you.

Thank you for that talk. Now we have time for a couple of questions, if there are any.

Did you agree on an interface between bpfman and the AF_XDP device plugin to make sure that, for example, bpfman never breaks the AF_XDP device plugin? I don't think we had to make any agreement, to be honest with you. bpfman works in a pretty standard way in how it loads programs on the interface. So the change was actually, I'm going to be really honest, really simple: I just replaced all the code in the AF_XDP device plugin that was previously loading and unloading BPF programs and so on with a bpfman client API call. It was that simple.

Hi, great talk. My experience with DPDK was usually to run it in a VM, so my question would be: do you have experience with QEMU and AF_XDP, or do you know if that works well? I've seen patches fly through on the mailing list quite recently to integrate it, so I imagine native support would be coming relatively soon in that environment, yeah.

One question about the slide where you're running unprivileged BPF programs: do you happen to run it as a non-root user, or does it need to be a root user to run that? So I was taking a very simple DPDK app, which was already running as a root user, but I think it should work in a non-root user context. I haven't tried it; I was working on the premise that I was using DPDK as it would be deployed today. Right, okay, that's good to know, because we tried this and couldn't get it to work. That was several months ago, so things might have changed. The other question to this was: did you have to enable the unprivileged BPF flag in the kernel? Not for the bpfman integration, no, but in previous work, when we were using the device plugin natively, we did, yes. Right, and what kernel version were you using for this? For this demo that we did, or just in general? It was an Ubuntu image, and I think it was using a 5.x kernel, something above 5.19. Okay, that's fairly old. Thank you. Yes, the image for the OMEC UPF was quite old, so I didn't have much of an option, unfortunately; it was an LTS Ubuntu and kernel.

Okay, thanks for the talk. We're now going to... ...