So, hello everyone. Thank you for being here. I'm Quentin Monnet, I work for Netronome. We make network cards, SmartNICs. In particular, we have a card that is able to offload eBPF applications — eBPF, XDP programs — and this is what I am about to present here today. So this is rather low-level in comparison with most of the previous talks, at least.

Maybe just before starting: eBPF is a virtual machine inside the Linux kernel. It was first created to filter packets; now it is a bit more generic than that. We have a variety of attach points inside the kernel for attaching various kinds of programs. We have TC, traffic control, where you can get packets and process them. We have XDP, which is nearly the same thing, but at the driver level, to get more performance. And we have a lot of tracing and monitoring applications: you can attach programs to kprobes, for example, or to tracepoints.

Regarding eBPF itself, the architecture is not too complicated. It is something close to an assembly language, so you have registers, and arithmetic and logic operations on those registers and their values. There are eleven 64-bit registers and a 512-byte stack. You can read and write data in the context on which the program is working; for packet processing, for example, you access the packet data directly.

You don't have to write every eBPF instruction by hand, in assembly or in binary: there is an LLVM backend that lets you compile from C into an object file containing the eBPF bytecode. That is mostly C, but it can also be Lua or Go or Rust or other languages. Once you have your object file, you can inject it into the kernel with user space tools such as tc or ip, depending on your use case.

When it arrives in the kernel, it is something coming from user space, so we want to make sure it does not hang or crash the kernel. That is a bit sensitive, there are security concerns. So we have a verifier inside the kernel that is supposed to check that the program is safe and secure and won't create any issue inside your kernel. If it passes the verifier, if it is not rejected, it can be interpreted in the kernel — for example, each time I receive a packet on my interface, I run the program on that packet — or it can be JIT compiled, just-in-time compiled, into native instructions to go even faster.

eBPF has a couple of additional features compared with the older version of BPF, which was only for packet filtering. Now we have maps, which are key-value entries that can be stored inside kernel memory and shared between different eBPF programs, or between eBPF programs and user space programs. We have, for example, hash maps, array maps, and a variety of other maps for specific use cases. We have tail calls, which are a kind of jump from one program to another one; this allows chaining several programs, and tail calls can be taken under conditions, so you can run this program or that program depending on a previous condition. And we have helpers, which are functions from the kernel — a whitelist of functions, if you want — that can be called directly from the eBPF programs themselves, for example to get a timestamp, to access maps from the eBPF programs, to get random numbers, or to grow or shrink packets. There are a lot of helpers available, and that is a whole set of functions that would otherwise be somewhat more difficult to implement in eBPF directly.
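To make this a bit more concrete, here is a minimal sketch of what such a program can look like in C. This is my own illustrative example, not code from the talk; the file and symbol names (xdp_count.c, pkt_counter, xdp_count_prog) are made up, and it assumes the SEC(), map and helper definitions from the kernel's samples/bpf/bpf_helpers.h.

    /* Compiled to eBPF bytecode with the LLVM backend, typically something like:
     *   clang -O2 -target bpf -c xdp_count.c -o xdp_count.o
     */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include "bpf_helpers.h"   /* SEC(), map and helper definitions from the kernel's samples/bpf */

    /* One array map holding a single packet counter, shared with user space. */
    struct bpf_map_def SEC("maps") pkt_counter = {
        .type        = BPF_MAP_TYPE_ARRAY,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(__u64),
        .max_entries = 1,
    };

    SEC("xdp_count")
    int xdp_count_prog(struct xdp_md *ctx)
    {
        /* The context gives direct access to the packet data. */
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        __u32 key = 0;
        __u64 *value;

        /* The verifier insists on bounds checks before any packet access. */
        if (data + sizeof(struct ethhdr) > data_end)
            return XDP_DROP;

        /* Helper call: look up the map entry and bump the counter. */
        value = bpf_map_lookup_elem(&pkt_counter, &key);
        if (value)
            __sync_fetch_and_add(value, 1);

        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";

The counter in the map can then be read from a user space program, which is exactly the kind of sharing between BPF and user space I just mentioned.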
So, that is a diagram of the eBPF workflow. Typically, I write my program as C source code, I compile it with LLVM or Clang, and I inject it into the kernel. Say I want to attach my program to the TC interface: I use tc as a command line tool to inject that program through the bpf() syscall. It is checked by the verifier; if it passes the verifications, it can be loaded into the kernel and possibly JIT compiled. Once it is loaded, I can attach it — in this example to TC — with another call, and I can interact with the maps, possibly from user space, if my program uses maps, again through the bpf() syscall. So that is the basics for eBPF with TC.

Now we have another hook besides TC, which is at the driver level. The idea behind XDP, for eXpress Data Path, is that we want to get the packets as soon as possible. Not to send them to user space, as with DPDK or other user space, kernel bypass solutions: we want to process them, possibly in cooperation with the stack, but before they really reach the kernel stack. So the idea is to hook inside the driver and to run your BPF program at this point. It is something quite recent; it has been in the kernel since version 4.8, a couple of years ago. It is a fast data plane solution which is part of the kernel, developed to improve the kernel's networking performance.

If I take another diagram, I am still here in the kernel. I have my net devices, and here is the Linux network stack with, from our previous example, the TC ingress and egress hooks, and the sockets on which I could also attach BPF programs. But here I focus on XDP. Typically, with DPDK, we would send all the packets to user space, process them there, and then send them back to the interface. XDP is not the same: everything stays in the kernel. I get my packets before they reach the kernel stack and I run my program to process each packet. The BPF program can return several values that are interpreted as actions: drop the packet; send it back out of the same interface after editing it — for example, if I just want to add an encapsulation header to my packet and send it back to the same port, I can do that; redirect it to any other net device; or pass it on to the stack. If I want some more complex processing — the idea is not to implement complex processing steps with XDP, but rather to have something fast for the simplest use cases — I send my packets to the network stack instead, so I can still benefit from everything that is available in the Linux kernel stack. It is really a way to implement something fast inside the kernel.
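Coming back to those return actions, here is a minimal sketch of what they look like in a program. Again this is an illustrative example of mine, not code from the talk; the names (xdp_reflect, xdp_reflect_prog) and the bpf_htons macro are my own, and it assumes the SEC() macro from the kernel's samples/bpf.

    /* Bounce IPv4 packets back out of the port they arrived on after swapping
     * the Ethernet addresses; hand everything else to the kernel stack. */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include "bpf_helpers.h"                     /* SEC() macro from the kernel's samples/bpf */

    #define bpf_htons(x) __builtin_bswap16(x)    /* assuming a little-endian host */

    SEC("xdp_reflect")
    int xdp_reflect_prog(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        unsigned char tmp[ETH_ALEN];

        if (data + sizeof(*eth) > data_end)
            return XDP_DROP;                     /* malformed runt frame: drop it */

        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;                     /* let the kernel stack handle it */

        /* Swap source and destination MAC addresses... */
        __builtin_memcpy(tmp, eth->h_dest, ETH_ALEN);
        __builtin_memcpy(eth->h_dest, eth->h_source, ETH_ALEN);
        __builtin_memcpy(eth->h_source, tmp, ETH_ALEN);

        /* ...and send the packet back out through the port it arrived on. */
        return XDP_TX;
    }

    char _license[] SEC("license") = "GPL";

An object file like this would then be attached with the ip command mentioned above, or with tc for a TC classifier.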
So what do we actually use XDP for? The main use cases are load balancing and protection against denial of service, distributed denial of service attacks; Facebook in particular has presented work on these use cases. There are other applications, including distributed firewalls, and a lot more use cases that are maybe not as widely deployed but very interesting too: for example, packet capture in Suricata, which is starting to use XDP in its latest versions. For those who were here last year and attended my presentation, I was speaking about stateful packet processing with the BEBA research project, and that was also based on eBPF already. Cilium also does interesting things for securing Linux containers. So we do have a number of interesting use cases for BPF; it is already deployed, it is available in Linux.

And what we are trying to do from here is to get even more performance and to send those BPF programs to the device itself. So why do we want to offload? Because BPF is a good target for offload: it is pretty much self-contained, it is a program running on its own once it is JIT compiled, and it is something performant, it goes fast. If we want to go even faster and to take some of the load off the host CPUs, the only solution we have left is to run the programs on the hardware in the end. That is what we are trying to do at Netronome. We also want something that works with Linux, with the kernel: we don't want a proprietary language for programming offloaded applications on the cards. So we are using BPF and XDP.

So how does that work? First, if you want to offload, you have to have the right architecture: you must have a card that is able to offload BPF. This is a simplified view, of course, of the Netronome NFP card. You have a NIC, you have a chip on that NIC, and six islands per chip. Each of the islands used for BPF has ten cores, and on each core we have four threads. On those threads we have general purpose registers, we also have one kilobyte of local memory, and we have other areas of memory on the islands, on the chip, or on the NIC. And it so happens that with this architecture we are able to map the main components of BPF programs to the components of the NIC. For example, the eleven registers used by BPF can be mapped to the general purpose registers in the cores — I think we have 32 such registers, so that is enough for the eleven from BPF. The BPF stack is small enough to fit in the local memory of the threads on the cores. And the maps can be stored on the NIC — there is some subtlety here: we are also trying to cache some map entries in other memory areas, in particular in the CTM here, but basically that is too small to hold all the maps, so we try to keep as much as we can close to the cores, and otherwise it goes in the DRAM on the NIC.

So we have this architecture. We also have to be sure that the processors are able to run BPF. What happens in the non-offloaded case is that BPF programs can be JIT compiled; when we offload, we do the same, except that we have our own JIT compiler, which lives in our driver. So we are essentially calling our compiler to translate the BPF bytecode into native instructions for the card, which means that we also have to make sure it can run on 32-bit registers: BPF bytecode works on 64-bit registers, and our processors only have 32-bit registers. Happily, we are able to run BPF this way, by turning the 64-bit operations into 32-bit operations. We also have a number of optimizations in our JIT compiler to improve the code that we will run, to save some instructions where we can and to go faster on some specific sequences of instructions, in order to be as fast as possible.
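As a rough illustration of that last point — and this is just the general idea, not actual NFP JIT output — a 64-bit BPF register can be held as a pair of 32-bit halves, with the carry propagated by hand, something like this:

    #include <stdint.h>

    /* One 64-bit BPF register represented as two 32-bit halves. */
    struct reg64 {
        uint32_t lo;
        uint32_t hi;
    };

    /* Equivalent of the 64-bit BPF instruction "dst += src" on a 32-bit machine. */
    static void add64(struct reg64 *dst, const struct reg64 *src)
    {
        uint32_t old_lo = dst->lo;

        dst->lo += src->lo;
        dst->hi += src->hi + (dst->lo < old_lo ? 1 : 0);  /* propagate the carry */
    }

Most 64-bit ALU and memory operations can be decomposed in a similar way, at the cost of a few extra 32-bit instructions.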
So we have our architecture. Now we must also make sure that the kernel is able to pass the program and the relevant metadata to the card, and for this we had a lot of work to do on Linux itself.

This diagram is nearly the same as the one I showed at the start of the presentation. We still have a user space program injecting a BPF program through the bpf() syscall, then the verifier, then the program being attached inside the kernel, possibly after JIT compilation. Now if we want to offload programs, there is some additional work to perform at the driver level. In particular, there are a number of operations here that are exposed by the driver — for example to prepare the verifier, to attach programs, and so on — and we also have to make sure that the kernel knows that this program is to be offloaded. Previously, we would pass to the bpf() syscall the list of instructions of the program, its length, its type — is it a TC classifier, is it an XDP program, is it something else. Now there is a new attribute added for offload, which is the ifindex: the index of the interface to which we want to offload our program. With this value, the kernel is able to understand that the program is to be offloaded and to treat it as such.

So we have those operations made available through the BPF ndo, the network device operation. When a program is offloaded, the kernel will first find, from the ifindex, the device to which the program is to be offloaded, and get the list of those operations. Then, each time the BPF ndo is called after that, we can directly call the callbacks from the driver to process our BPF bytecode. For example, before starting the verifier itself, we can call a function from the driver to prepare the verification. We also have checks made at the driver level to ensure that we can offload that program and that we won't have issues by offloading it: for example, we may reject programs that try to use BPF instructions we don't support, or helpers that we don't support. This is all done by a callback, set up after the verifier preparation, which is called each time the verifier checks an instruction. So we have a single callback to make sure that every instruction is not only good for the kernel verifier, but also good for the card. We can also call our translation function from the JIT compiler — so at that point it is not really the JIT compiler from the kernel anymore: we call instead the JIT from our driver, which compiles to NFP instructions, NFP being the architecture of our card. So we have a list of callbacks exposed through this ndo, this network device operation, and all of this makes it possible to get the programs in the correct shape to run them on the card rather easily.

Once we have our program checked, verified and JIT compiled, we can call tc again. By calling into the bpf() syscall, the TC framework in the kernel is able to get this offloaded object and to attach it, for example, to the TC interface with ndo_setup_tc; or, if we don't want a TC program but an XDP program instead, that will be the ip command, and again this ndo, through which we call the XDP attach function.
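To make that new attribute concrete, here is a hedged sketch — my own illustration, not code shown in the talk — of what a program load request with the offload ifindex looks like from user space:

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <net/if.h>
    #include <linux/bpf.h>

    static uint64_t ptr_to_u64(const void *ptr)
    {
        return (uint64_t)(unsigned long)ptr;
    }

    /* Load an XDP program and ask the kernel to offload it to the given netdevice. */
    int load_offloaded_prog(const struct bpf_insn *insns, int insn_cnt,
                            const char *ifname)
    {
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.prog_type    = BPF_PROG_TYPE_XDP;
        attr.insns        = ptr_to_u64(insns);
        attr.insn_cnt     = insn_cnt;
        attr.license      = ptr_to_u64("GPL");
        /* The offload-specific part: name the target device by its ifindex. */
        attr.prog_ifindex = if_nametoindex(ifname);

        return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
    }

The only difference from a regular load is the prog_ifindex field; everything else goes through the same bpf() syscall path.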
Just a few additional remarks on this aspect. The verifier uses a callback to check every instruction — I have said that already — and the driver also has its own error messages. We can improve the debugging by using, for example, the verifier log buffer from the kernel to retrieve errors at verification time, or we can use other mechanisms such as netlink extended acks, extack, to get information in the console rather than in the kernel logs when we have errors from the driver. So that is just a remark.

Anyway, we have our architecture, we have offload support in the kernel now, and we still have to update the tools that we use to work with BPF. For example, I use tc for offloading programs, and the ip command from the iproute2 package to attach XDP programs. By default, those tools do not support BPF offload. So what we had to do was to patch them, to update the syntax of those commands so we can tell them: look, I want to offload this program to the NIC, so I want you to understand that I am trying to offload, and I want you to pass the ifindex of the interface to the kernel. This is work we have done already, so the latest versions are patched. We also have to make them ask the kernel to create the BPF maps on the device itself, for offloading programs that use maps. And we also had to create or update some other tools used with BPF: for example bpftool, which is now in the kernel tree, is a simple tool that is able to list, load, pin, or dump the instructions of JITed or non-JITed BPF programs; it also works with maps and with cgroups. That is something we have developed. We have also patched llvm-mc in order to be able to compile a human-friendly form of BPF assembly directly into object files; this is very useful in particular to produce object files for some specific sequences of instructions, because we don't have to edit binary files by hand. That is pretty handy.

So once we have all of this, we still want to get better performance again and new features, and to do our best to have everything work, and work very well. If I summarize what we already have in the kernel and on the card: we have TC and XDP hardware offload; we are working on 32-bit registers; we have various JIT optimizations; nearly all BPF instructions are supported; we support the stack; we support some helpers — not many, but a couple of them; we can access the packet contents directly, so read and write packet data or packet headers; and we have several XDP actions supported, so we can send the packet back to the port it came from, pass it to the stack — so to the host from the NIC — drop packets, or create new headers to encapsulate packets. We do support maps now — that is quite recent — with hash maps and array maps; they are still read-only from the BPF programs at this time, but they can be updated from user space programs. We have tried to improve error messages too, with the BPF verifier log and extack. And we have updated the tools that we use to work with BPF.
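On that last point about maps — a hedged sketch again, not code from the talk — updating an offloaded map from user space goes through the exact same bpf() syscall command as for an ordinary map living in host memory:

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/bpf.h>

    static uint64_t ptr_to_u64(const void *ptr)
    {
        return (uint64_t)(unsigned long)ptr;
    }

    /* Write one entry into a BPF map through its file descriptor. */
    int update_map_entry(int map_fd, uint32_t key, uint64_t value)
    {
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.map_fd = map_fd;        /* fd obtained at map creation or via pinning */
        attr.key    = ptr_to_u64(&key);
        attr.value  = ptr_to_u64(&value);
        attr.flags  = BPF_ANY;       /* create the entry or update it in place */

        return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
    }

bpftool can be used in a similar way to inspect or update map entries from the command line.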
And with all of this, we obtain pretty interesting performance. The example here is an XDP load balancer, a combination of two sample files from the Linux kernel tree: one is the XDP TX IP tunnel sample, the other one is a layer 4 load balancer. The program is about 800 BPF instructions, with four map lookups. We just replaced the per-CPU array maps — you have those in BPF, but we don't support them on the card — with simple array maps instead. And so, with what we have on the card and in the kernel, we obtain more than 40 million packets per second. That is pretty good for now.

We are still trying to improve that and to get more features. We are working on adding the redirect action for XDP, and atomic add operations. We are trying to improve map management by caching some entries from the maps, and packet management too: we would like to bring the packets closer to the cores on the chip. We are trying to improve 32-bit instruction support; in particular, I would like Clang and LLVM to generate code that is easier to use with 32-bit registers. We want to remove some locks that we use in the firmware for the maps, in order to double the memory bandwidth for the maps. We have also worked on tail calls, to be able to chain several programs on the NIC — and possibly, if we don't support some feature that is required for one program in particular, to continue chaining programs on the host; that is in progress too. We want to keep improving bpftool: we cannot, at this time, dump the JIT-compiled instructions, because the patch we use for this has not been pushed upstream yet. We have more JIT compiler optimizations in the works too. So there are a lot of things we are trying to do, to push for the next Linux releases.

So basically, that's it, with still a lot more work to do. Just to summarize: eBPF and XDP are very efficient, and we are trying to take as much of the load as possible off the CPU by offloading all of this. But it is still a way to cooperate with the Linux kernel: we are not trying to bypass everything, we stay in the kernel and can pass packets to the stack if that is required. And the good thing, for other people and other companies that would like to try to offload BPF in the future, is that everything in the kernel and the driver is upstream — it is in Linux already. So that is a good thing, I think. So, that's it. Let's see if we have time for questions. Thank you.

There are questions — three minutes. Can we share a map between several XDP programs? I don't know. On Linux itself, you would only have one XDP program at a time on an interface. So you would like to share that map, which is on the card itself, not on the host — you would like to share it with other devices. I don't think we can do that at this time, and I don't know what the performance of remote accesses to that map would be. At this time, we can share them with anything that updates or reads the map from user space; if it is on another NIC, I don't know how we could do that yet. Not at this time, I think.

There is time for one more question. But between XDP and TC, probably, for example? Okay, so the gentleman here. I am going to repeat the question: do you have any support for transmit queues? For transmit queues, so on the TX side. XDP is only on the RX side, so if we are talking about BPF offload, no, it is ingress only. On the TX side, I don't... Okay, so I think we are out of time. Thank you. Thank you very much.