As I already mentioned, I want to talk about eBPF and XDP: what they are and what some of the recent developments in the Linux kernel community are. To start off, what is BPF? Probably every one of you has used tcpdump in one way or another. When you enter a filter expression in tcpdump, that filter gets compiled down into bytecode, and this bytecode is BPF. The compilation is done by libpcap, and the result is pushed into the kernel and attached to an AF_PACKET socket. That has existed for a very long time, and nowadays we call it classic BPF, or cBPF. A couple of years ago it was heavily extended in the kernel into something called eBPF. eBPF is an efficient, generic, in-kernel bytecode engine, and it is used in many different kernel subsystems right now. There are roughly three major areas: networking, tracing, and sandboxing or security. For tracing, you have perf, where you can attach BPF programs to tracepoints, kprobes, and things like that; every time such an event triggers, an eBPF program runs, and you can analyze different things in the kernel, read kernel memory, and so on. For sandboxing, probably the most prominent user is the Chrome web browser, which uses seccomp BPF to filter system calls, to whitelist system calls, and so on. That is quite flexible, although seccomp BPF is not eBPF yet; it is still classic BPF that gets pushed down into the kernel. And then there is networking, which is what we want to talk about here. There are various hooks. In sockets you have socket filters with AF_PACKET, which I already mentioned. Then you have various demux facilities, for example SO_REUSEPORT: when you have multiple sockets in a group that share a specific port, an eBPF program can decide to which socket in the group a new connection goes. Similarly, AF_PACKET has fanout, where you have multiple AF_PACKET sockets and the traffic gets load-balanced to one of the sockets participating in the fanout group. And then you have XDP and TC. I will start with the facility we have in TC. It is called cls_bpf, the BPF classifier. It is called a classifier for historical reasons, but it can do much more than just classification: it is a flexible, programmable packet processor in the TC subsystem. You can attach an eBPF program there on the ingress or egress path of a networking device in the kernel data path. Then we also have XDP, which was recently introduced into the kernel. This is also a programmable, but high-performance, in-kernel packet processor, attachable on the ingress side of drivers that support it; I will talk a bit more about that later. Both are complementary to each other: cls_bpf can be attached to all networking devices, including virtual ones like the veth devices used for different network namespaces, and it can be attached on ingress and egress, while XDP currently works on ingress only. For cls_bpf, the input context of the BPF program is the socket buffer, the kernel's representation of a network packet, so the program works with that data and has a somewhat richer context available. So both are complementary to each other and quite flexible.
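To make the classic BPF case from the beginning concrete, here is a minimal sketch of my own (not part of the talk) of how a hand-written cBPF filter gets attached to a packet socket via SO_ATTACH_FILTER; tcpdump and libpcap generate comparable bytecode from a filter expression. The filter itself is just an illustration: it accepts ARP frames and drops everything else.

```c
/* Minimal sketch: attach a classic BPF (cBPF) filter to a packet socket.
 * The filter accepts ARP frames and drops everything else; libpcap
 * generates comparable bytecode from an expression like "arp". */
#include <stdio.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/filter.h>

int main(void)
{
    struct sock_filter code[] = {
        BPF_STMT(BPF_LD  | BPF_H   | BPF_ABS, 12),             /* load EtherType   */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_ARP, 0, 1),  /* is it ARP?       */
        BPF_STMT(BPF_RET | BPF_K, 0xFFFFFFFF),                 /* accept the frame */
        BPF_STMT(BPF_RET | BPF_K, 0),                          /* drop the frame   */
    };
    struct sock_fprog prog = {
        .len    = sizeof(code) / sizeof(code[0]),
        .filter = code,
    };

    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0 || setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
                             &prog, sizeof(prog)) < 0) {
        perror("socket/setsockopt");
        return 1;
    }
    /* recvfrom(fd, ...) now only ever sees ARP frames. */
    return 0;
}
```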
Before I dive into the details of both subsystems, I want to briefly present the eBPF architecture. It consists of eleven 64-bit registers with 32-bit subregisters, a limited stack of currently 512 bytes on the kernel stack, and, implicitly, a program counter. The instructions are fixed size, 64 bits wide, and you can have a maximum of 4,096 instructions per program, which doesn't sound like much, but you can still do quite a lot, and quite complex things, with it. Compared to the cBPF we know from tcpdump, which I mentioned earlier, there are various new instructions. You get the full 64-bit width for ALU operations, for example, because cBPF was 32-bit only, and there is a call instruction with which you can call into a helper function; I will explain that later. Some of the core components of the architecture: you have read and write access to the input context. The input context, as I mentioned, is the socket buffer in the TC case; in XDP it is a small representation through which you can access the packet data. Then there is the helper function concept, which means the kernel exposes a fixed, well-defined set of helper functions that a BPF program can call into; it cannot call into arbitrary kernel code, only that fixed set. Then there are maps. Maps are efficient key/value data structures that can be shared arbitrarily, which means one or multiple BPF programs can access a map, and user space can read or update data in it at the same time. Then there are tail calls, a concept where one BPF program can call into another BPF program, so you are not stuck with the fixed limit of 4,096 instructions per program but can go beyond it. That is quite useful, and since a tail call reuses the same stack frame, it is also quite fast. Then there is the concept of object pinning. From the user-space perspective, programs and maps are accessed through file descriptors, and a file descriptor is only visible to the process that holds it, so once you load a BPF program into the kernel it is basically unavailable to other programs. To share maps with other applications, you can pin them as a node in a pseudo filesystem, and those other programs can then retrieve the map from there and read or update its data as well. In the kernel there is also a cBPF-to-eBPF translator, which means that whenever you load a classic tcpdump-style filter into the kernel, it gets transparently translated and only ever runs as eBPF. The nice thing about eBPF is that, on the user-space side, LLVM has an eBPF backend, which means you can write a C-like program and LLVM translates it into an object file containing eBPF instructions that can then be loaded into the kernel. And last but not least, there is not only an eBPF interpreter in the kernel but also a just-in-time compiler, so whenever you load a program and the JIT is enabled, it gets compiled to native opcodes and runs with native performance. All of that is managed through the bpf() system call core, so it is a stable API, and your programs are basically stable as well: once a kernel supports them, it is guaranteed that newer kernels will also support them in the future, unlike, for example, a kernel module, where internals can always change.
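To tie the architecture pieces together (the LLVM backend, maps, and helper calls), here is a minimal sketch of my own, not from the talk: a restricted-C program that counts packets in a per-CPU array map. The map definition and helper declaration roughly follow the style of the kernel's samples/bpf; section names and loader conventions differ between loaders, so treat them as assumptions.

```c
/* Minimal eBPF sketch in restricted C, compiled with the LLVM BPF backend,
 * e.g.: clang -O2 -target bpf -c count.c -o count.o
 * It counts packets per CPU in an array map using a helper call. */
#include <linux/bpf.h>

/* Helper function: resolved by the kernel at load time via its ID. */
static void *(*bpf_map_lookup_elem)(void *map, const void *key) =
        (void *) BPF_FUNC_map_lookup_elem;

/* Map definition layout as understood by samples/bpf-style loaders. */
struct bpf_map_def {
        unsigned int type;
        unsigned int key_size;
        unsigned int value_size;
        unsigned int max_entries;
        unsigned int map_flags;
};

struct bpf_map_def counter_map __attribute__((section("maps"), used)) = {
        .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
        .key_size    = sizeof(unsigned int),
        .value_size  = sizeof(unsigned long long),
        .max_entries = 1,
};

__attribute__((section("socket"), used))
int count_packets(struct __sk_buff *skb)
{
        unsigned int key = 0;
        unsigned long long *count;

        count = bpf_map_lookup_elem(&counter_map, &key);
        if (count)
                (*count)++;     /* per-CPU value, so no locking needed */

        /* Socket filter return value is the number of bytes to keep;
         * 0 means the packet is not passed to the socket. */
        return 0;
}
```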
To give you some more details regarding TC: you basically attach cls_bpf to a qdisc. There is a qdisc called clsact, which is a pseudo-qdisc similar to the ingress qdisc in the kernel; it exists only for the purpose of attaching classifiers and actions, if you are familiar with TC. In our case, what we want to attach is of course cls_bpf. There are two central hooks the clsact qdisc provides, the ingress and the egress hook, and they sit in really central places that every packet has to go through on the RX and TX side. cls_bpf runs eBPF; historically it also supports cBPF, so both flavors. And you can atomically update your programs at runtime, which is really useful: either the root program attached to the cls_bpf classifier itself, or the tail-called programs that one program calls into, can be replaced at runtime without restarting your networking interface or anything like that. cls_bpf also has a fast path. Historically, TC has classifiers and actions, as I mentioned, and once a classifier is done classifying it can call into a chain of actions, which is all pretty inefficient. Since eBPF can do everything contained in itself anyway, cls_bpf has a so-called direct-action mode, where the eBPF program can classify and mangle, perform various actions on the socket buffer it gets as input, and then just return a verdict and be done with it. It doesn't call into a chain of actions; everything is done in the eBPF program itself. There is also an offload interface available. For example, the Netronome NFP driver, if you have such a Netronome SmartNIC, can offload eBPF programs, which means that when you load your program into cls_bpf and have such a card, it gets translated into native code for the instruction set of the NFP. So a typical workflow would be: you write your C program, you compile it with LLVM, and it gets translated into eBPF instructions contained in sections of an ELF object file. tc can read that object file and load the instructions into the kernel. In the kernel they get verified, which means the verifier makes sure the instructions cannot crash the kernel, cannot create infinite loops, things like that. Then the program gets just-in-time compiled, pushed down into the cls_bpf classifier, and eventually offloaded. That's the typical workflow.
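As an illustration of that workflow, here is a minimal sketch of my own (not from the talk) of a cls_bpf program written for direct-action mode: it drops everything that is not IPv4 on ingress and lets the rest pass. The section name and the exact tc invocation in the comment depend on your iproute2 version, so treat them as approximations.

```c
/* Minimal cls_bpf direct-action sketch: drop non-IPv4 traffic on ingress.
 *
 * Build and attach (approximate, depends on the iproute2 version):
 *   clang -O2 -target bpf -c drop_non_ip4.c -o drop_non_ip4.o
 *   tc qdisc add dev eth0 clsact
 *   tc filter add dev eth0 ingress bpf da obj drop_non_ip4.o sec ingress
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>

/* htons() for a compile-time constant; BPF programs cannot link libc.
 * This assumes a little-endian host. */
#define bpf_htons(x) __builtin_bswap16(x)

__attribute__((section("ingress"), used))
int drop_non_ip4(struct __sk_buff *skb)
{
        /* skb->protocol holds the EtherType in network byte order. */
        if (skb->protocol != bpf_htons(ETH_P_IP))
                return TC_ACT_SHOT;   /* drop the packet */

        return TC_ACT_OK;             /* let the packet continue */
}
```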
XDP, on the other hand, was introduced into the kernel about a year ago. The objective there is that it is really tailored to high-performance packet processing, so what it does is run an eBPF program at the earliest possible point in the driver. That is even before an SKB metadata structure has been allocated, which is one of the overheads the kernel has when doing high-performance processing of small packets in particular. The nice thing is that this framework works in concert with the kernel, which means it uses the same security model; you don't pull your NIC or driver representation into user space and expose it there, it all stays in the kernel, and no out-of-tree module is needed to use it. The packet itself also stays in the kernel, which means you don't have to cross any boundaries to get it out or to push it back into the stack. That is nice, for example, when you work with containers: one use case could be that you implement a firewalling application, where you can already filter packets out at the earliest possible point on your NIC, and the remaining packets can then be passed further up the kernel stack and into some of the containers. Other use cases include load balancing, for example direct-server-return load balancing is possible with it today, as well as packet mangling and forwarding, anti-denial-of-service measures, and monitoring; there is currently an interface for pushing packets to user space for sampling and monitoring from the XDP side. You have various verdicts, similar to TC: you can drop packets, pass them further up to the kernel stack, or transmit them back out again. There are currently a number of drivers that support XDP, for example the Mellanox drivers mlx4 and mlx5, then the Netronome driver again, QLogic, whose qede support was recently merged, Intel's i40e, which was posted to the netdev list and should be merged soon, as well as bnxt. Some of the drivers, depending on their implementation, also allow atomic updates, and in the future most likely all of them will: once you have your XDP program loaded and want to update it with a new version, that can happen without disrupting any traffic. XDP also has an offloading interface, and again Netronome has support for it, so once you load your XDP program, Netronome can offload it, currently with a couple of limitations, but it's already a good step forward. The workflow here is similar to what I mentioned earlier: you have your C program, LLVM, and this time ip, so the ip link command can be used as a loader to push the BPF instruction sequence down into the kernel. Now for some of the features. I mentioned maps, which are efficient key/value stores. From a BPF program you can look up, update, or delete elements. There are currently a couple of flavors: array maps, hash tables, least-recently-used (LRU) maps, and a longest-prefix-match (LPM) trie, which was merged not too long ago. They are all usable from both program types, and for the array, hash, and LRU maps you also have per-CPU variants. There is also the possibility to preallocate the whole map memory, so you have a pool and don't have to allocate elements through the normal kernel allocation facilities. Then there are specialized maps, which cannot be used like the generic ones but are instead used together with specific BPF helpers, for example the program array map used for tail calls: you store program file descriptors in that map and can then jump into one of those programs through it. There is direct packet access, supported in both cases, cls_bpf and XDP. Direct packet read and write means that, whereas in the past cls_bpf had to use helper functions where you pass in a buffer on the stack that gets filled with the requested number of bytes from a given offset, you can now, for example, just cast a header structure directly over the packet data without going through a helper, so it is a performance optimization.
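To illustrate the verdicts and direct packet access on the XDP side, here is a minimal sketch of my own (not from the talk): an XDP program that drops IPv6 frames and passes everything else, including the bounds check the verifier requires before touching packet data. The section name and the ip link invocation in the comment are approximations and depend on your iproute2 version.

```c
/* Minimal XDP sketch: drop IPv6 frames, pass everything else.
 *
 * Build and attach (approximate, depends on the iproute2 version):
 *   clang -O2 -target bpf -c drop_ip6.c -o drop_ip6.o
 *   ip link set dev eth0 xdp obj drop_ip6.o sec prog
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>

#define bpf_htons(x) __builtin_bswap16(x)   /* assumes little-endian host */

__attribute__((section("prog"), used))
int drop_ip6(struct xdp_md *ctx)
{
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;

        /* The verifier rejects the program without this bounds check. */
        if ((void *)(eth + 1) > data_end)
                return XDP_PASS;

        if (eth->h_proto == bpf_htons(ETH_P_IPV6))
                return XDP_DROP;      /* drop at the earliest point */

        return XDP_PASS;              /* hand the packet to the stack */
}
```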
There is also some additional metadata available in the program context. In the socket buffer case with cls_bpf, for example, you can use skb->mark; there are various other fields as well, so the program can interact with other facilities of the kernel. Regarding packet forwarding: cls_bpf, as I mentioned, can forward packets to any networking device, including virtual ones, so you can push a packet into a container over a veth device, for example, redirect it out the same or a different port, or loop it back to the RX path. For XDP this is currently still limited: you can only transmit the packet back out of the same port it came in on, but there is work in progress to also support multi-port TX, transmitting to a different net device that also supports XDP. That will come in the future, but even restricted to the same port it is still useful, because, for example, in the load balancing use case with direct server return, your node running the XDP load balancer gets the packet in, rewrites various parts of it, and pushes it out again, while the reply packets, since it is direct server return, take a different path back. So it is still very useful. Some miscellaneous things: you can do encapsulation, so VXLAN, Geneve, GRE, and IPIP encapsulation are available for cls_bpf. In XDP you are actually far more flexible: you can do any kind of encapsulation you want, because at that early point the kernel doesn't know anything about the incoming packet yet, so you are not limited by the internal details of the socket buffer. Then there is event notification, which is a really useful feature: you can take your own custom structs in the eBPF program and push them out through a high-performance, per-CPU, memory-mapped ring buffer to user space. That can be some specific metadata plus packet data, up to the full packet. This is useful if you have management daemons that listen for certain events and later update map data or tail-call programs and so on. You also have checksum mangling helpers, cgroup support, and various other things. iproute2 as a loader, I already mentioned, can push the eBPF bytecode down into the kernel. Just to give you an example of how it looks: in the tc case you set up the qdisc first, then you set up the filter, where you define whether it is the ingress or egress side you want to hook the eBPF program into, and then you specify the object file itself; the object can have various sections which contain the actual program code. In the XDP workflow you currently say you want to load the program via ip link, likewise specify the object, and it gets pushed down. Both share a common loader library as the backend, so it doesn't necessarily have to be iproute2; you can also write your own loaders or use other libraries that support it, it is just one flavor of doing it. Then regarding just-in-time compilers: there is support for x86, arm64, PowerPC, and s390x, so they all have eBPF JITs, and the PowerPC one got merged recently, which is nice. The same goes for the NFP driver, which also implements a just-in-time compiler for the NFP-specific instruction set for the networking offloads. There are also various measures for hardening those JITed programs; I'm not going too much into detail here.
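Going back to the event notification feature mentioned a moment ago, here is a minimal sketch of my own (not from the talk) of a cls_bpf program pushing a small custom struct to user space through a per-CPU perf event ring buffer. The struct layout, map and section names are illustrative only, and the user-space side (setting up the perf events and placing their file descriptors into the map) is omitted.

```c
/* Minimal event notification sketch: push a custom per-packet record to
 * user space via a per-CPU perf event ring buffer. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>

static int (*bpf_perf_event_output)(void *ctx, void *map,
                                    unsigned long long flags,
                                    void *data, unsigned long long size) =
        (void *) BPF_FUNC_perf_event_output;

struct bpf_map_def {
        unsigned int type;
        unsigned int key_size;
        unsigned int value_size;
        unsigned int max_entries;
        unsigned int map_flags;
};

/* One perf event ring per CPU; user space mmap()s and reads these. */
struct bpf_map_def event_map __attribute__((section("maps"), used)) = {
        .type        = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
        .key_size    = sizeof(int),
        .value_size  = sizeof(unsigned int),
        .max_entries = 64,            /* >= number of possible CPUs */
};

struct event {                        /* custom per-packet record */
        unsigned int ifindex;
        unsigned int len;
};

__attribute__((section("ingress"), used))
int notify(struct __sk_buff *skb)
{
        struct event ev = {
                .ifindex = skb->ifindex,
                .len     = skb->len,
        };

        /* BPF_F_CURRENT_CPU selects the ring of the CPU we run on. */
        bpf_perf_event_output(skb, &event_map, BPF_F_CURRENT_CPU,
                              &ev, sizeof(ev));

        return TC_ACT_OK;
}
```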
There are other recent improvements as well. For example, LLVM's eBPF backend recently got debug info support, which means the generated object code can be annotated with the actual source code you wrote, which is definitely an improvement for debugging. There are various verifier improvements so that the kernel recognizes LLVM-generated code a bit better and accepts those programs as well. And attaching BPF programs to tracepoints was recently merged, and so on. A couple of next steps: there definitely still have to be improvements on the verifier side in terms of logging, to provide more helpful error messages, for example, and in terms of search pruning, so that the verifier doesn't have to do so much work to verify every possible path of the program. And there has to be better XDP driver support in the future, but that will come over time. So those are some examples of next steps.

Regarding code: if you want to look at some code and how the eBPF pieces play together, I can recommend github.com/cilium, a project which basically implements eBPF and XDP for containers; there will be a talk about it afterwards. Everything else on the kernel and iproute2 side is merged upstream. For further information, you can also look at the netdev conference papers, the kernel documentation, and so on. Are there any questions?

There is an asymmetry between the features supported by cls_bpf and XDP. I know they are at two different levels in the stack, but would you expect them to converge over time?

Right. I don't expect them to converge entirely, because those are two different layers of the stack, and for some situations or use cases one might be better suited than the other. So it's not like they will become exactly the same; as I mentioned earlier, they are complementary to each other.

Is it possible to modify packets in XDP based on state, for example when we want to rewrite addresses, to do NAT or something like this?

I should repeat the question: is it possible to modify packets based on state, for example for NAT? It definitely is. State you can hold in BPF maps, that's what they are there for, and as I mentioned, there are direct read and direct write possibilities, so you can modify the packet.

Does that also work when you offload to the hardware?

I'm currently not quite aware of which features Netronome supports, but I think they should be able to write to packets as well. One limitation of the Netronome hardware is that it does not yet support BPF maps, but that will come in the future; there is work in progress on the Netronome side. I don't work for them, I'm just telling you what I know.

One of the next steps for XDP is being able to forward to any port. What is the most difficult part of that task?

So the question is: one of the challenges in XDP is forwarding to a different port, and what is the most difficult part there? There is currently work going on regarding a page pool allocator, because if you want to forward a packet even to a different NIC, the drivers have to have some common understanding of how to transfer the packet from one side to the other
without copying, for example. You need something like a common, shared page pool construct so that you can just hand those pages over to the other driver and it can transmit them. So there is still, I think, quite some work needed to get that supported.

What about the locking logic for maps? For maps, whether you use a per-CPU variant or not, you are basically on your own, which means the program has to be written in a way that it doesn't have race conditions. There are instructions you can use, for example an atomic add, to increment counters safely; a small sketch of that pattern follows at the end. For the hash tables you have the update helper call, which atomically replaces one map element with another. So it really depends on the map, and there are various performance improvements still to be done there.

All right, you can also grab me in the hallway if there are further questions. So thank you very much.
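As a footnote to the locking question, a minimal sketch of my own, assuming the LLVM BPF backend, of the atomic-add pattern mentioned in the answer: incrementing a counter in a map that is shared across CPUs, without any explicit lock. Map and section names are illustrative.

```c
/* Sketch for the map locking question: incrementing a counter in a map
 * shared across CPUs. __sync_fetch_and_add() is lowered by the LLVM BPF
 * backend to an atomic add instruction, so no explicit lock is needed
 * for this simple read-modify-write. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>

static void *(*bpf_map_lookup_elem)(void *map, const void *key) =
        (void *) BPF_FUNC_map_lookup_elem;

struct bpf_map_def {
        unsigned int type;
        unsigned int key_size;
        unsigned int value_size;
        unsigned int max_entries;
        unsigned int map_flags;
};

/* A single shared (not per-CPU) counter, visible to all CPUs at once. */
struct bpf_map_def shared_counter __attribute__((section("maps"), used)) = {
        .type        = BPF_MAP_TYPE_ARRAY,
        .key_size    = sizeof(unsigned int),
        .value_size  = sizeof(unsigned long long),
        .max_entries = 1,
};

__attribute__((section("ingress"), used))
int count(struct __sk_buff *skb)
{
        unsigned int key = 0;
        unsigned long long *val;

        val = bpf_map_lookup_elem(&shared_counter, &key);
        if (val)
                __sync_fetch_and_add(val, 1);   /* atomic across CPUs */

        return TC_ACT_OK;
}
```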