Welcome to DevCon, welcome to our session. My name is Victor, this is Artem, and we're both from Red Hat, from the core kernel team. We'd like to tell you about what happened in the BPF world in the past year or two.

Let me start with a bit of statistics. You can see the number of commits and the number of changes in the BPF subsystem in the kernel throughout the past two years, and you can clearly see that the trend is rising, so BPF is really getting a lot of attention upstream; there's new work appearing practically every day. When we set up this talk, we wrote down everything we found interesting from the past two years, and the result was more than one hour long, so we had to cut it to less than half to match the schedule. We won't be covering everything, but we will cover the parts that are, from our point of view, most interesting to people who develop their own BPF applications.

Maybe let me first start with a quick poll: how many of you have ever written a BPF application of any sort, in any framework? Okay, that's quite a few, cool. This talk will be quite technical, so hopefully you have at least some basic knowledge of BPF.

Let me do a quick recap first. What is BPF, or eBPF? I'm going to use these two terms interchangeably today: eBPF equals BPF. Basically, it's an in-kernel virtual machine which allows you to run your own programs inside the kernel with kernel privileges. Those programs are written using so-called BPF instructions, a special instruction set for this purpose. One of the most interesting parts is a component called the BPF verifier, which checks every program you try to load into the BPF subsystem in the kernel. It checks that the program you're about to run is safe to run inside the kernel.
So it doesn't crash the kernel, obviously, and it doesn't hang the kernel: the program is guaranteed to terminate, and so on. The verifier is quite a strict component, and we'll be talking about it a lot today, which is why I'm mentioning it. Once you have your BPF program loaded into the kernel, you can attach it to various events, such as kprobes, sockets for network filtering, cgroups, and so on; there are many kinds of events you can attach BPF programs to these days.

Okay, we're going to split the talk into several parts. In the first part we'd like to introduce some new features that you, as BPF program developers, have available to write or enhance your BPF programs.

First of all, one of the most interesting things that happened in BPF in the past years are so-called BPF kernel functions, or kfuncs. If you've ever written a BPF program, you probably know that it's not easy to call kernel functions from BPF programs. You can't just call anything, because obviously that would be too dangerous. The first way to approach this was so-called BPF helpers, where the kernel contains a list of functions which are allowed to be called from BPF programs. Helpers were quite difficult to add, so the list didn't grow much over time, but it turned out there was a real need for new functions callable from BPF programs. That's why the concept of BPF kernel functions was created; they are much easier to create and add. You basically just annotate the function inside the kernel with a special annotation and register it for the correct program type, and you can then call the function from your BPF programs. One of the nice things about kfuncs is that they allow the verifier to perform additional checks, so it can verify that the usage of these functions inside your BPF program is safe. This is done through, again, further annotations.
I will be showing an example of those annotations in a moment, but maybe an example would be great right now. We have one, written by Artem here, where he added a new kfunc for calling the crash_kexec function. It was as simple as this: he added the function to a list of functions, and that list is registered as allowed to be called from a certain type of BPF program. When you're writing your BPF program, you just declare this function as extern, and then you can easily call it from your BPF program, which means, in this case, that you can crash the kernel from your BPF program. Sounds interesting, but is that in any way useful? Quite, yes, because you can crash the kernel in a controlled way. You can let your BPF program detect some situation that can happen in production, then crash the kernel, which gives you a crash dump, and you can analyze that crash dump afterwards and maybe find a problem you wouldn't be able to reproduce otherwise. So that's one of the use cases for kfuncs.

Another concept that appeared in BPF recently are so-called referenced pointers. Working with pointers from BPF programs is complicated, because the verifier has to check every dereference, since you can only access memory that is actually available to you, and there have historically been a lot of problems with accessing memory from BPF programs. One way to approach this are referenced pointers, which are implemented using kfuncs (that's why I started with them). They allow kfuncs, that is, kernel functions called by the BPF program, to return a pointer which you can then dereference and use to access memory. There are two kinds of these functions.
One kind is annotated with the acquire annotation, which says that the kfunc returns a referenced pointer, and the other is tagged with the release annotation, which says that the kfunc releases a referenced pointer. These work roughly like reference counting in higher-level languages: there is a component counting references to the pointers, so you can be sure that the pointed-to memory is not freed during the time you hold the acquired pointer. Without this, every access to memory from a BPF program had to go through a special helper call, which registers an exception handler because the memory could have been freed in the meantime. This mechanism allows much easier and more straightforward access to memory: by actually holding a reference to the pointed-to memory, you prevent it from being freed while you hold the reference. And as I said, the verifier checks that this reference is always valid.

But this created a new question: what if you want to acquire a pointer and then use it from a different BPF program? Is that even possible? Luckily it is, through yet another concept, so-called long-lived kernel pointers. These are a new kind of BPF pointers with several properties: they must be strongly typed, they are returned by kfuncs or helpers, and they may be stored inside BPF maps, which is the most important feature. You can acquire a pointer, store it in a map, and another BPF program can pick it up from the map, dereference it, and use that memory, and you can still be sure the memory is not freed in the meantime. There are actually two kinds of long-lived pointers: the referenced ones, which are the more interesting, and the unreferenced ones, which can still only be accessed with bpf_probe_read.
Unreferenced ones are basically plain pointers without reference counting which you can just store into maps, but you still have to use that special bpf_probe_read call to access the memory. Referenced pointers, however, can also be kept inside maps, can be safely dereferenced without bpf_probe_read, and, the nice part, are automatically destroyed once the map is freed. So there's a sort of automatic garbage collection: you acquire a pointer, pass it to a different BPF program through a map, that program works with it, and if it never releases it, it doesn't matter, because once the map holding the pointer is freed, the pointer is automatically destroyed.

Okay, as I said, there will be a lot of technicalities and the talk is quite packed, so sorry if I'm going too fast. Anyway, another thing that has always been troublesome in BPF is iteration. As I said at the beginning, one of the things the BPF verifier checks is that your BPF programs do not hang; they must always terminate. This is a problem, and it's been approached in different ways throughout history. At first, we simply said there can be no loops in BPF programs, which works and is efficient, but as you can imagine, it is far too constraining. The second approach was to let the compiler unroll loops, which works again, but it's quite impractical. Then the BPF subsystem allowed bounded loops, that is, loops with a known number of iterations; but those were still quite difficult to verify, because the verifier had to walk every program path through the loop and check that the number of iterations is indeed fixed.
One of the most recent additions is the bpf_loop helper, which resolves this problem. You pass it a number of iterations and a callback function, and it executes the callback that many times for you. This is very easy to verify, because the looping is not inside the BPF program itself; it is handled by the BPF subsystem. So if the verifier can prove that the callback function terminates, then your loop automatically terminates too. It's an elegant and much simpler to verify way of writing loops in today's BPF programs.

Another problematic case is that sometimes you want to iterate over things that don't have a known number of items, but that you know are finite. For instance, you want to execute some BPF program for every process running in your system: you don't know how many processes there are, but you know the number is finite, so the iteration will terminate. To address this there is the concept of BPF iterators, which have been around for a while: special programs that are not attached to events, but are executed for each kernel object of a certain type, for instance for each task, for each file, for each VMA, and so on. What is new about BPF iterators are so-called generic iterators, which let you add new iterable objects very easily, using a concept very similar to iterators in, for instance, C++. You specify four things: a structure that holds the iterator state, and then three functions, one for creating the iterator, one for getting the next item, and one for destroying the iterator once the iteration is done. Just by registering these four pieces, you can create a new iterable object inside the kernel. Quite an elegant thing.
The last thing in my part is the multi-link attachment feature that appeared in the past year or two. One of the problems with BPF programs is that they can take a long time to attach, especially if you're attaching one BPF program to many events, such as all the syscalls. There are 300-something syscalls on a standard machine, and if you want one BPF program to run every time any syscall is hit, attaching can take a while. So there is a new link type, BPF_TRACE_KPROBE_MULTI. It is implemented using fprobes, which are a bit similar to kprobes, if you know kprobes in the kernel, but built on top of ftrace. The difference from kprobes is that fprobes are available only for function entries and exits, so you cannot attach to arbitrary instructions, but they allow you to attach to many functions very quickly. I have an example here where I'm using the bpftrace tool to attach to all the syscalls, a bit over 400 probes in total. While it took around 15 seconds to attach with the old kprobes, with these new multi probes it takes less than a second, so it's a major speed-up for tools and programs that attach to many probes.

Okay, that's everything from my part. I have to pass the mic, so we can take one quick question while we do it. Does it work? Yeah.

Okay, so now let's talk a bit about BPF inner workings. First, I'll talk about memory management, where there were a couple of interesting developments. The first is the BPF-specific memory allocator, and the allocated objects and linked lists that it enables; the other one is the BPF prog pack allocator. The BPF-specific memory allocator was introduced by Alexei Starovoitov and is used for dynamic allocation of memory in BPF programs.
Obviously, there is already a number of memory allocators in the kernel, but none suited BPF programs well, and that is because memory allocation in the kernel depends heavily on the context it runs in, and BPF programs (especially the tracing ones) can run in any context, including NMIs. This is a common problem with memory allocation, and there are common ways to deal with it. One is known as memory pools: the idea is that you pre-allocate some memory in a non-restrictive context, and then use it when the time comes. The BPF memory allocator does exactly that. It creates, optionally per-CPU, caches of objects of predefined sizes, and manages them through irq_work, which is a more relaxed context. One of the issues with memory pools is over-extending memory, that is, using more memory than is needed at the time, so Alexei tried to remedy this with high and low watermarks that keep the number of cached objects low.

The interface is pretty simple. You get basically two pairs of functions, one pair to work with variable-size objects and another for objects of a predefined size. As you might have seen, these take a reference to a struct bpf_mem_alloc, which you need to initialize in advance and destroy when you're done; there you define whether you want a per-CPU cache or not, and the size of the objects. There are some real-life applications already. The patch set that introduced this allocator also switched the dynamic hash map implementation to use it, and it claims dynamic allocations are now 10 times faster. It also allows sleepable and tracing BPF programs to use dynamically allocated hash maps, which wasn't possible before. A bit later, another patch set switched BPF local storage to this new allocator, and that fixed a deadlock problem that occurred when BPF local storage was used from tracing programs; it was caused by lock contention in kmalloc. And one of the most important things the allocator enabled is allocated objects and linked lists.
Just a couple of months after the BPF allocator was introduced, another patch set was posted that introduced allocated objects and linked lists. The former allow BPF programs to allocate their own objects of a type described in the program's BTF, which basically enables them to build complex data structures flexibly. The latter introduced linked lists with single ownership, which can be put into maps or into these allocated objects, and can hold such objects as elements. As usual with BPF, everything is verifier-checked, so they are supposed to be safe, at least from the verifier's standpoint. Unfortunately, the interfaces for these are currently experimental, so I'm not going to show any, but if you want a head start, you can look, for example, in the BPF selftests directory; there you can also find the bpf_experimental.h header file, which is kind of a staging ground for these in-development APIs.

Next is the BPF prog pack allocator. Unfortunately, due to time constraints, I had to cut this almost completely, but the good news is that it's very well covered by LWN, and if it sounds interesting, I encourage you to go through those articles; they're very well written. The idea of the BPF prog pack allocator is, first, to save some memory, which is achieved by packing multiple BPF programs into a single page. Before this, every single BPF program used a whole memory page; on x86, a memory page is 4 kilobytes, which is way more than a typical BPF program needs. It also tried to improve performance by using huge pages and reducing iTLB pressure, but due to bugs found in this patch set, and also in previous patches that this patch set uncovered, that is not yet the case. As of now, the BPF prog pack allocator is merged and working, but it only packs BPF programs into single normal pages.
This also inspired some work on generic executable memory allocators. The first attempt wasn't that successful: it wasn't available on all architectures, and another important user of executable memory in the kernel, kernel modules, couldn't use it, so it was decided not to merge it. But just a couple of weeks before this talk another attempt was posted, called the jit/text allocator, and it is currently under discussion. There are obviously some concerns, but it looks much better.

Another topic is BPF program signing. This is a long list of small improvements, and it started a long time ago; the first patch set was posted in April 2022, and we're still not there. BPF programs might look similar to kernel modules at first glance: they're both stored on disk as ELF files, they both require relocations, they both require memory allocation, and so on. But there is one important difference. In the case of kernel modules, the kernel does all the work: it understands the structure completely and does everything itself. In the case of BPF, libbpf does a lot, so by the time the code gets into memory, it might be very different from what was on disk, which invalidates the signature. To achieve BPF program signing, the kernel needs to do more, and there were a couple of approaches to this, most of which were discarded, the first two at least. The first was trying to move the whole of libbpf into the kernel, which didn't work because it's big and unwieldy. Then there was an idea to implement a new file format that would be understood by the kernel, which was dropped in favor of a new BPF program type, which was probably easier to get into mainline. So, to understand what the kernel needs to do, we need to understand what libbpf does. Its processing can be split into four main phases, but only the first two are important in terms of program signing.
The open phase is where the object file is parsed and where libbpf learns about programs, maps, external functions, and so on. The second phase is where the code changes actually occur: here libbpf probes kernel features, applies relocations, creates maps, and so on. All of this needs to be done by the kernel instead.

One of the places where code changes occur is map creation. Before this change, BPF programs accessed maps through map file descriptors, but those can only be determined once the maps are created. So instead of referencing them directly, another abstraction was added: file descriptor arrays. BPF programs now reference indexes into these arrays, and the arrays are populated during program load.

Another introduction is loader programs. You might know that everything BPF-related in the kernel is done through the single bpf() syscall; the first argument is a command ID, and there are some 32 of them. Everything, from map creation to program loading, attaching, and map lookups, is done through this syscall, and the net result of running libbpf is a sequence of these syscalls. The idea was to write those down and replay them when the time comes. To do that, a new BPF program type was introduced, which can only call the bpf() syscall plus close(), and can only be run from user context. But it's still a BPF program, and we still need to load it somehow. That's where light skeletons come in. The normal workflow with libbpf is that you write some BPF programs in a C file, compile it, and get an object file. That object file then gets parsed by bpftool, and you get a skeleton header file. The skeleton header contains the whole original ELF object file, but it also contains structures describing the maps, programs, and so on, plus a number of functions to work with those: to load, attach, and so on. You then take that header file, include it in your user-space program, and work with everything through it.
A light skeleton is different in that it does not contain the original object file; instead, it includes the loader program. That also means you don't need the libbpf or libelf headers anymore, so your user-space program no longer depends on those. For your user-space program the change is almost invisible, because the two are largely interchangeable. The only thing is that you cannot use libbpf functions anymore, and in most cases that means you need to change the way you access map and program file descriptors: where you used to get them through libbpf functions, you now get them through the skeleton structure.

All of this was just one single patch set, but it still missed a couple of important things. One is CO-RE support, which matters because it allows for much greater BPF program portability. Here, the source file implementing CO-RE relocations was changed so that it could be compiled for both the kernel and user space. This allowed more BPF programs to go through a light skeleton, and allowed the libbpf dependency to be removed from the BPF preload user-mode driver (BPF preload is a set of BPF programs bundled with the kernel). It also enables languages such as Go to take full advantage of CO-RE; before, they couldn't, because they can't adopt libbpf for whatever reason. The last piece is the light skeleton in the kernel: instead of including a light skeleton in a user-space program, you can now include it in the kernel itself, which allowed the user-mode driver to be dropped from BPF preload completely. That means we now have BPF code in a kernel module that can be signed, so we get signed BPF code. Well, only kind of: because it's a module, it lacks portability, so we're still some way from true BPF signing, but I hope that next time we talk about this, it will be there. That's it from my part; any questions?

The question was: what is the best source of information about BPF? Kernel documentation, I think; actually one of the important parts that we missed.
During this last year, a lot of commits were devoted to documentation, so the documentation directory in the kernel is very good now. There are also books on this by the main authors. And the latest and greatest is always the kernel source tree itself, and the selftests directory.

Can you repeat the question? So the question is: what's the difference between kprobes and kfuncs? You're speaking of the bpftrace tool specifically, right? The difference is which kernel mechanism they use to attach to functions. Kprobes use the kprobes mechanism, which has been in the kernel for a long while. The kfuncs in bpftrace (unfortunately named the same as the kfuncs in the kernel, though they are different things) map to the fentry/fexit probe types, or program types, in BPF, which are special probes that let you attach to the beginning or the end of a function. They are much faster than kprobes, and they have some further advantages, such as typed access to function arguments, because they leverage the BPF type information, BTF.

The next question was that the bpf_loop helper introduces an indirect call, which is generally viewed as not a very good thing in the kernel because it can introduce problems, and whether BPF is doing something to mitigate those, specifically from the performance perspective. To be honest, I don't know; I don't have that much insight into this. I would expect that some mitigations are employed, but I don't know of any particular one, sorry.

The question is whether there are any limits to stack depth in BPF programs. There are. One thing is that the BPF stack is quite limited, 512 bytes in total, and while I don't know the exact number, there is also a limit on the call depth you can have; it's checked by the verifier. The limits, in general, are quite strict in the verifier.
Often stricter than developers would want them to be.

So the question is: since BPF acts sort of like a virtual machine inside the kernel, what is the overhead of executing BPF programs, and specifically for tracing, what is the difference between using BPF programs and the kernel's own tracing? It's called a VM, but the programs are just-in-time compiled, and the translation from BPF instructions into native instructions is usually one to one, I would say. So the work is basically a lookup in a table that tells you which native instruction to convert each BPF instruction to for the specific architecture, and the runtime overhead is very, very close to zero.

Okay, and we're out of time. So thank you for your attention, thank you for coming. If you have any other questions, feel free to grab us in the corridors and discuss more. Thank you.