Welcome everyone. My name is Alexei Starovoitov. I work for Facebook. Together with Daniel Borkmann from Covalent, I maintain the BPF subsystem in the Linux kernel. By a show of hands, how many people here had never heard of BPF until today? Okay, this presentation is for you as well. No, seriously, this is for everyone. I will go quickly from the past, how it began, into the shape BPF takes today, and give a glimpse of where we're going this year and in the future. First of all, it's most important to understand the goals and non-goals for BPF as a whole. BPF is a way to safely and easily modify kernel behavior. Safety was in BPF from the start; it's baked into the architecture, and I think we did a pretty good job keeping the safety in check. There were, of course, a couple of bugs over the years, but I think it actually worked pretty well. The "easily" part, on the other hand, didn't work quite so well. "BPF is hard to use" is the number one complaint we hear all the time, and that's what we're trying to address pretty much non-stop. The non-goals are equally important. When we first started pushing patches into the kernel, four-plus years ago, people were saying: oh, you're trying to do DTrace in the kernel, that's what you're about, you're introducing dynamic tracing into the kernel; and if you're doing that, why don't you use a different virtual machine, like the one DTrace uses, or like ktap did? Then networking folks were saying: oh no, BPF is actually there to replace OVS, or it is there to replace iptables. This constant fight for turf was there from day one, because people misunderstood what BPF is trying to achieve. Yes, you can do dynamic tracing; yes, you can do full kernel introspection with BPF; but those are non-goals for the architecture and for the instruction set as a whole. So let's look back at how we started.
Long ago, when BPF was introduced — the classic one, by Van Jacobson — it was pretty cool back then, but it had only two shapes: you could use it as a truck or as a robot. The truck was filtering packets through the pcap interface, and the robot was seccomp filtering. Both are cool, and seccomp is still heavily used, but there were only those two forms. That was the underlying restriction of the architecture. Classic BPF was designed with a packet filter in mind; it was an instruction set for packet filtering, whereas the extended one came about as a generic infrastructure to extend the kernel. That's the fundamental difference. How we extended it: development began in late 2011. The first version was just an ISA, an instruction set architecture, and a little-known fact is that the first verifier operated on a reduced x86 instruction set. We hacked GCC to reduce the set of instructions its x86 code generation could emit, but that didn't go well, for various reasons. In the end we just said: no, screw that, it's not going to work, since it's x86-only and we need to run on all architectures. So we reused only the encoding from classic BPF, widened the registers, made some other modifications, and it started to look like the classic instruction set, yet it's still vastly different. In the end it looks more similar to x86 than to anything else, with input from all the other architectures taken into account; the x86 ISA and the arm64 ISA were weighted the most in terms of influence on the BPF instruction set, but the quirks of x86 were removed, like the 16-bit and 8-bit subregisters — BPF has only 32-bit ones, to better match arm64 — and certain other things were removed as well. We had a GCC backend; it's still partially alive, and some folks were interested in upstreaming it, but it never happened because of the lack of an assembler.
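To make the instruction-set discussion concrete, here is a sketch of the fixed 8-byte eBPF instruction layout (opcode, two 4-bit register fields, 16-bit offset, 32-bit immediate), encoding the smallest valid program, "r0 = 1; exit". The struct name is illustrative; the layout and opcode values mirror the kernel's `struct bpf_insn` and instruction-set encoding.

```c
#include <assert.h>
#include <stdint.h>

/* The 64-bit BPF instruction format: fixed 8-byte instructions,
 * loosely modeled on x86/arm64 as described above. */
struct bpf_insn_sketch {
    uint8_t opcode;      /* operation, e.g. BPF_ALU64 | BPF_MOV | BPF_K */
    uint8_t dst_reg:4;   /* one of the registers r0..r10 */
    uint8_t src_reg:4;
    int16_t off;         /* jump / load-store offset */
    int32_t imm;         /* immediate operand */
};

/* Encode "r0 = 1; exit" -- the smallest valid program. */
static struct bpf_insn_sketch prog[2] = {
    { .opcode = 0xb7, .dst_reg = 0, .imm = 1 },  /* mov64 r0, 1 */
    { .opcode = 0x95 },                          /* exit        */
};

static int insn_size(void) { return (int)sizeof(struct bpf_insn_sketch); }
```

Every instruction is exactly 8 bytes, which keeps the verifier's and the JIT's job simple compared to variable-length encodings.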
LLVM, on the other hand, had an integrated assembler, so the whole thing was upstreamed into LLVM in 2015 and went through quite a few versions and changes, and the first version of BPF in the kernel landed in 2014. Reflecting back on it, that was four years ago, and when it happened, when BPF was unleashed on the world, this is how it looked: lots of new things, a new toy, you can probably do cool things with it, but there is no manual. People started creating different things, and lots of real rocket ships were built. At Facebook, we've built Katran, Droplet, and a bunch of other things; I believe so far only Katran is open sourced. Outside, in public repositories, there are the tracing tools: a bunch of bcc tools, bpftrace (which is still in very active development), SystemTap's BPF backend, and many others. On the networking side, Cilium is probably the biggest open source project that leverages BPF, and there are many more. Because of this grassroots type of approach, all of the ships that were built kind of look the same. Once people figured out from, say, Katran how to use XDP in an efficient way, everyone else started to copy it, and all the solutions, like Droplet, are somewhat similar to Katran. If you haven't heard of Katran: this is Facebook's production load balancer. Its key advantage versus kernel-bypass solutions is that it leverages the kernel stack at the same time. On the same host — this is the picture on the right — we run the XDP-based packet processor that operates at line speed, and on that same host, using the standard Linux networking stack, backend applications that do all sorts of other stuff. All cores are equally loaded, and this is not something you can do when you completely bypass the stack, unless you start doing all sorts of extra hops between kernel and user space, back and forth.
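Katran's public documentation describes a Maglev-style consistent-hashing table for spreading flows across backends. Here is a heavily simplified user-space sketch of that idea — toy hash function, tiny table, hypothetical names — not Katran's actual code:

```c
#include <assert.h>

#define TABLE_SIZE 13        /* prime; real tables are far larger */
#define MAX_BACKENDS 8

/* Toy string hash; a stand-in for the real hash functions. */
static unsigned fnv(const char *s, unsigned seed) {
    unsigned v = 2166136261u ^ seed;
    while (*s) v = (v ^ (unsigned char)*s++) * 16777619u;
    return v;
}

/* Build a Maglev-style lookup table: each backend repeatedly claims
 * the next free slot of its own permutation until every slot is owned. */
static void maglev_build(const char *backends[], int n, int table[]) {
    unsigned next[MAX_BACKENDS] = {0};
    int filled = 0;
    for (int i = 0; i < TABLE_SIZE; i++)
        table[i] = -1;
    while (filled < TABLE_SIZE) {
        for (int i = 0; i < n && filled < TABLE_SIZE; i++) {
            unsigned offset = fnv(backends[i], 1) % TABLE_SIZE;
            unsigned skip   = fnv(backends[i], 2) % (TABLE_SIZE - 1) + 1;
            unsigned slot   = (offset + next[i] * skip) % TABLE_SIZE;
            while (table[slot] >= 0) {           /* walk this backend's  */
                next[i]++;                       /* permutation to the   */
                slot = (offset + next[i] * skip) % TABLE_SIZE;  /* next free slot */
            }
            table[slot] = i;
            next[i]++;
            filled++;
        }
    }
}

/* A flow hash (e.g. over the 5-tuple) picks a backend in one lookup. */
static int maglev_pick(const int table[], unsigned flow_hash) {
    return table[flow_hash % TABLE_SIZE];
}
```

The point of the scheme is that a single array lookup per packet is cheap enough for XDP line-rate processing, and most flows keep their backend when the backend set changes slightly.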
So, as I was saying before, the main complaint about BPF today is that it's hard to use. So I keep asking myself: if it's hard, why would anyone use it? I think the answer is that to implement the cool ideas people have in mind, a pure user-space solution is not enough; kernel and user space need to work together to implement these great ideas. This boundary between user space and kernel space exists only in people's minds. What BPF made people realize is that we can now blur this boundary, that a solution can live in both kernel space and user space. What was happening before, when people thought the kernel could not accommodate them, is that they would either develop a kernel module or bypass the kernel completely. The bypass solutions: DPDK, obviously, from Intel, and SPDK; then ScyllaDB with its Seastar framework; then there is ODP from Arm; Snabb Switch; VPP from Cisco; Google is doing its own thing; and so on. Why? I think the answer is that the kernel is fundamentally hard to extend. That's why I strongly believe the Linux kernel needs BPF to stay relevant. How do the programs look today? These are all the little helpers that a program has to figure out how to use, glue together, and turn into a nice program we can all use. These programs are still loop-free — loop support is a request that keeps coming to us, and we're going to address it — and lock-free, meaning that no locking is allowed. By design, the safety comes at a cost; we cannot just say "run everything in there." If we had allowed loops from the very beginning, it would be easy to make a mistake and create an endless loop: the kernel would hang, BPF would have no users, and it would be no different from kernel modules. So, as I was saying, the goal is safety first, ease of use later. Ease of use — still working on it; but safety is why there are no loops and why there is no locking. And then there are the hooks we can attach to.
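Because programs must be loop-free, BPF C code has traditionally unrolled bounded loops at compile time (e.g. with clang's `#pragma unroll`). A minimal user-space sketch of the shape such unrolled code takes — fixed steps, each bounds-checked, no backward branch:

```c
#include <assert.h>

/* A bounded "loop" written the loop-free way: a fixed number of
 * unrolled steps, each guarded by a bound check. This is roughly
 * what "#pragma unroll" in BPF C compiles down to. */
#define STEP(i, buf, len, sum)  do { if ((i) < (len)) (sum) += (buf)[(i)]; } while (0)

static int sum_first4(const int *buf, int len) {
    int sum = 0;
    STEP(0, buf, len, sum);
    STEP(1, buf, len, sum);
    STEP(2, buf, len, sum);
    STEP(3, buf, len, sum);
    return sum;   /* no backward branch anywhere */
}
```

The obvious downside is program-size blowup for larger bounds, which is exactly why bounded-loop support is in such demand.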
This is the hierarchy of the different hooks present in the Linux kernel today. Most of the tracing hooks are read-only in terms of what they can do to the kernel. Recently we added the error injection facility, which I think is pretty cool: from the tracing hooks, we can modify the kernel by injecting fake errors to exercise all the error-handling paths. The networking hooks operate at different layers. XDP operates at the lowest level, on raw packet buffers before the kernel stack — in some drivers before an skb even exists — and can do anything with them. One of the use cases for XDP in some companies is what people call the big red button: when bad things are happening and a zero-day vulnerability is found, a BPF filtering program that stops the attack can be deployed across the fleet within hours. The TC layer is one level above; lightweight tunneling is yet another layer above that. The reuseport hook is the newest addition, where we can do load balancing across different sockets in a smart way. The flow dissector hook is another one along similar lines. But the cgroup-based hooks made the biggest impact on the kernel and on BPF — something no one anticipated, especially Daniel when he implemented the first version of them. Now it's the fastest growing set of hooks, I would say. Initially there were just layer 3 hooks for ingress and egress. Then we added the bind, connect, and UDP sendmsg hooks, and the cgroup device-controlling hook. Sockmap is a special one that's not well known, but if you saw the presentation from Thomas Graf earlier today: sockmap is the hook Cilium uses to implement really, really fast socket-level redirect between different applications at layer 7. And another set is TCP-BPF; these hooks are somewhat different from the others in that they act on events inside the TCP stack at run time.
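The "big red button" use case above is essentially a per-packet verdict function. Here is a user-space simulation of that decision — not a real XDP program (which would use `struct xdp_md` and kernel headers); the verdict values match the kernel's `enum xdp_action`, while the blocked-address logic is invented for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Verdict values as in the kernel's enum xdp_action. */
enum { XDP_DROP = 1, XDP_PASS = 2 };

/* "Big red button": drop everything from one attacker source address.
 * A real XDP program would see the same bytes via struct xdp_md;
 * every access is bounds-checked, exactly as the verifier demands. */
static int filter(const uint8_t *pkt, int len, uint32_t blocked_saddr)
{
    if (len < 14 + 20)                    /* Ethernet + minimal IPv4 */
        return XDP_PASS;
    if (!(pkt[12] == 0x08 && pkt[13] == 0x00))
        return XDP_PASS;                  /* not IPv4 */
    uint32_t saddr;
    memcpy(&saddr, pkt + 14 + 12, 4);     /* IPv4 source address */
    return saddr == blocked_saddr ? XDP_DROP : XDP_PASS;
}
```

Because the drop decision runs before the stack allocates anything for the packet, a filter like this can shed attack traffic at line rate.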
The main purpose of these TCP-BPF hooks is to fine-tune TCP congestion control for cases the congestion algorithm itself cannot express. For example, in a data center you would observe a different mean RTT depending on whether the destination server is in the same data center or a different one. In plain TCP you cannot really express that. So if you're a big company like Google, you can carry proprietary patches that understand how your IP addresses are allocated and use that; or you can use TCP-BPF to give the kernel this extra knowledge and fine-tune the TCP stack for the most optimal throughput. These are the details — I'll skip all of this. The helpers are not that interesting; there are a lot of them, and more keep being added. Yeah, just a lot. The thing that is probably still the biggest obstacle, and that some BPF users hate the most, is the verifier. What they say is that to write a program, they need to constantly fight the verifier to get the program accepted. Unfortunately, that's exactly the case today. But something happened recently whose significance we only realized later. About six months ago, we introduced BPF-to-BPF calls: within one program we can have different sub-functions, the JIT compiles these functions as independent kernel functions with arbitrary arguments and arbitrary return values, and we were able to teach the verifier to understand this call graph of functions with arbitrary pointers being passed between them. Just imagine: before, we had programs with a fixed context — an skb pointer — returning a single integer, and that was the scope of verification. From that, we went to an arbitrary graph of functions with arbitrary pointers passed between them. When we started implementing the verifier, this seemed impossible, a non-tractable problem. But we made it, and we still believe it is safe. So this changed the developers' minds.
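To illustrate the kind of policy TCP-BPF enables, here is a hypothetical user-space sketch of a "same data center vs. WAN" decision. The /16-prefix convention and the congestion-window values are invented for illustration; they are not from the talk and not a real deployment's numbers:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical TCP-BPF-style policy: if the peer shares our /16
 * (taken here, by toy convention, to mean "same data center"), assume
 * a low RTT and pick a more aggressive initial congestion window;
 * otherwise stay with a conservative default. */
static int initial_cwnd(uint32_t local_ip, uint32_t peer_ip)
{
    uint32_t mask = 0xffff0000u;   /* /16 on host-order addresses */
    if ((local_ip & mask) == (peer_ip & mask))
        return 40;                 /* same DC: low RTT expected */
    return 10;                     /* cross-DC / WAN default */
}
```

A real TCP-BPF program would attach to socket events and apply such a decision through the TCP-BPF helper interface; the point is that the address-layout knowledge lives in the program, not in kernel patches.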
So here's what we're doing right now. Joe from Covalent is adding pointer-tracking primitives to the verifier, so we'll be able to do things that were unthinkable before. The first use case is a helper that returns a socket: from both XDP and the TC layer of the networking stack, we can look up a socket that is controlled by the larger networking stack and return it to the program, and the verifier will make sure that the program releases it back to the kernel — because of reference counting — and that the pointer does not leak or get accidentally returned, and so on. So it analyzes both the control flow and the data flow of the program, tracking all the pointers and making sure the program is valid from this data-flow perspective. Some standalone static analysis tools do this as well. Most compilers do not, because they don't care; it doesn't matter from an optimization or performance standpoint. Static tools do — but doing this inside the kernel, that's pretty revolutionary. It's very close to landing; I hoped it would land today, but it should land a few days from now — the patches are practically ready. What it will allow us to do is introduce malloc-like functionality — allocating and freeing objects from a BPF program — while making sure we don't leak memory. Just imagine how cool it is to have safe memory allocation. And locking: before, we couldn't do spin locks; every program runs in one non-preemptible section, one RCU section, with preemption disabled and the RCU lock taken outside the program, because we couldn't track any of this. Now, with these smart bits being added to the verifier, we can shrink the RCU section, do RCU inside the program, do spin locks, and whatever else. So this is huge — but there's more coming. Loops are the hardest to do; technically, the halting problem is undecidable.
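The acquire/release discipline described here can be modeled in user space. In this sketch, `sk_lookup` and `sk_release` are illustrative stand-ins for the socket-lookup helper pair, and a counter plays the role of the verifier's reference-state bookkeeping:

```c
#include <assert.h>

/* The invariant the verifier enforces for reference-tracked pointers:
 * every acquire (socket lookup) is matched by a release on every
 * program path. Simulated here with a live-reference counter. */
static int live_refs;

static int *sk_lookup(int *slot) { live_refs++; return slot; }
static void sk_release(int *sk)  { (void)sk; live_refs--; }

/* A well-formed "program": both control-flow paths release the
 * reference before returning, so no socket can leak. */
static int prog(int key)
{
    int slot = 0;
    int *sk = sk_lookup(&slot);
    if (key & 1) {
        sk_release(sk);
        return 1;
    }
    sk_release(sk);
    return 0;
}
```

A program that returned on one path without the release would be rejected at load time — the check is static, so the leak can never happen at run time.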
We're going to solve loops too — with conditions, of course. There are two proposals, from the Covalent and Solarflare folks. We've been baking them for about a year now, trying to decide which one will work long-term. Both have their pluses and minuses, and neither is in shape yet to merge upstream, but the work is continuing and everyone is pushing hard. So hopefully next year, at the next All Systems Go!, we'll come and say: look, you can do loops now. But that's not all. Well, we discussed the indirect calls — basically the tail call stuff. Anyway, there is so much going on right now; this is just a short list of the features that different companies and different people are working on, all of them, I think, equally cool, with different use cases for each. This is all coming. Today, we have this crutch inside the verifier that only allows analyzing 128k instructions, and a program is limited to 4,096 instructions. Why was that the case? We couldn't do anything better. We picked 4k because that was the limit for classic BPF, and 128k because we increased the analysis limit a few times. That's how it was. Now we're getting bold, and the next target is to get to one million instructions. Then programs will no longer be small lock-free, loop-free programs, but something big enough that real-world algorithms will be expressible in them. I do mean it; that will happen. But that's not all — still more to come. Another pain point is ease of use, as I said. BPF gets its performance from the JIT. Folks who have tried to debug Java or Node.js know that performance analysis for JITed languages is a pain in the butt. In the kernel, with BPF, it's even harder, because the JIT is done by the kernel, not by user space. If it were done in user space, user space could tell the profiling tools the association between the JITed code and the original source; in the kernel, this association is lost.
What the kernel runs has pretty much nothing to do with the code that was originally written. So we need to bridge this gap and keep the association all the way from the source code, through the instructions, through all the optimizations the kernel does and everything the JIT does — tie it all together and expose it back, so we can finally do proper performance analysis and improve debuggability. So type information is coming, plus source line information, function prototypes, and many other things. The type information we're doing through what we call BTF; it stands for BPF Type Format, and pieces of it have already landed. The most interesting bit will be when we start converting vmlinux and embedding BTF type information inside vmlinux itself. It will add an extra build step. Currently, DWARF for vmlinux is about 300 megabytes; the BTF equivalent is currently 10, so that's a 30x reduction, but still too big. Our target is to get it down to just one megabyte, but that compression currently takes five minutes, which is not acceptable. Once we get the runtime down to seconds and the size to one megabyte, we'll get it upstream. Cgroup local storage is another feature — we're stealing ideas from user space, obviously. User space has had thread-local storage for years; why not do the same in the kernel? Our first local storage is the cgroup local storage that Roman implemented, in two flavors: a regular one and a per-CPU one. It does the same thing as in user space: it avoids extra lookups and adds performance. And in the kernel, we obviously care about zero overhead. And with this, I'm out of time. To recap what we've talked about: BPF in the future will be safe and will be easy to write. Thank you. Please ask questions.

Yeah, I will. Safety: in some fields, safety means real-time and determinism. And considering also the JIT step, is there some idea of how to evaluate the determinism, the real-time properties, of these programs?

So, the JIT part is done during load.
"JIT" is probably the wrong word, and it always rubs me the wrong way when I say it, because — I'm a compiler engineer by background — JIT means "just in time," and BPF is not actually doing just-in-time compilation: it converts BPF instructions into native code at load time. So it doesn't add to the run-time cost at all. Whatever time it takes to verify and JIT the program is spent in user context, and that time is charged to the process doing the load. So the only cost that can affect real-time behavior is the final execution of the program, and that is currently bounded by the 4,096-instruction program size. If it's used in the wrong place, yes, it will start adding up: if we're constantly doing, I don't know, 20 hash-table lookups for every packet, then yes, networking processing will be slower. That's unavoidable to some degree. We have a few ideas for how to automatically mitigate even such scenarios, but so far there are too many cool ideas and not enough people to implement them. But this is definitely on the list.

Thanks. So with BPF-to-BPF calls, you introduced the possibility of indefinite recursion. Do you aim to detect this by analyzing the call graph and not allowing loops, or do you just limit the stack size?

It's both: checking for recursion, so that calls don't call back into each other, and checking the maximum call stack that each chain of functions takes. The initial stack size limit was 512 bytes, and it's still the same even for multiple functions; but the majority of functions don't need 512 bytes each, so they can keep calling each other as long as the total stays within 512. There are a bunch of these somewhat artificial limits. The nesting depth of calls is currently eight, I think, but that number was just picked arbitrarily — why eight? Just because. Why 512 bytes? Well, the kernel is probably okay if we consume 512 bytes, especially since some kernel functions consume two kilobytes.
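The recursion and stack-depth checks described in this answer can be sketched as static call-graph analysis. This is a simplified model, not the kernel's code: each function is reduced to a frame size and at most one callee, and the limits match the numbers from the talk:

```c
#include <assert.h>

#define MAX_FUNCS   8
#define MAX_STACK   512   /* total bytes across nested calls */
#define MAX_NESTING 8     /* maximum call-chain depth */

/* Simplified call graph: stack[i] is function i's frame size,
 * callee[i] is the single function i calls (-1 for a leaf).
 * Walk the chain, rejecting recursion and limit violations --
 * all statically, before the program ever runs. */
static int check(const int stack[], const int callee[], int start)
{
    int depth = 0, total = 0, visited[MAX_FUNCS] = {0};
    for (int f = start; f >= 0; f = callee[f]) {
        if (visited[f])
            return -1;     /* recursion detected: reject */
        visited[f] = 1;
        total += stack[f];
        if (total > MAX_STACK || ++depth > MAX_NESTING)
            return -1;     /* stack or nesting limit exceeded */
    }
    return total;          /* accepted: worst-case stack usage */
}
```

Because the whole analysis happens at load time, a rejected program costs nothing at run time — the only remaining run-time check, as noted below, is the tail-call counter.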
And the static analyzer is able to do all this — or is that a runtime failure, then?

That's static analysis; all of this is done statically by the verifier. The only runtime check we currently have is the tail call limit. Tail calls actually can be recursive, and they are limited to 32.

For all of these new things that you mentioned — am I unmuted? I'm speaking into the microphone. Okay, cool — what's the best way to keep up with all the cool stuff that's happening in the BPF world? Other than bugging you, maybe?

Obviously, attend All Systems Go! But seriously, it's an excellent question. Right now we're spinning up a website at Facebook, and there will be a blog there, so we'll try to keep posting all the latest news. And following the networking mailing list, I guess. There's also the IO Visor project: we have bi-weekly calls where we discuss all of this stuff, and they're public — anyone can join. It's pretty much the BPF call; we still call it IO Visor. It's not recorded, though.

The light is off. Do you see a case for a set of common BPF programs that live, perhaps, in the kernel source tree or something very close to it, as opposed to being passed around on GitHub from project to project as we discover bugs in them?

Yes — that's what I very briefly skipped over here: libraries. That's exactly what it is. We want to push some of this stuff so that it just lives in the kernel, and common primitives will simply be there.

One question: the Solaris tracing stuff, DTrace, has its own language. BPF is currently only a machine-code kind of thing, and now you have this scheme where you can use LLVM to generate it; in that case, you would mostly write the programs in something that resembles C. But do you have any idea where you want to go with this?
Do you expect that there will be a somewhat accepted standard language for writing BPF programs later — is it C, or not C? Do you want to build a particular toolset, or do you expect this to be up to third parties who want to write their own tools with their own languages? What's the plan?

Excellent — I love this question; there is so much in it. The philosophy of BPF is to be the construction set and let everyone build on top, including languages. I strongly pushed C from the very beginning because of ease of use: everyone knows C, so let's make C work as well as it possibly can. At the same time, we're trying to enable everyone else as well. bpftrace is a project with a pretty much DTrace-like language — DTrace-minus-minus, a DTrace-like syntax — that doesn't generate any C; it generates LLVM IR and feeds it into the LLVM backend directly. That's one project. Then there is another project called ply (PLY); it also has a DTrace-like language, but it generates BPF instructions directly, without any LLVM, so it's best suited for embedded environments. Then there is another project that takes Python, looks at Python's internal representation, and generates BPF code from that, again without any LLVM. All of this is blooming and growing, and there will be new languages. We keep having discussions that we need to hire a language designer to design a language for BPF; it may happen one day. The way it looks, the C path is definitely the one that's going to be supported upstream, and hence it probably has an edge over the others. Yes, I'm biased — just a little bit.

Okay, I have a question — it's me. So, BPF-to-BPF calls: you said they are now verified by the verifier, which can also do the call graph analysis. What if you feed it a program that generates a massive call graph? If this analysis is done inside the kernel, how are you going to handle that?
You mean the concern that it will somehow take too much time? Yes — so it has a limit; that's the crutch I mentioned, the limit of 128k analyzed instructions. Even if the whole program is only 4,000 instructions long, if the verifier sees that the complexity is exploding, it will hit 128k and give up. It's a coarse way to prevent ugly programs from consuming too much CPU time during verification.

When I tried to use BPF for debugging, one challenge I experienced — probably because I'm not used to it — is that the program compiles, then it gets loaded and the verifier rejects it, and it was really difficult to map what the verifier says back to my code. Are there plans to improve that area?

Yes. The ease-of-use problem is understood; it's not easy today, and yes, people constantly fight the verifier. Today, starting from the code — whether it's C or a DTrace-like language or whatever else — when the verifier says "this instruction loads from uninitialized memory," most of the time users have no idea which part of their source that assembler instruction corresponds to. The type information and source line information we are currently very actively working on is exactly to address that. At first we thought we would just do line number information the way typical user-space applications do; then we decided to go a step further and push pieces of the source itself into the kernel, so the verifier can say: this piece of the original code is misbehaving, look for the bug there.

Okay, one more question, and then we're out of time. Hello — what's the most difficult thing about being a maintainer of BPF?

The most difficult thing is to say no. When so many cool things are happening, the hardest part is to find the balance — to not go too deep into a specific niche in a way that would prevent the core from growing faster.
I could give more specific examples, but that, to me, is the hardest part. Thank you.