Hello everyone! I hope you had a wonderful coffee break on our main stage and you're all refreshed for the next talk, which will be presented by Viktor Malik and will be about Linux tracing made simpler with bpftrace. So let's go, Viktor, please enjoy the talk, and if you have any questions, leave them in the Q&A tab. Okay, thank you. Thank you for the introduction. Let me share the slides. So hopefully you can hear me and you can see the slides. Okay, so yeah, that's great. Thanks for the answers in chat. Okay, so hey everyone. I'm Viktor and I'm a software engineer at Red Hat, where I work mostly on stuff related to the BPF technology in the Linux kernel. One of the tools that I'm particularly working on is called bpftrace, and that's what I want to talk about today. So I will briefly present what bpftrace is and what it allows you to do. And then, since bpftrace was already presented here some two or three years ago, I will concentrate on the new stuff that has been added in recent years, so that you know what's new in this world. Okay, so let's start. What is bpftrace? If I had to define it with one sentence, I'd say that it's a BPF-powered tracing tool for Linux. Now, what does this mean? First of all, bpftrace is a tool which allows you to do live tracing of the Linux kernel and also of user-space binaries running on your Linux system. Live tracing means that you don't have to modify the kernel in any way to do the tracing, and you can trace the kernel that is actually running, without the need to stop it and without the need to attach a debugger to it. You just start bpftrace and it allows you to live-trace whatever is running on your system. One of the great advantages of bpftrace is that it comes with its own high-level tracing language, which allows you to define what you want to trace, how you want to trace it, and so on. And as the name suggests, it leverages the BPF technology in the Linux kernel.
I will briefly speak about the BPF technology in a while, so don't worry if you're not familiar with it. A great advantage of bpftrace is that the tracing is very fast, and I mean that both in terms of writing the program itself, because the language it provides is quite simple and very expressive, and in terms of execution, thanks to the usage of the BPF technology. Some of you may know other tracing tools such as SystemTap, ftrace, or, for instance, the very famous DTrace from Solaris. bpftrace offers very similar functionality to these; however, it has some quite remarkable advantages over these tools. Okay, so let's get into how it works. Before I go into the internals of bpftrace, let's see an example. Let's say you have a running system and you want to collect the number of syscalls each process in your system is doing. Using bpftrace, it's as simple as writing this simple script. I will explain what the parts of this script mean later, so for now let's just leave it as it is. And the output can look like this. It's shortened, but you can see that it prints, for each process on our system, the number of syscalls it did during the time this bpftrace script was running. This was running for about one second, and, for instance, PipeWire made nearly 500 syscalls during that time. Okay, so how does it work? Let's start with what BPF is, because that's the underlying technology. BPF stands for, or used to stand for, extended Berkeley Packet Filter. However, since it's not about packet filtering anymore, or not only about packet filtering anymore, today it's really just called BPF and no one uses the actual expansion of the abbreviation. Also, the "e" is now omitted, because the traditional, old BPF is not used anymore. So what is it? Basically, it's an in-kernel virtual machine which allows you to run custom programs inside a running kernel. Now this may sound scary, and it is.
So there are some restrictions on the programs. First of all, these programs are written using so-called BPF instructions, which are low-level instructions that operate over a set of 11 registers. So you have arithmetic operations, memory operations, and so on. These instructions are then just-in-time compiled to machine instructions once the BPF program is run. But since this runs in a live kernel, which could of course cause many problems, there is a component called the BPF verifier, which makes sure that BPF programs are safe to execute. For instance, it checks that the program cannot hang the kernel, so it has to terminate. It checks that it cannot corrupt or crash the kernel, so it checks whether the memory accesses are valid, whether there are no null pointer dereferences, and so on. So bpftrace builds on BPF, and what can it actually do? bpftrace does basically two main things. First of all, it provides a very high-level, C-like language which allows you to define what you want to trace, and bpftrace takes the script that you've written and translates it into a BPF program, into BPF instructions. This is the second part of our script, highlighted in red, which says that we want to create a global map, which will be indexed by the name of a process, and for each process it will contain the number of syscalls that were called. And the second thing: bpftrace can take your BPF program and attach it to various events in the kernel. For instance, we attach this program to a special place called a tracepoint, which is hit, or executed, every time any syscall is entered. So this way, any time a syscall is entered, our program is executed and it collects the number of syscalls into a global map that is just printed at the end.
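The syscall-counting script from the slides is not reproduced in this transcript, but a sketch of a one-liner matching the description above could look like this (using bpftrace's standard `raw_syscalls:sys_enter` tracepoint and `count()` builtin):

```bpftrace
// Fires on entry of every syscall; @ is a global map keyed by process name.
tracepoint:raw_syscalls:sys_enter
{
  @[comm] = count();
}
```

You would run this with `bpftrace -e '...'` and stop it with Ctrl-C; bpftrace prints the contents of all maps automatically on exit, which is the per-process syscall count shown in the example output.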
As for the available events where you can hook: besides these special kernel tracepoints, which are static trace points built into the Linux kernel, you can attach to basically any function inside the kernel or inside user-space binaries. You can typically attach either to the function entry or the function exit, but there are ways to attach to basically any instruction inside a function. Then there are very interesting hardware and software events, for instance cache misses and so on. For example, we have memory watchpoints, which means that you can attach to a single memory location and your probe fires any time that memory location is accessed for read or write. And we also have some time-based events; for example, you can run your bpftrace script every, I don't know, 10 milliseconds or so. So this is basically what bpftrace can do. Before going into more detail, I would like to show you some examples of simple bpftrace scripts that demonstrate bpftrace's capabilities. First of all, let's say we want to collect the number of bytes each process in our system is reading. For this, we can use this simple bpftrace script. The first line says that we want to attach to the exit of the read syscall. The second line says that we want to filter only those syscalls that return a positive number, which means that they did not fail with some error and actually read some number of bytes. And the third line says that we want to build a global map, again indexed by the name of the current process, which will contain, for each process, the sum of the return values of the traced syscall, which means the sum of the bytes read by the syscall in that particular process. Running this can give us something like this: we can see here, for instance, that systemd, during the time that this script was running, read six kilobytes.
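The three lines just described could be sketched roughly like this (the `sys_exit_read` tracepoint exposes the return value as `args->ret`):

```bpftrace
// Exit of the read() syscall; args->ret is the number of bytes read,
// or a negative error code on failure.
tracepoint:syscalls:sys_exit_read
/args->ret > 0/
{
  // Sum the bytes read, keyed by the current process name.
  @bytes[comm] = sum(args->ret);
}
```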
Another example, attaching to the same event, so to the exit of the read syscall. Now we are filtering by the name of the running process: we are saying that we only want to trace syscalls done by our application, called myapp. And what we want to collect is a histogram of the sizes that were read by this application. We can receive something like this, for instance, which shows that my application is doing quite a lot of one-byte reads. So perhaps that could be a nice place for optimization. This way you can very easily obtain quite useful information, for instance about the syscalls that your application is doing. Now, how does bpftrace work? I don't want to go into much detail, but some of you may be interested in how bpftrace works under the hood, so I have a very simple overview of its architecture. It might seem a bit complicated in this picture, but I will explain it step by step. First of all, we start with a bpftrace program, such as the one that we have already seen. At the beginning, bpftrace parses this program using its built-in parser, and as is traditional with other parsers, for instance in compilers, the result of parsing is an abstract syntax tree. An abstract syntax tree is basically a tree structure which represents the program, and it contains information such as types and so on. After that, there are some passes run on this, such as semantic analysis and so on; I don't have them here. What is more important is that bpftrace then generates so-called LLVM IR out of this abstract syntax tree. What is LLVM IR? LLVM IR is the intermediate representation of the Clang/LLVM compiler. Basically, when Clang is compiling a C program, for instance, it will first compile it into its internal representation called LLVM IR, and then it will generate machine instructions out of it. So bpftrace is generating this for our bpftrace script. Why is it doing that?
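For reference, the read-size histogram example from a moment ago might be sketched like this (myapp is a placeholder for the application name):

```bpftrace
// Only trace read() exits of our application.
tracepoint:syscalls:sys_exit_read
/comm == "myapp"/
{
  // Build a power-of-two histogram of the read sizes.
  @sizes = hist(args->ret);
}
```

The `hist()` builtin is what produces the log2 buckets in the output, which is how the cluster of one-byte reads becomes immediately visible.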
This is because Clang has a BPF backend, meaning that Clang can take your program written, for instance, in LLVM IR and generate BPF instructions for you, so you don't have to write those manually. By the way, this is the standard way of writing BPF programs these days: you write a program either in C or, for instance, in LLVM IR, and you let Clang generate BPF bytecode for you. Next, the BPF bytecode is loaded into the BPF subsystem of the Linux kernel, where it is verified, then just-in-time compiled into machine instructions, and attached to the event that you want to attach it to. For us, that will be the chosen tracepoint, and then every time the tracepoint is hit, our BPF program will be executed inside the kernel. Okay, hopefully this was understandable, and if you have any questions or want to know more, feel free to ask me, or go to the bpftrace pages; there's quite a nice reference guide which explains a lot. Okay, as I promised, I want to give a bit of insight into what is new in bpftrace these days. So let's again start with a simple example. One of the problems that bpftrace used to have is that scripts used to be very complex; even scripts doing some simple stuff could get quite complex. Let's do this: let's write a script which will check whether the kernel is not losing any bytes sent over TCP. How do we do this? We will attach to a kernel function called tcp_sendmsg, which has three parameters. The first one is a socket, which identifies where to send the data; the second one is the message to send; and the third one is the size of the message. It returns an integer, which is the number of bytes that were actually sent. So what we want to do here is compare the size with the value returned from this function and see whether those two are equal.
And to make it more interesting, let's say we also want to print the source IP address from which we are sending the data. We can extract this from the socket: struct sock has this information. It's buried a bit deeper, because struct sock contains a struct sock_common, and that sock_common contains the address itself, but it's there and we can extract it. Traditionally, writing this in bpftrace, you would use so-called kprobes. Kprobes are quite an old mechanism in the Linux kernel which allows you to attach to any place in any function inside the kernel. Unfortunately, a disadvantage of kprobes is that their return variants, kretprobes, so the ones attaching to the exit of a function, don't have access to the function arguments. So what we have to do here is first write an entry kprobe, the first one, where we store the first and the third argument, so the socket and the size, because we will need those at the end of the function. We store them per thread ID so that we can match them up when we are exiting in the same thread. And then the second part, the kretprobe, is the more important one, because it's hooked to the exit of the tcp_sendmsg function. That one will first construct the source IP address. How does it do it? It takes the socket that we stored at the beginning (the @sk[tid] part), then it has to typecast it to the appropriate type, struct sock, which we pull in from a header that you can see at the top. Then we access the appropriate fields to get the actual address, and we pass all of this to a built-in function in bpftrace called ntop, which takes an integer and translates it into a string representing the IPv4 address, for instance. Then we print everything to the user using printf, and finally we delete the entries in our maps so that the next time the function executes, there is no stale data left behind.
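The long kprobe version described above could be sketched like this (the struct layout comes from the kernel headers, so the field names `__sk_common` and `skc_rcv_saddr` are what current kernels use; the map and variable names are illustrative):

```bpftrace
#include <net/sock.h>

// Entry probe: remember the socket and the requested size per thread,
// because the kretprobe has no access to the arguments.
kprobe:tcp_sendmsg
{
  @sk[tid] = arg0;
  @size[tid] = arg2;
}

// Exit probe: compare the requested size with the return value.
kretprobe:tcp_sendmsg
/@sk[tid] != 0/
{
  // Cast the stored pointer back to struct sock and dig out the address.
  $sk = (struct sock *)@sk[tid];
  $saddr = ntop($sk->__sk_common.skc_rcv_saddr);
  if (@size[tid] != retval) {
    printf("%s: asked to send %d bytes, sent %d\n",
           $saddr, @size[tid], retval);
  }
  // Clean up so the next invocation in this thread starts fresh.
  delete(@sk[tid]);
  delete(@size[tid]);
}
```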
Okay, so this is quite a long script, as you can see, for quite a simple thing. Luckily, bpftrace has recently been extended with several new features which allow us to simplify this quite a lot. First of all, instead of using kprobes, we can use so-called BPF trampolines. BPF trampolines are something that has been added to the Linux kernel, and they allow attaching BPF programs to various places, especially to entries and exits of functions. One of the great advantages of BPF trampolines is that their return variants, the ones you attach to the exit of a function, have access to the function arguments. So we can completely omit the first probe, the entry kprobe, we can also remove the delete parts, and we are only left with the return probe. Second, BPF trampolines have access to something called BTF, which stands for BPF Type Format, and it's basically debugging information for BPF. This gives us, first of all, access to function arguments by name, so we can do args->sk to access the socket, and it also gives us access to all the data types that are in there. So we don't have to include the header and we don't have to do the actual typecast; everything can be pulled in automatically. So instead of the original long script, we are left with something as short as this, which is much more readable than it used to be. This was done using two technologies, BPF trampolines and BTF, so let me briefly explain what those are. BPF trampolines, called kfuncs in bpftrace, are the kernel's way of calling BPF programs with practically zero overhead, so they are even more efficient than the original kprobes. And as I already mentioned, their return variants have access to the function arguments, and they have access to arguments by name by leveraging the BTF type information.
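The shortened version using a BPF trampoline probe might then be sketched as:

```bpftrace
// kretfunc has access to both the named arguments and the return value,
// so no entry probe and no per-thread maps are needed.
kretfunc:tcp_sendmsg
/args->size != retval/
{
  // Types and field names are resolved from BTF, so no header
  // include and no explicit cast are required.
  printf("%s: asked to send %d bytes, sent %d\n",
         ntop(args->sk->__sk_common.skc_rcv_saddr),
         args->size, retval);
}
```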
And if there are some kernel developers here, you may know these as the fentry/fexit trampolines from the Linux kernel; this is basically the same thing. Then there is BTF, which I have already mentioned. BTF stands for BPF Type Format, and it's basically a metadata format to represent debugging information. It was originally mainly related to BPF stuff, but these days it contains practically everything that you need to debug a kernel, starting from data types and ending with function prototypes, including names of arguments and so on. It's usually generated from DWARF debugging information. (To interrupt you, you have two minutes left to the Q&A. Okay. Yep. Okay, will do.) So BTF is generated using pahole, and its great advantage is that it's very compact: instead of 200 megabytes of DWARF, it can be represented by only four megabytes of BTF. So what the main distributions do these days is include BTF inside the kernel. And BTF gives us, in bpftrace, access to all kernel types without the need to pull in any kernel headers. Now you may be asking: okay, that's BTF, that's information for the kernel, but what about user-space binaries? Recently, DWARF support has been added to bpftrace, so bpftrace these days can also trace user-space binaries, and it can leverage DWARF if it is included in the binary. So you can, for instance, access uprobe arguments by name (uprobes are a similar thing to kprobes, but for user space), you can do automatic resolution of data types, and there are many, many more features that are either under active development these days or will come in the future. One last feature that I want to mention is iterator probes. Jirka also mentioned them today, some two hours ago, in his talk. Basically, the general idea is that the BPF verifier doesn't allow you to do loops with an unknown number of iterations. However, you can iterate some data safely; for instance, you can iterate the list of tasks, which is always finite.
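The task iteration just mentioned might be sketched like this (`iter:task` and its `ctx->task` context are bpftrace's interface to the kernel's task iterator):

```bpftrace
// Safely walk the kernel's task list and print each task's
// name and PID; the iterator guarantees the loop is finite.
iter:task
{
  printf("%s %d\n", ctx->task->comm, ctx->task->pid);
}
```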
So iterator probes allow you to do exactly that: iterate some kernel collections safely, and write a BPF program that will allow you, for instance, to iterate tasks, as I have here on the picture, and print, for instance, the name and the PID of each task. Okay, this is basically the last slide. I just want to mention two long-term future projects that we are working on in bpftrace. The first one is ahead-of-time compiled BPF programs, which will allow you to write your bpftrace program, compile it into a binary form, and then distribute that. So there will be no dependency on LLVM on the final system, it will, of course, be faster, and it will also, for instance, allow signing of programs, which can be quite nice. The second one is that we want to implement our own custom BPF backend, which will completely replace the dependency on Clang and LLVM and will allow us to generate faster and better-optimized BPF code. Okay, that's it, that's everything from my presentation. Thank you for being here, and should you have any questions, feel free to ask. Thank you, Viktor, for your interesting topic. I see that we have one question so far; it's from Yaroslav: BPF looks awesome, what can it be used for? So, it can be used for practically anything you can think of. If you want to execute some code inside the living kernel, then BPF is probably your way to go. These days it's used, first of all, for tracing, with tools such as bpftrace, but there are other projects such as libbpf-tools and BCC and so on. And the second major area where BPF is used is networking, where it is represented mainly by the XDP project. So if you want to see more about networking, search for XDP. Okay, thank you for the answer. Another question is from Jan: so I don't need any debug info for tracing anymore, right? Or what do I need to install on my system to use it? Exactly.
You don't need the debuginfo package anymore. What you need to install: you just do dnf install bpftrace, and that's it. Everything else will run out of the box. If you have a reasonably recent kernel, you will also have the BTF information built in, so you will have all these cool features that I have just presented. Okay, thank you. And the last question, it seems, is from Zbigniew: the replacement for LLVM/Clang, will it be useful outside of bpftrace, for example to generate systemd BPF filters for units? So, we haven't started working on this feature, it is just being discussed these days, but the general idea is yes, we would like to provide a library for this so it can be used outside of bpftrace. Okay, thank you. And another question popped up from Martin: what's missing in BTF compared to DWARF? So, I'm not sure exactly here. For instance, I know that DWARF has line information about the source lines from the original source; I'm not sure if BTF has that, but I would say that perhaps it already has it, or it will be added in the future. If you're asking why BTF is so much smaller than DWARF, this is because DWARF contains a lot of duplicated data. In BTF it's all deduplicated and represented in a more efficient way; that's the main reason why BTF is so much smaller than DWARF. However, there probably is some missing information, but I'm not really sure what exactly it is. Okay, and another one from Yaroslav: if I want some simple tool for tracing a single program for security, I think using BPF is the way to go, is there any other option? SystemTap is problematic because of the debug info needed. Yep, exactly, BPF is your way to go.
Either bpftrace, if you want to do some fast scripting, or, if you want to do some more complicated tracing, then either write a more complicated bpftrace script, or you can look into other BPF tracing tools such as, for instance, BCC, which is written in Python and is more suitable for large, long-maintenance projects. But yeah, BPF is your way to go. Okay, and the last question is from Zbigniew: will we evolve DWARF to be more like BTF, for example with deduplication? This is not a question for me, this is a question for the DWARF developers. Yeah, I agree on this one, and we're at the end of our time, but Viktor will be in our virtual venue, WorkAdventure. So if you have any questions that were not answered, or you just want to discuss this topic, because it seems it was really interesting and a lot of people are interested in this, just hop into WorkAdventure, there will be a link in the chat, so you can go and talk with Viktor about all of this. Thank you, Viktor, for your talk, and I will see you in the next session.