Good afternoon, everybody. It's 4 p.m., so let me start another talk on kernel tracing with eBPF. My name is Stanislav Kozina. I work for Red Hat. This will be a very gentle introduction to kernel tracing with eBPF. There won't be many details, so those who are interested in learning something more up-to-date, what the latest development upstream is, please stay for the next talk, which will be another BPF talk that goes into a bit more detail. This is really just an introduction to what we have as of today, but very brief.

What is eBPF? It's much easier to say what it is not. It's not just a packet filter. That's a good start. It's pretty interesting that if you talk to people who are using eBPF, we have made the observation that they either talk about networking use cases or about tracing, and there's very little overlap between the two. We are trying to point out that eBPF is a whole ecosystem, which is fairly powerful, and it's not easy to give an elevator pitch about what eBPF is. On one side there's a bunch of networking features: it basically allows you to receive all network flows directly from the network interfaces and do some very quick analysis of these packets, modify their content, drop them, or do anything you want with the network flows before they enter the Linux stack. But it's not just a Linux stack bypass; you can still choose to forward these packets on to the Linux network stack. It also provides a lot of tracing features, because you can hook these eBPF programs to kprobes, tracepoints, trace events, and all of these allow you to get any information out of the kernel you might want.

Basically, you could already achieve all of this using kernel modules, but a lot of people don't like inserting their own kernel modules into their kernel, and sometimes you might need to recompile those modules. eBPF is a safer way to achieve a lot of the things you could otherwise achieve with kernel modules, but without kernel modules. There are a lot of safety mechanisms, so eBPF programs should not be able to harm your kernel. It should be safe. In this talk, I won't be talking about the networking features at all. I'm sorry.

It might be easier to just start with an example. Imagine you have a server which has like 80 different disks, all of them fully busy, and you need to figure out why. What application is causing the IO going down to the disks, so that the disks are just spinning and overloaded? Obviously, it's not necessarily whichever application is doing the most IO, because an application that is, say, copying /dev/zero to /dev/null runs entirely in memory. That copy is causing a lot of IO, but none of it goes down to the disks, so it is not the guilty application. These days, this is fairly easy to answer. On current Linux distributions such as Fedora, you just install the BCC tools and run a tool called biotop, which will very quickly tell you which applications are causing the real IO going down to the disks. If I run another command which just copies my whole physical drive to /dev/null, so I'm reading all of the content of the physical drive, then in biotop I will quickly see that there is a dd doing as much IO as possible. There's a little bit of dm-crypt, because my disk is encrypted, I suppose, and there's biotop itself, which is a little surprising, but it probably is doing some IO too.
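By the way, something in the same spirit can be sketched as a one-liner with bpftrace, a tool I'll get to later in this talk. This is just my own untested sketch, not biotop itself; it assumes the block:block_rq_issue tracepoint is available, and it only counts requests rather than timing them:

    # count block-layer IO requests per process name (run as root)
    bpftrace -e 'tracepoint:block:block_rq_issue { @[comm] = count(); }'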
So what is biotop? It's a Python script, which is cool. We all love Python, and it's perfect, right? Python for kernel programming, that's what all kernel programmers would like to see. Not the case. It's a Python script which contains embedded C code, and that C code is what actually collects all this data from the kernel; the data is just passed up to userland, where you can do some post-processing in Python. That's a great advantage. You have a Python wrapper, or even a Lua wrapper, on top of this information, because in Python it's fairly easy to transform the information further and pass it anywhere you might want.

A typical setup might look like this: you're running a BPF program in your kernel, which gives you some aggregated information about the kernel. You can use your BPF programs to, say, monitor all network packets and just produce statistics about the average packet size or how many drops there are, and you don't need to create any new kernel interfaces. After you collect this aggregated information and pass it to userland, in Python you can just convert it to JSON and upload it to your databases, so you can have some data center analytics drawing all the graphs, and any time there's a sudden blip somewhere on one of thousands of machines, you can immediately see it. You can have the analytics built in Python.

What's done with this C code is that there is another library, which I will talk about in a bit more detail, called BCC, and this library uses the LLVM libraries to get this C code compiled into something we call BPF bytecode. So how does it look overall? On the left side, you have your userland application, with your C code inside it. That code is passed to the LLVM compiler, which compiles it into BPF bytecode. The bytecode is then passed from the userland application through the bpf() syscall to the kernel. There a verifier runs, which makes sure that your bytecode is safe and that it will always terminate. It does a lot of checks, and some of them I don't understand that much; for example, it checks for unreachable instructions, which are not harmful, right? But anyway, it checks for unreachable instructions. So it makes sure that your code is nice and safe. Then the code is executed by the BPF runtime. Usually it's even jitted before it's executed: there's a very simple JIT compiler which translates your BPF bytecode into real instructions of your architecture, and those instructions are executed.

These BPF programs can be attached to various points, some of them networking points, such as XDP hooks and sockets. But in our case, for tracing, they can also be attached to kprobes, tracepoints, and perf events, so you can collect information from these events, and as part of your BPF program you can do the analysis of this information. Often the raw stream is just too much to pass out of the kernel, so you can quickly do some aggregation right there, and only the result of the aggregation is passed back to user space. You have two recommended ways to pass the information out. One is through perf event buffers, which are per-CPU, very quick buffers that can be used to pass anything you want. Or there are special constructions called eBPF maps, which are basically like Python dictionaries: you can map anything you want to anything you want, and they can be accessed from both kernel and user space, so you can also use these to pass information. But I suppose it would be slower than perf event buffers, because a perf event buffer is just mapped memory per CPU.
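To make the maps idea concrete, here is a minimal sketch in bpftrace, which I'll introduce later; this is my own example, not from biotop. The @bytes map is filled in kernel context on every return from vfs_read, and bpftrace prints it to userland when the script exits; retval is the function's return value, the number of bytes read:

    # sum bytes read per process name, skipping error returns
    bpftrace -e 'kretprobe:vfs_read /retval > 0/ { @bytes[comm] = sum(retval); }'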
So, how does biotop really work? What does the C code used as the BPF program need to do? It hooks to several places in the kernel, to some block-layer functions, using kprobes. One of these is blk_account_io_start. I don't know the block layer that well, but I trust that they know what they are doing. So presumably this function is used to track any time an application issues some IO operation, like a read or a write. Then there's another function, blk_mq_start_request. The application issues the request, and the request is queued until the disk has time to actually handle the IO; blk_mq_start_request is when the disk actually starts the request. And then blk_account_io_completion is the notification that the request was finished.

So what the BPF program needs to do is this: any time blk_account_io_start is called, some application is doing IO, so we need to note down which application it is. Then in blk_mq_start_request, we timestamp when the IO actually starts. And then when blk_account_io_completion is called, we take another timestamp, measure how long it took, and account that time to the application. We just sum it all together, then sort, and that's how we build the list of applications causing the most IO operations on the disks.

As I mentioned, the whole biotop tool is done using the BPF Compiler Collection, BCC, which is the library simplifying the work with LLVM. It also provides a couple of other things: it loads your program into the kernel, it reads the output from the kernel, and it provides the Python binding, which simplifies how you read the data, how you print it, and so on. It's good to note that currently, any time you run the biotop tool, the C code is compiled; the Python code is not compiled, but the C code is compiled on execution. This works around problems like: what if I'm running my code on a different kernel version than the one it was developed on? You just try to compile, and if the kernel were too different from the one you wrote the program for, very likely the BPF program would refuse to load, because the verifier might catch that you are accessing offsets which do not exist in this kernel, for example.

As part of the BCC library, there's also a huge set of tools such as biotop (there are like a hundred more), and these serve as pretty good examples of what you can do with BCC, but they are also fairly useful on their own. Biotop is pretty unique; I can't think of another easy way to achieve what it does. You could probably do the same with a kernel module, SystemTap, or other kernel tracing features, but with biotop there's already a finished tool which you just run. You don't need to configure anything, and it gives you the data.
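By the way, if, like me, you don't know the block layer well, you can at least discover which functions are available to hook. With bpftrace, which I'll show in a minute, listing probes is a one-liner; a hypothetical example:

    # list kprobe-attachable kernel functions starting with blk_
    bpftrace -l 'kprobe:blk_*'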
I already mentioned SystemTap. SystemTap is a traditional kernel tracing tool which works like this: you write what you want to trace in a special higher-level SystemTap language. Traditionally, this is compiled into a kernel module which is pushed into the kernel, and the kernel module uses kprobes to collect the information and do the analytics. But you can basically do the same with BPF, so there is a new runtime for SystemTap since version 3.2, which supports BPF. The idea is that you run SystemTap as you normally do, but you just pass one more option, the BPF runtime, and that tells SystemTap that instead of building a kernel module, it should build BPF code and try to run that. This is fairly new; I think it was finished about half a year ago. It works in concept, but many SystemTap features are not yet supported with the BPF runtime. I was trying it yesterday, and what I was missing probably the most was a bunch of functions you have in SystemTap, such as: give me the current thread ID, give me the current CPU ID, give me the current name of the process, and so on. They just fail to compile with the BPF runtime.

And there are other options you can use. There's bpftrace, which I would like to show in the remaining minute. Bpftrace is another high-level tracing language where you define what you want to trace. It doesn't need debuginfo, because when you compile on execution, you can just pull in the header files; you compile the header files, and that's how you build the type information you would need for your program. This is LLVM again; it also uses BCC to simplify what it's doing, and it actually uses Clang to compile the C header files. So you can just include your Linux kernel header files to get the structures you want to work with, and then you can access them from bpftrace. It has access to all the kprobes, uprobes, tracepoints, and profiling functions, and it also has the aggregations, so if you want to aggregate the data in the kernel before you pass it to userland, you just call a few of these functions, and that's it. It sees about 50,000 kprobes, which means that you can hook most kernel functions.

Here is a very quick example of what bpftrace code looks like. There's one kprobe defined, on sys_read, which is the syscall handler for read. In it I save nsecs, a builtin variable holding the current timestamp, and I store that in an associative array indexed by the current thread ID. Then I have the same probe, but as a return probe, so when this function terminates, I do basically the same: I get the new timestamp, subtract the start timestamp, push the difference into a histogram, and then delete the start timestamp. This is fairly simple; it should take you about a minute to write code like this. And the result is, let me show you: when you do some IO, it draws a very nice histogram of how long it took, on average, to handle the read syscall. You can see that there have been, I don't know, 25,000 or 30,000 syscalls, and on average the reads landed in the bucket below 1K.
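Reconstructed as a sketch from that description, the program looks roughly like this; note that on newer kernels the symbol may be called __x64_sys_read, or you could hook the syscalls:sys_enter_read tracepoint instead:

    kprobe:sys_read
    {
        // remember when this thread entered the read syscall
        @start[tid] = nsecs;
    }

    kretprobe:sys_read
    /@start[tid]/
    {
        // duration = now minus entry time, pushed into a histogram
        @ns = hist(nsecs - @start[tid]);
        delete(@start[tid]);
    }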
I have some more examples with bpftrace, but we ran out of time. So we have about four minutes for questions; actually, we have almost ten minutes. If the question is whether I can show the other example: I can show the other example. Looks like we don't have any questions, so I will show the other example, which is the biotop tool I talked about at the beginning.

As I said, it's a Python script which imports the BCC Python library. Then there is the C code, which is this part here. It defines a couple of BPF maps. Then right here, it attaches a bunch of the C functions to the given kprobes. And here are the functions that actually handle this: this is where they store the timestamp in a BPF hash, and here they handle the completion. So when an IO terminates, this function is called; it looks up whether we have the starting timestamp of this IO. If not, we return. Otherwise, we get the current timestamp and subtract, somewhere, I don't know exactly where, and that's used for collecting the information. You can see that all of this is about 230 lines of code; it's a mix, roughly half of it C and the rest Python.

You can achieve something similar with bpftrace much more simply. With bpftrace, you could write this; I'll show a rough sketch of it in a moment. In blk_account_io_start, you just note down the current command, so you know who is issuing this IO. Then in blk_mq_start_request, you store the current timestamp. In blk_account_io_completion, you subtract the starting timestamp from the current timestamp, you sum it all together using the bpftrace function sum(), and you clean up the starting timestamps. And then I run a one-second interval where I report who is currently causing most of the IO. So if I run this, once per second it prints all the processes which are causing some IO. So when I cause some IO, that's interesting. Oh, there it is. You should see that dd is currently causing a lot of IO. And before my laptop starts to overheat, that's it. So with bpftrace, you can get about the same information in about 30 lines of code, which is much simpler and faster than developing it with BCC, right? Bpftrace is your tool for figuring out what is happening in the kernel very quickly. I was just playing with it yesterday night, and it took me about 15 minutes to put this down. It just worked, very quickly.

So, is BCC just the old way to do this stuff? Let me repeat the question, thank you: the question is whether BCC is just the old way of tracing and bpftrace the new version. I think no. Both of these projects live under the IOVisor project. BCC is older, but I think they are not immediate replacements for each other, because in BCC you can do the Python post-processing, so you can do whatever you want in Python: you can collect data and feed it into your monitoring systems. So I see BCC more as a way to create production tools which use BPF, while bpftrace is more for playing, or for ad-hoc system analysis: I need to know right now what's happening in my kernel, so you use it and you drop it.

Any more questions? The question is whether you need to be root to run all these tools. There is an option upstream to not be root, but using BPF you have access to all kernel information, including all keys, all other containers, all virtual machines, everything, all users, very likely the root password. So yes, really, in your distribution this should be restricted to root only, because right now, as far as I know, there's no way to limit BPF to, for example, just one namespace or something like that. So in practice, you have to be root to run these kinds of tracing BPF programs.
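Before the next example, here is roughly what that short bpftrace version of biotop looks like. This is my reconstruction from memory, a sketch rather than the exact script: the function names match the kernels of that time (blk_account_io_start and friends have since changed), and arg0, the struct request pointer, is used as the map key tying the three probes together:

    kprobe:blk_account_io_start
    {
        // remember which process issued this request
        @comm[arg0] = comm;
    }

    kprobe:blk_mq_start_request
    {
        // timestamp when the device actually starts the request
        @start[arg0] = nsecs;
    }

    kprobe:blk_account_io_completion
    /@start[arg0] != 0/
    {
        // charge the elapsed time to the issuing process
        $who = @comm[arg0];
        @usecs[$who] = sum((nsecs - @start[arg0]) / 1000);
        delete(@start[arg0]);
        delete(@comm[arg0]);
    }

    interval:s:1
    {
        // once per second, report who is causing the IO and reset
        print(@usecs);
        clear(@usecs);
    }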
Okay, we have three more minutes, so one more example: the icmp trace. This one is even simpler, and it's actually something I recommend using. If you don't know the kernel and you just need to know which functions are being called, this is what you do. I'm tracing all ICMP functions: for any function that starts with the icmp prefix, I just print the name of that function. And there's a return probe for all these functions as well, so I also print when each function terminates. And just for fun, for one of these functions, icmp_rcv, I know that the first argument is the socket buffer of the packet which was just received, so I print the pointer to that socket buffer.

If you run this and then receive some packets, you basically get a trace of your ICMP stack. You can do the same with ftrace, but what is nice here is that for icmp_rcv, after you uncomment that probe, you can quickly get at the arguments. For selected functions where you know what the arguments are, you can just quickly trace them. So this is a very easy way: if you know there's some problem in the kernel happening under certain conditions, this is how you get to the point where, only when this function is called with these arguments, you dump the call stack, the arguments of the suspicious functions, and so on. You can also use guards: typically, you could say that once any thread runs this function with these arguments, then trace everything, so you would see the call path of that thread after it met the given conditions.

And we have one more minute. The question is: what about the overhead? There is overhead, but I would say this is probably the most efficient way you can do this kind of tracing. The real overhead is this: you have the kernel code, and after you attach a kprobe to a function, the first instruction of that function is replaced by some trap, I think; or it's originally a nop, which gets patched to jump somewhere else. So we jump somewhere else, and most of these BPF programs are like 20 or 30 instructions, so basically for each traced function, if you execute code like this one, you should expect roughly 20 extra instructions per probe. And as I said, they are jitted, so these are native instructions, and it's fairly efficient because it's compiled with LLVM; it's just normal compiled code. Probably the biggest overhead is that because you have to jump somewhere else entirely, you will very likely hit a different cache line, so that might be the biggest performance hit. The userspace communication doesn't slow anything down; it's buffered anyway. If you use the perf buffers, it's just shared memory between kernel and userland, per CPU: after your program terminates, the kernel puts some data into a perf buffer, and userland just reads it from memory and prints it. There are basically no syscalls. And I think it's a ring buffer, so if userland doesn't keep up, it will just start overwriting and you won't see some messages. Yes. Thank you. So we are out of time. Thank you very much.
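For reference, the icmp tracing script from this last demo looks roughly like this. It's a sketch from memory: func is the bpftrace builtin holding the name of the probed function, and arg0 of icmp_rcv is the struct sk_buff pointer; the icmp_rcv probe is the one you would uncomment to see the arguments:

    // fire on entry to every kernel function starting with icmp
    kprobe:icmp_* { printf("-> %s\n", func); }

    // and print again when each of them returns
    kretprobe:icmp_* { printf("<- %s\n", func); }

    // icmp_rcv's first argument is the received socket buffer
    // kprobe:icmp_rcv { printf("   icmp_rcv skb=0x%lx\n", arg0); }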