Hi, I'm Frank, and I'm here to talk to you today about lightweight eBPF tracing with Ply. So, long story short, I wrote a book on embedded Linux. This is the book. It's a big book. My co-author is Chris Simmonds, and he will be speaking tomorrow on Android and Yocto, compare and contrast, so please check that out. I currently work for Lunar Energy, which is a startup in Silicon Valley doing home electrification for a sustainable future.

So the agenda for today: first I'm going to explain what eBPF is. Then I'm going to talk about the tool, Ply, and why you want to use it on an embedded system instead of bpftrace. Then I'm going to show you what Ply can and can't do. After that, I'll show you how to enable eBPF in a Linux kernel, and then we are going to add Ply to a Buildroot image targeting the BeagleBone Black. And lastly, I'll give you examples of the kinds of scripts that you can write in Ply.

So what is eBPF? eBPF is a kernel feature. It was introduced in Linux 3.18, but it wasn't really usable until about Linux 4.4, and if you want to use it now with user space tools that require it, I recommend using at least kernel 4.9. eBPF is a virtual machine running inside the kernel. It is a sandboxed environment: it runs bytecode that is JIT-compiled into native code inside the kernel. It provides an event-driven runtime, so it's interrupt-driven; code executes when servicing an interrupt in the kernel. It is low overhead, and because it is extremely low overhead, you can use it in production. And we'll see that a number of major players in tech are in fact using eBPF in production for observability, networking, and security.

How does eBPF work? You take an eBPF program, in our case written in Ply. A process, in our case the Ply tool, takes the eBPF program, compiles it into bytecode, and sends that bytecode into the kernel using the bpf system call. Inside the kernel, the BPF bytecode is verified, and if approved, it is then JIT-compiled into native code for the CPU architecture. That program is attached to probe points in the kernel and will then run whenever you hit those probe points. In this case, it's attached to the sendmsg and recvmsg system calls, so whenever one of those functions executes, the eBPF program will run.

Who uses eBPF? Netflix uses it extensively for observability across their various microservices. Facebook uses it for load balancing and to thwart DDoS attacks. BPF stands for Berkeley Packet Filter, and they use it for precisely that: filtering packets. They need to drop packets very quickly so that the system isn't overwhelmed. Google uses eBPF for Google Kubernetes Engine, in version two of the data plane for that product. And AWS uses it for Bottlerocket, which is a micro virtual machine monitor intended for serverless and container workloads. Microsoft is porting eBPF to Windows, so it will no longer be just a Linux feature; it's on its way to Windows. And New Relic acquired Pixie Labs, an eBPF-based startup, in late 2020, and they have since open-sourced the Pixie code base. You can use the Pixie project yourself, and in fact people like AWS are using Pixie code.

Here are the leading eBPF projects. There is the compiler toolkit and library, BCC. There is bpftrace, which is a high-level tracing language similar to the one we are going to look at closely today. Katran, which is Facebook's high-performance load balancer.
Cilium, which is another Silicon Valley startup — they're the biggest proponent of eBPF right now, and their technology is being used in Google Kubernetes Engine for V2 of the data plane. And lastly, there is Falco, which is an open source security monitoring tool, also targeting Kubernetes.

So since bpftrace is out there and it's being used in the cloud, why do we not use bpftrace in embedded? Here are the reasons. bpftrace depends on LLVM at runtime for compilation. If you've ever had to build LLVM, you know that it's a mountain of C++ code and it takes a long time to compile for embedded targets. bpftrace depends on the BCC toolchain at runtime, and because that toolchain also depends on LLVM, even if you were able to remove the LLVM dependency from bpftrace, you would still be stuck with it, because BCC also depends on it. But the main reason we can't use bpftrace for embedded is that the BCC toolchain only runs on select 64-bit architectures. The best-known ones are x86-64 and ARM64, which are not as common in embedded. And BCC requires the kernel sources at runtime. You don't know which parts of the kernel source it requires, which means you usually end up deploying all of the kernel source, and you end up with an embedded image pushing two gigabytes, which is not what you want.

So why Ply? Ply has minimal dependencies: only libc at runtime. It can target 32-bit ARM and PowerPC, which are used in embedded products. And because it's written in C, not C++, it's easier to port to more architectures besides those two. It is included in Buildroot as of the February LTS release this year, so adding it is trivial. And the syntax is very similar to bpftrace, as we'll see.

So I'm going to talk about two kinds of instrumentation you can do with Ply. First, dynamic instrumentation — this is what DTrace does. If you've ever heard Solaris gurus brag about the incredible things they can do with DTrace, now we can do them with eBPF and Ply. The way dynamic instrumentation works, we're inserting breakpoints at instruction addresses, so that when you hit these breakpoints, the BPF program is triggered, information is recorded, and after the program completes, execution continues within the kernel. The problem with this kind of instrumentation is that it's susceptible to interface instability. The names of functions inside the kernel can change without notice, and I've discovered that the names are also, in some cases, architecture-specific, as we'll see. So you don't always know what the name of a system call is unless you go digging for it. The other problem dynamic tracing is susceptible to is inlining: the compiler could just inline your kernel function away — optimize it away — in which case there's no way to attach to it, because it doesn't exist. The other kind of instrumentation is static. These are stable event names hard-coded into the source code, which maintainers have to enforce. So it's a contract: you are not going to break this API; the names of these events will stay forever.

What can Ply do? It can instrument the entry into kernel functions. So here, I'm going to be probing the open syscall. It's got a weird name, as you can see — you can't successfully attach to the name you would expect. I can instrument the exit out of a kernel function; that's what kretprobe does. And then I can also do static tracepoints. In this case, we're attaching to sched_wakeup, which is a name that doesn't change.
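As a rough sketch — these are not the exact examples from the slides — probes of those three kinds look something like this in Ply syntax. The symbol for the open syscall is a placeholder (do_sys_open is a commonly probed entry point on recent kernels, but as just mentioned, the real name is kernel- and architecture-specific), and the tracepoint name mirrors the subsystem/event layout of the kernel's trace events:

    # Dynamic instrumentation: run on entry to a kernel function
    ply 'kprobe:do_sys_open { printf("%v opened a file\n", comm); }'

    # Dynamic instrumentation: run on exit, where the return value is available
    ply 'kretprobe:do_sys_open { printf("%v: open returned %v\n", comm, retval); }'

    # Static instrumentation: attach to the stable sched_wakeup tracepoint
    ply 'tracepoint:sched/sched_wakeup { printf("wakeup seen by %v\n", comm); }'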
What can Ply not do? It can't do user space. You can't attach to functions in user space, either at entry or exit. In this case, we're trying to probe the readline function in bash. These examples are from bpftrace, which can do this; you can't do it in Ply. And the same goes for static tracepoints in user space. Here we're trying to instrument a query start in MySQL. You can do it in bpftrace; you can't do it in Ply.

So what I'm going to do now is show you how you can enable eBPF in the kernel, and how you can add the Ply package to a root filesystem image. This kernel and root filesystem are intended for the BeagleBone Black. There's a GitHub repository out there, so everything I'm going to show you, I've already done. If you don't want to do it yourself, you can just clone my repo and do these four steps, and you will end up with a microSD card image that you can then boot on the BeagleBone Black and run the examples I'm going to show you later on. You simply log in as root, and the password is temppwd. There is an SSH server on it, so once you know the IP address of the device, you can simply SSH in.

So how did I go about enabling eBPF in this kernel? I started with the August release of Buildroot, and the reason I did that was because the maintainers of the BeagleBone defconfig had moved to a 5.10 kernel. Previously, the BeagleBone defconfig was on a 4.19 kernel, which is a bit long in the tooth at this point. To configure this kernel and enable the kernel features needed for eBPF, we do make linux-menuconfig, and at a minimum we need to select the BPF option and the BPF system call that I showed you earlier. Following the BCC requirements — which technically we don't all need for Ply, but it doesn't hurt — I've also enabled the cls_bpf and act_bpf network modules, as well as JIT compilation for BPF. From Linux 4.7 onward, you also need these other two options. Ply requires even more kernel support. Not only do we need eBPF, we need kprobes, we need tracepoints, we need ftrace, we need dynamic ftrace, and — this is the killer — you need kprobe events on notrace functions, otherwise you will get errors saying it could not probe a notrace function, because they are notrace. If you want to be able to probe those, you need to enable this option, which requires that all the preceding ftrace options be enabled in order to make it available.

Adding Ply to the Buildroot image is a piece of cake; it's already in the release. You simply go into menuconfig for the root filesystem, drill down into Target packages, go into Debugging, profiling and benchmark, and there's the ply package. Select it, save the defconfig, and the corresponding line will be added to your defconfig for the image. The supported target architectures are the usual 64-bit ones, plus the 32-bit ARM and PowerPC that we need for embedded targets.

What does Ply need to build and run? It needs a toolchain with dynamic library support, and a cross toolchain with kernel headers of 4.14 or later. I have found that the easiest way to get a toolchain with these things is to simply have Buildroot build the toolchain for you; it's simpler than using a pre-built toolchain from Linaro or Arm. And we've seen what features we need enabled in the kernel to use eBPF.
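Pulled together, the kernel options I just walked through look roughly like this as a config fragment — a sketch against a mainline 5.10 Kconfig, not the exact defconfig from my repo, and some of these get selected automatically once you pick the visible ones:

    # Core eBPF support
    CONFIG_BPF=y
    CONFIG_BPF_SYSCALL=y
    CONFIG_BPF_JIT=y
    # Networking classifier/action hooks, wanted by BCC but not strictly needed by Ply
    CONFIG_NET_CLS_BPF=m
    CONFIG_NET_ACT_BPF=m
    # Probing and tracing infrastructure that Ply relies on
    CONFIG_KPROBES=y
    CONFIG_KPROBE_EVENTS=y
    CONFIG_FTRACE=y
    CONFIG_FUNCTION_TRACER=y
    CONFIG_DYNAMIC_FTRACE=y
    # The killer: allow kprobe events on notrace functions
    CONFIG_KPROBE_EVENTS_ON_NOTRACE=y

On the Buildroot side, selecting the ply package adds a single line to the defconfig — BR2_PACKAGE_PLY=y, if I remember the symbol correctly.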
On your build machine, you will need flex and bison in order to cross-compile Ply, and once you've got it onto your target, you will need to either log in as root or use the CAP_SYS_ADMIN capability. I find it's just easier to log in as root, since we're not doing this in production; otherwise, you won't be able to run your Ply scripts. And lastly, you need the debug filesystem mounted at /sys/kernel/debug for Ply to work.

Here are Ply's command line options. The most interesting one is -S, which will show you the BPF bytecode instructions — pretty cool. The other useful option is -c, which will run a command in a shell, and when that command exits, it will end the Ply trace session for you, so you don't have to.

Here's an example of a Ply one-liner that uses that -c option. In this case, I'm running the dd command, copying 100 blocks from /dev/zero to /dev/null, and then I'm executing the following Ply one-liner, which is going to probe all the VFS kernel calls. The star is a wildcard, and this is going to count the number of calls to VFS functions and display them to you by executable and function called. The second one-liner again uses a wildcard pattern. The k is shorthand for kprobe, so you don't have to write kprobe out. And this time we're looking at all the system calls, so everything with that leading prefix — I know it's a weird name; that goes back to the name stability issues that I talked about. This will count all the syscalls and display their counts to you by caller.

So here's that -c one-liner that called dd. There you can see the dd command ran, 100 blocks were read, and 100 blocks were written out. It automatically deactivated the Ply session, and here is the output from Ply. You can see in the first column the executable name: there's dd, there's Dropbear, there's ply. And for vfs_read and vfs_write at the bottom, we have counts of 101 and 108, which correspond to the 100 blocks that we dd'd. Now here's the result of the second one-liner. We're counting syscalls system-wide by function. You can see that gettimeofday is being called a heck of a lot: 1,431 times. Same with gettime, 486 calls. So something on our system really wants to know what time it is.

Now let's look at Ply syntax. Ply syntax is, I'm told, very much like awk. A Ply program consists of multiple probes. Here is an example of a single probe. It starts with a provider — kprobe, kretprobe, tracepoint — then a colon, and then the probe points. Those are the function names, or the patterns that we want to match against the function names. There's an optional predicate, which I won't go into. Within the curly braces is the actual program that gets executed. And again, many BPF programs will have multiple probes.

So you saw that the first thing on the far left was a provider. These are the providers. We have kprobe to match entry into kernel functions and kretprobe to match exit out of kernel functions — again, this is dynamic instrumentation. And we have the tracepoint provider. These are keywords, and tracepoints are those static trace points we looked at earlier. In order to find out what the names of the kernel functions are, you can cat the kallsyms file in the proc filesystem. Likewise for the tracepoints, you can list the contents of the events directory under /sys/kernel/debug/tracing, which will give you different subdirectories for the various subsystems in the kernel; when you look into one of those directories, you will see the different events for those subsystems. You can also use perf to get these names. If you do perf list, you can give it some patterns to search for.
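A rough sketch of those two one-liners and of the name-discovery commands — the dd arguments and the sys_ prefix are my reconstructions, not copies of the slides, so substitute whatever prefix /proc/kallsyms shows on your kernel:

    # Count calls into the VFS layer, keyed by executable and function,
    # while dd copies 100 blocks from /dev/zero to /dev/null
    ply -c 'dd if=/dev/zero of=/dev/null count=100' \
        'kprobe:vfs_* { @[comm, caller] = count(); }'

    # Count syscalls system-wide by the probed function (k: is short for kprobe:)
    ply 'k:sys_* { @[caller] = count(); }'

    # Finding probe names
    grep vfs_ /proc/kallsyms                       # kernel function symbols
    ls /sys/kernel/debug/tracing/events/sched/     # static tracepoints, by subsystem
    perf list 'sched:*'                            # the same names via perf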
Builtins — this is where the magic happens. These are global variables and functions that are automatically available within your Ply program. comm is the name of the running process's executable. pid is the kernel thread group ID, so if it's a single-threaded process, the pid is going to be the same as the kpid; if it's a multi-threaded process, there will be multiple kpids for one thread group. time is a timestamp, in nanoseconds since the system booted — very useful if you're trying to profile. And printf is the function we know and love. If you want to do printf debugging of kernel functions, knock yourself out; it just works. And if you don't know the type of the function arguments you're trying to print out, you can just use %v in the format string, and it will infer the type, just like print would do if you were using print instead of printf.

Here are other handy variables. These are provider-specific, so they're not available in every context. You can get the stack trace — the kernel stack trace, not a user stack trace — in string format, and you can just print that out if you want to know where you are when you hit a probe point. This works only for kprobes and kretprobes, so entry and exit. On entry, you can also get the name of the function that triggered the probe using the caller variable, and the arguments passed into the function using arg0, arg1, and so on. For kretprobe, which fires on exit, there is a retval variable, which gives you the return value out of the probed function.

The basic data structure in Ply is a map: keys and values. Maps take the form of a name followed by what can be multiple expressions within square brackets, separated by commas, and whatever is within the square brackets is the key into the map. Here's an example of assigning into a map. I'm calling this map rx. I have this magic arg0 variable already available to me in the context, and whatever it evaluates to is going to be the key into this map. Then I assign a timestamp into the map so that I can retrieve that timestamp later. Similarly, if you just want to get the value out of the map, you can do it like this: we have a map called rx, and we fetch whatever value arg0 keys to. That gets us the previous timestamp, and by subtracting that previous timestamp from the current timestamp, you are able to know the time elapsed between kernel probes.

Aggregations are a special kind of map. They start with an @ sign, and you don't need a name — the name is optional. The expressions within the square brackets act as keys, just like with a regular map. What you use aggregations for is to capture the result of an aggregation function. We've already seen the count function; we used that to count calls to various functions, and it just bumps a counter. Another aggregation function is quantize, which gives you the distribution of the map values, and we'll see what that looks like right here. So here's a one-liner. We are attaching to the exit point of the read syscall. We are calling this distribution "size", and we are calling the quantize function on the return value from the read, which is the size of the read, as we know. Ply programs like this run until you kill them, so when I ran this, I actually had to hit Ctrl-C after a while in order to force it to exit. Once I do that, it dumps out its output.
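Sketching those two patterns — first_probe and second_probe are placeholders for whatever related pair of kernel functions you are timing, and sys_read is again a guess at the read syscall's symbol on this kernel:

    # A plain map: stash a timestamp keyed on arg0 at one probe...
    kprobe:first_probe  { rx[arg0] = time; }
    # ...then compute the elapsed nanoseconds when a related probe sees the same arg0
    kprobe:second_probe { printf("elapsed: %v ns\n", time - rx[arg0]); }

    # An anonymous aggregation: histogram of read() sizes at syscall exit
    kretprobe:sys_read { @["size"] = quantize(retval); }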
And this is the output that you see on the terminal. What we have here is two calls to read within the range of 8 to 15 bytes, one call in the range of 16 to 31 bytes, and then we can see that the vast majority of the calls to read were between 128 and 255 bytes in size.

Another handy Ply script is opensnoop. opensnoop will show you the process IDs and executable names and the path to the file that those processes were trying to open, as they open it, as well as the return code. That's what we're printing out here in the return probe — printf debugging. This is what the output from opensnoop reveals when running on our BeagleBone Black target. We can see that there's a redis-server process, PID 284, and it is trying to open these files repeatedly, very quickly — I only ran the script for a second or two. One of those files is localtime, so for some reason Redis really wants to know the time. And because the return code is negative, it looks like something is going wrong with the open, which is probably why it keeps repeating itself.

Similarly, there is execsnoop, which looks for short-lived processes. If you've ever tried to debug a system in production, you'll run into the situation where there is a service in a crash loop that is just spawning processes and eating up resources on your system. What execsnoop does is instrument the call to execve. You capture the argument passed into it — the executable name — and store it by kernel PID into a map. Then, upon exit, you print out the user ID, the path to the executable, and the return value of the execution. So this is what that looks like. I start execsnoop in the background. I go and stop Redis by calling its init script, and immediately you see the output from execsnoop. It shows a user ID of zero within the parentheses, because root is what ran that init script, and an exit code of zero. So it's stopping Redis, and in order to stop Redis, that shell script calls redis-cli, which again triggers execsnoop, so we see the call to redis-cli to stop Redis. The OK is just the output from stopping Redis. Some time goes by, and I start Redis back up using the same init script. We see the init script execute again; this time, instead of calling redis-cli, it actually calls start-stop-daemon to start the Redis daemon. And then we see the redis-server itself start up. It has a user ID of 1002, so it doesn't run as root; it runs as a different user. That start also had a return code of zero.

I'll skip this slide. This is a Ply script that looks at tcp_sendmsg and tcp_recvmsg calls. It counts them, and it shows you which executable made the calls and which direction the packet is going — that's the send and receive that will show up in the output; it's part of the key into the aggregation.
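The full script isn't reproduced in this transcript, but a minimal sketch of the idea — assuming the probed functions are tcp_sendmsg and tcp_recvmsg, and using literal "send"/"recv" strings of my own for the direction — might be:

    # Count TCP sends and receives, keyed by executable and direction
    kprobe:tcp_sendmsg { @[comm, "send"] = count(); }
    kprobe:tcp_recvmsg { @[comm, "recv"] = count(); }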
So again, we run our script in the background. This time I call redis-cli to do a latency test, and what that does is just start pinging the Redis server. I let that run for a few seconds; eventually it pings it 1,025 times. I kill that, I bring the Ply program back into the foreground, and then I kill that. Now I can see the redis-server had 1,025 sends and 1,026 receives, and likewise redis-cli has the same numbers, because it's just pinging the redis-server over localhost. You can see that Dropbear, which is my SSH server, has a lot of sends: 1,041. I'm SSH'd into this machine, so because I'm SSH'd in, it's sending packets. If you look at the call that I made to redis-cli, you'll see there were 1,025 samples. Those samples had to count up from zero on my terminal, and each of those updates was a send over the SSH connection. That's where those sends are coming from.

And here's my favorite: heap allocations. This is a heap allocation size distribution. In this case we are instrumenting entry into the brk system call. What brk does is take an address in memory, and this address is the new end of the data segment of a process — you could think of it as the heap; it's like the top of the heap. So by looking at that address and comparing it to the previous end, you're able to tell how much the heap is growing at each call to brk. In order to show a distribution, I used the quantize function. In this case I'm taking the new end, subtracting the previous end, and that is the value I'm quantizing on. Once I've done that, I save it into the map as the allocation size, and then I need to update the stored end of the data segment for that executable and kpid with arg0, so that my next comparison starts from the new end.

So here's an example of this in use. I flush everything out of the Redis server. I start the heap allocation Ply script in the background. Then I run an LRU simulation using 100,000 keys, so it's going to insert and look up values in the cache. Here you can see the hits and misses — as you can see, the misses go down as the cache becomes more and more populated. I stop that simulation, I kill the Ply script, and here is the distribution of heap allocations. The majority are between 4K and 8K, there's a good number between 8K and 16K, and there are a few larger allocations between 16K and 32K for that simulation session.

All right. So I mentioned the book. I didn't show the book. So this is the book. I have six copies here with me, and I'm happy to give them out to anybody who wants one. Oh, and how are we doing on time? Do we have time for questions? Okay, yeah. Would anyone like to ask any questions?

Yeah, yeah, printf. So yeah, I used it in those different examples; I call printf. That printf is running in the kernel probe. So when you see the output from those Ply scripts that I showed you, a lot of the time that output is from the printf call. You know, when you see the executable name and the PID and the return value, those are all wrapped in a printf, and it's arg0, arg1, et cetera that are getting printed out to the... Yeah, yes. It's saving it into the map. That's what those maps do: saving those values into the map so that you can fetch them later and update them later.

Yeah, it's in nanoseconds. And keep in mind, all of this is running inside the kernel. eBPF is a VM inside the kernel, and it's JIT-compiled — compiled into native code, Armv7 instructions in this case — so it's going to be fast. And it's also like an interrupt, you know? That's how it's being serviced; that's how the code is being triggered.

Okay. No problem. Yeah. Well, okay, so there is a bug here — I'm glad you caught it. When you look up that heap map entry for a given comm and kpid the first time, it isn't there, right? You didn't define it. So I think it's either going to be nil or it's going to be garbage. I don't know; I wouldn't count on it. But yeah, good catch.

No, only kernel. I'll repeat the question for those listening online. So Ply cannot do...
The question was, can Ply do user space tracing? The answer, as far as I know, is no. The second question was, could that be added, given that Ply does not have LLVM or BCC as a dependency? I don't know — I honestly don't know. My guess is that there must be some challenges; otherwise, they would have already tried to do it. Oh, it's 5:50, so we're right on time. Any more questions? If not, I'm actually going to go onto Twitter to see if... I mean, sorry, not Twitter — Slack, to see if anybody has questions on Slack. It doesn't look like it. Yeah, so let me get you your books so I can sign them. And thank you for coming.