So we'll talk about traceloop and Inspector Gadget, BPF tools to trace system calls. My name is Alban Crequy; you can reach me at this address. And this is in the context of using that on Kubernetes, because at Kinvolk we do Linux and Kubernetes things together. I really like strace as a way to debug my applications; I like the output it gives. And this is driven by wanting to use that on Kubernetes, with BPF underneath. I will describe the motivation a bit later.

So traceloop is a tool to trace system calls in cgroups using BPF and overwritable perf ring buffers. Inspector Gadget is something on top of it, running on Kubernetes, and one of the gadgets of Inspector Gadget is traceloop, to be able to use it on Kubernetes.

I will not go long on this, I think you had several BPF talks before, but basically, this is how BPF works: you have a BPF program written in C, compiled with Clang and LLVM into BPF bytecode, and then loaded into the kernel with the bpf() system call. The kernel verifies that your BPF program doesn't cause any problems, and then later it is executed. Your BPF program can communicate with user space through BPF maps.

Traceloop stands on the shoulders of giants; it is not the first BPF tool. So I want to mention BCC and bpftrace as really useful tools before going into traceloop. I find BCC very useful to learn about BPF: whenever I want to know about a specific BPF helper function, I can look at the code to see how things are done. There are a lot of examples for tracing, for networking, and so on, to get inspiration from, and it has a lot of tracing tools to trace different aspects of the Linux operating system. As for bpftrace, we had several talks about it today; it uses a different language, its own language, and it also has tools tracing different parts of the system. Traceloop uses BPF as well, but it is not based directly on those two: it uses gobpf to do its BPF work.

The goal of traceloop is to be able to retroactively trace Kubernetes pods that have crashed. Let's go into that idea. I mentioned that I really like strace and I want to use it on Kubernetes. But strace has some drawbacks that make it impossible to use on all the pods all the time in production: it is too slow, and that is not really its use case anyway. So if I use strace to debug something, I cannot use it on all the Kubernetes pods, only on the specific things I want to trace. And that is often difficult, because if something crashes, I don't know about it in advance, and once it has crashed, it is too late to run the strace command: the process is not there anymore.

So the idea of traceloop is to have a flight recorder: we are always tracing all the system calls from all the processes in the Kubernetes pods, in all the containers, and recording these events into a ring buffer that is almost never read, unless there is a crash or the user wants to know something about the logs. Only in a post-mortem situation do we inspect the ring buffer.

Strace and traceloop work differently. Strace uses ptrace to get the information it needs; traceloop uses BPF on tracepoints. The granularity is different: strace traces specific processes, one or several, while traceloop uses cgroups and filters the events based on cgroups.
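To make the tracepoint part concrete, here is a minimal sketch, not traceloop's actual source, of a BPF program attached to the raw_syscalls:sys_enter tracepoint that only reacts to one cgroup. The names and the `target_cgroup_id` variable are my assumptions; it relies on bpf_get_current_cgroup_id(), which, as discussed later, needs Linux 4.18 and cgroup v2.

```c
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

/* Layout of the sys_enter tracepoint context: the common tracepoint
 * header, then the syscall number and its six raw arguments. */
struct sys_enter_ctx {
    __u64 unused;            /* common fields */
    long id;                 /* system call number */
    unsigned long args[6];   /* raw syscall arguments */
};

/* Placeholder: user space fills this in before loading the program. */
const volatile __u64 target_cgroup_id = 0;

SEC("tracepoint/raw_syscalls/sys_enter")
int trace_sys_enter(struct sys_enter_ctx *ctx)
{
    /* Executed for every syscall on the system; filter early. */
    if (bpf_get_current_cgroup_id() != target_cgroup_id)
        return 0;

    /* Here traceloop would record ctx->id and ctx->args into a
     * per-cgroup ring buffer; omitted in this sketch. */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```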
Strace, as we have seen in the previous talk, is faster than before, but it is still slow due to the several round trips between the kernel and ptrace. Traceloop is fast because it doesn't actually fetch the information unless the user asks for it: in the general case, the information is written into the ring buffer without ever being read.

But the reliability is quite different. Strace is reliable: you always get all the events you need, it cannot lose events, and it works in a synchronous way, so you see the events one after another. Traceloop is different: it can lose events. In some cases there are too many events and the ring buffer is full, for example, so traceloop will not notice everything. And in some cases traceloop might not be able to read the system call parameters: if a parameter of a system call is in a memory page which is not mapped, maybe because it was swapped out, then when you try to read it from a BPF program, the kernel will not be able to load the page from disk. So in some conditions we don't get everything. But still, the feature for the user is the same: being able to see the system calls.

So here is how it looks. At the top, I have a tracepoint on sys_enter. It means that every time any process on the system executes any system call, this BPF program at the top here will be executed. Then I want to distinguish which Kubernetes pod, which cgroup, or which systemd service I am in. So I look at the cgroup and reroute the execution to a different BPF program depending on the cgroup. That way I can distinguish whether it is pod number one, pod number two, or something else, and execute a different program that has its own perf ring buffer. This ring buffer is continuously written to when there are new system calls, but not read, unless the user asks for it because something crashed and they want to debug.

The ring buffer is configured to be overwritable, which is not the default. By default, when a perf ring buffer is full, you don't write into it anymore; these ones are configured to overwrite the oldest events, like a flight recorder.

This is the same view as before, just explained in a different way. In traceloop I have two different BPF object files: a main object file tracing sys_enter and sys_exit, which reroutes the execution to different modules depending on the cgroup; I will show a sketch of that dispatch in a moment.

So I will start with a demo with a cgroup. Let's see. Here I have an SSH daemon running on my laptop, and I have a command here that I will use. Oops, sorry. This command asks traceloop to trace the system calls from a specific cgroup, and I specified the cgroup of the sshd service. So now it starts to record everything that sshd does. In another window, if I do something like ssh localhost, then sshd does things. Traceloop doesn't print anything; it only prints something when I ask for it with Ctrl-C, and then I get the last system calls from all the processes inside the sshd cgroup.

I had another demo of a possible integration in a systemd service, just an example where some shell script is executed and I want to debug it. Traceloop can also work as a command-line tool: here is the daemon, and we can ask it to add and remove cgroups on the fly. I will not demo it now because I'm limited in time. So that was traceloop.
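Before moving on to Kubernetes, here is a sketch of that dispatch, with assumed map names and current libbpf-style map definitions rather than traceloop's actual code: the main program looks up the current cgroup ID in a hash map to find a slot, then tail-calls into a program array where user space has installed one handler per traced cgroup, each writing to its own perf ring buffer.

```c
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

/* cgroup ID -> slot in the program array, filled by user space. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 128);
    __type(key, __u64);
    __type(value, __u32);
} cgroup_slots SEC(".maps");

/* One BPF program per traced cgroup, installed by user space. */
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 128);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} handlers SEC(".maps");

SEC("tracepoint/raw_syscalls/sys_enter")
int dispatch(void *ctx)
{
    __u64 cgid = bpf_get_current_cgroup_id();
    __u32 *slot = bpf_map_lookup_elem(&cgroup_slots, &cgid);

    if (!slot)
        return 0; /* not a cgroup we trace */

    /* Jump to the per-cgroup handler; does not return on success. */
    bpf_tail_call(ctx, &handlers, *slot);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

The per-cgroup handlers then emit events into their own buffers; the overwritable, flight-recorder behavior is configured from user space when the perf ring buffer is set up, not in the BPF code itself.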
Now, how do we adapt this kind of tool to Kubernetes? In Kubernetes, we usually don't care about specific PIDs; the granularity of tracing is usually the Kubernetes pod, and we usually use concepts like Kubernetes labels to select the different pods. And the user doesn't want to use SSH; they want their own kubectl-style interface. There are already tracing tools for Kubernetes based on the different BPF Linux tracing tools: for bpftrace, for example, there is kubectl-trace. And BCC and traceloop are both used in Inspector Gadget: some gadgets use BCC, and others, like the traceloop gadget, use traceloop. The user doesn't need to SSH into any worker node; they just use the kubectl user interface to connect to the Kubernetes cluster, and that's it.

I will demo Inspector Gadget quickly. Here I have a Kubernetes cluster and the Inspector Gadget command with some subcommands; one of the subcommands is traceloop, and I can ask it to list the different traces. What I will do is start a new pod with some shell script, with some bugs in it, so it will not do what I want: I don't get the result of the multiplication I wanted because of a mistake in the shell script. Here I can see that the pod is there, completed, and I see a new trace from my new pod, and I can get the last system calls it executed. So here I see that the bc program inside the pod read some multiplication from standard input and printed the result.

OK, so that's Inspector Gadget with the traceloop component. I will not demo the other gadgets, but I can show on the web page that there is another gadget called execsnoop; that's directly the code from BCC, applied to Kubernetes. So I can use Inspector Gadget with execsnoop and specify Kubernetes labels to select which Kubernetes pods to trace, plus additional filters, like on namespaces or on nodes and so on. And it prints the same thing as execsnoop from BCC.

OK, so now I want to talk a bit about the difficulties I had when working on this project: how to select the pod from the BPF program. The issue I have is that when I add a tracepoint with BPF, this BPF program is executed for all the processes on the system, not for a specific pod. So I need to filter on some cgroup or process. I usually use the BPF helper function bpf_get_current_cgroup_id(), which is available since Linux 4.18. But it only works with cgroup v2, and the issue is that, most of the time, Kubernetes uses cgroup v1 only; there is only some recent effort to make it work with cgroup v2. So what I needed to do was to enable cgroup v2 on my Kubernetes cluster: I had to change the configuration of systemd, the container runtime, the kubelet, and so on to use cgroup v2.

Then, how do I select a Kubernetes pod from my BPF program? What I did in this version was to add a BPF map containing the list of Kubernetes labels. From that, I have some BPF pseudocode: I read the cgroup, and from that cgroup I read the Kubernetes labels from the BPF map. In that way, I can filter on specific labels. To be able to add those Kubernetes labels into BPF maps, I added OCI hooks: if you know runc, it follows a spec called OCI, and there is a way to add hooks at the beginning and at the end of the execution of a container. So from the prestart hook, I ask the Kubernetes API to give me the list of labels and populate the BPF maps. If you know BPF, you might see that this is not really the correct way to do it, because I do string comparisons inside BPF, as the sketch below illustrates.
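Here is a small sketch of that first version, with assumed names and a single hard-coded label value, not the real traceloop code: the prestart hook would fill `pod_labels` keyed by cgroup ID, and the BPF program compares the label value byte by byte, which is exactly the costly string comparison mentioned above.

```c
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

#define LABEL_LEN 64

struct labels {
    char role[LABEL_LEN]; /* e.g. the value of a "role" label */
};

/* cgroup ID -> Kubernetes labels, populated by the OCI prestart hook. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u64);
    __type(value, struct labels);
} pod_labels SEC(".maps");

static __always_inline int label_matches(const char *value, const char *want)
{
    /* The verifier needs a bounded (here unrolled) loop. */
#pragma unroll
    for (int i = 0; i < LABEL_LEN; i++) {
        if (value[i] != want[i])
            return 0;
        if (value[i] == '\0')
            return 1;
    }
    return 1;
}

SEC("tracepoint/raw_syscalls/sys_enter")
int filter_by_label(void *ctx)
{
    __u64 cgid = bpf_get_current_cgroup_id();
    struct labels *l = bpf_map_lookup_elem(&pod_labels, &cgid);
    const char want[LABEL_LEN] = "demo"; /* assumed label value */

    if (!l || !label_matches(l->role, want))
        return 0;

    /* ... trace the event ... */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

This is why IDs are preferable: user space can compare the strings once, when the container starts, and hand the BPF program a numeric key instead.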
So that's not great, but that's the first version. I have a pull request to do things in a different way, to use IDs instead of string comparisons in BPF. But nonetheless, it works.

Now I will talk about some stopgaps that I did in traceloop, things that are not perfect, but yeah. At the moment, traceloop works on different Kubernetes configurations: it works on Kinvolk's Linux and Kubernetes distributions, Flatcar Container Linux and Lokomotive, on minikube, and on GKE. But minikube and GKE were using Linux 4.14, and this version doesn't have the BPF helper function that I need, doesn't have cgroup v2 by default, and doesn't have a way to set up an OCI hook in runc. So to make it work on older Kubernetes setups and older minikubes, I use some hacks, or stopgaps.

I don't use that BPF helper function anymore. Instead, I get the namespace of the container by following kernel structures: from the current task structure to the UTS namespace, and then the inode of that namespace (see the sketch below). And since I cannot use OCI hooks, I cannot add a new BPF module from user space for each new cgroup, because I don't have a hook from which to call that code. Instead, I pre-populate the ring buffers and modules at the beginning, and I detect from the BPF code whenever there is a new container, that is, whenever there is a new namespace. Then, from the BPF code, I update the prog array map to redirect the execution flow to the correct module. That way, I can catch the very first events: if I were to do this asynchronously when there is a new container, it might execute some system calls before I have a chance to get them, but this way I catch them from the beginning. And it works on minikube and all those kinds of versions. Later on, we can match what the BPF code discovered about the new namespace against the Kubernetes API, to know what belongs to which pod. That is how the demo you saw works on minikube.
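Here is a sketch, with placeholder offsets, of that namespace stopgap: instead of bpf_get_current_cgroup_id(), follow pointers from the current task_struct to its UTS namespace and use the namespace's inode number to identify the container. The offsets below are made up for illustration; without BTF/CO-RE they must match the running kernel exactly, which is what makes this a hack.

```c
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

/* Placeholder offsets; the real values depend on the kernel build. */
#define TASK_NSPROXY_OFF 0x7c8 /* offsetof(struct task_struct, nsproxy) */
#define NSPROXY_UTS_OFF  0x08  /* offsetof(struct nsproxy, uts_ns) */
#define UTS_NS_INUM_OFF  0x10  /* offsetof(struct uts_namespace, ns.inum) */

static __always_inline __u32 get_uts_ns_inum(void)
{
    __u64 task = bpf_get_current_task(); /* current task_struct */
    void *nsproxy = NULL;
    void *uts_ns = NULL;
    __u32 inum = 0;

    /* task_struct->nsproxy->uts_ns->ns.inum, one unsafe read per hop. */
    bpf_probe_read(&nsproxy, sizeof(nsproxy),
                   (void *)(task + TASK_NSPROXY_OFF));
    if (!nsproxy)
        return 0;
    bpf_probe_read(&uts_ns, sizeof(uts_ns),
                   (char *)nsproxy + NSPROXY_UTS_OFF);
    if (!uts_ns)
        return 0;
    bpf_probe_read(&inum, sizeof(inum),
                   (char *)uts_ns + UTS_NS_INUM_OFF);
    return inum; /* containers with their own UTS namespace get distinct inums */
}
```

A previously unseen inode number is how the BPF code notices a new container and steers it to one of the pre-allocated modules.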
Thank you. Are there any questions?

So, do we have any questions? How deep can you go in tracing all this stuff? I saw you were printing some strings, but that's probably all you can do. Or can you, like, print structures or even something more sophisticated?

No, I cannot. If I go back to this: at the moment, I lack all the knowledge that comes from the strace project about how to parse structures from the other arguments and so on. At the moment, I just get the integer arguments, which is easy because they are directly available from BPF, and when an argument is a string, I use bpf_probe_read() to get access to it. But I don't do any further parsing, like dereferencing structures. So that's a limitation of the implementation that it would be possible to lift, but at the same time, I don't really know how to reuse the knowledge from strace, everything that is already implemented in strace; I'm not really sure. At the moment, traceloop automatically parses the list of system calls from debugfs, and from that it knows whether an argument is a char *, a string, or something else, but I don't really do much more than that.

Yeah, and there are also some multiplexing system calls that can pass all kinds of different types, so it's really complicated. So I just wondered, what are your plans in this respect? Do you plan to overcome these limitations somehow?

I'm not quite sure. I know that for one system call, lstat I think, it didn't work just as it is; I'm not sure whether that is because it's a multiplexed system call or not. But someone added a workaround for it. I think adding a workaround for each system call is not really the way forward, though. I'm not really sure. I don't know.

OK, thank you.

All right, do we have another question? It seems not. So thank you very much.

Thank you.