Okay, so next up, Alban Crequy is going to talk to us about tracing container syscalls using BPF.

Thank you for coming, everybody. Can you hear me fine? Cool. So I will talk about Inspektor Gadget and traceloop, about tracing container syscalls using BPF. My name is Alban. I work at Kinvolk. At Kinvolk we do low-level Linux development: we have a Linux distribution, a Kubernetes distribution as well, and we do cool things on top of that with BPF. So today I'm talking about strace, Kubernetes, and BPF.

I really like to use strace as a debugging tool. How many of you like to use strace? Cool. I still find it useful. I like to use it to find out what my applications are doing. So first I will present traceloop. What it does is trace system calls, a bit like strace, but done in a different way: it uses cgroups and BPF, and it uses overwritable ring buffers, which I will explain a bit later. Then I will talk about Inspektor Gadget, which is a layer on top that lets you use traceloop and other tools on a Kubernetes cluster. So traceloop is a command-line tool or a daemon that you use on one machine, and Inspektor Gadget runs on a Kubernetes cluster. If you are interested in these projects, you can join the Kubernetes Slack; there is an Inspektor Gadget channel there.

So it uses BPF. BPF is basically a bytecode: you write your program in C, compile it with clang/LLVM into BPF bytecode, and you get an object file. You can load this object file into the kernel with the bpf() system call. The kernel will first verify that you don't do anything bad, because this code runs in the kernel, so it would be dangerous if it could access arbitrary kernel memory and so on. So there is some strict verification there.
Then the BPF program is allowed to run. It can interact with user space by sending messages through BPF maps, and it can call some kernel functions that are called BPF helper functions. So that's BPF in a nutshell, and traceloop builds on that.

So, my use case: why do I want strace on Kubernetes? Sometimes when something crashes, I wish I had run it under strace to see the last system calls it made. But I cannot run strace on every application and every process in production: strace is slow, and that's simply not a use case it supports all the time. And if I want to run strace on only the one application that is going to crash, I don't know in advance which application that will be, so I cannot start strace retroactively.

So the idea is to have a system, built with BPF, that records all the system calls executed by all the applications on Kubernetes into a ring buffer. It's an overwritable ring buffer, so when the buffer fills up, new events overwrite the oldest ones. The ring buffer stays in memory, and if a pod crashes — your application crashes — you can ask: what were the last system calls recorded in the ring buffer?

Here I compare strace and traceloop, which work differently. strace uses ptrace; the granularity is one process, or several processes. traceloop works with BPF and tracepoints; the granularity is the cgroup, so I can select the cgroup I will get traces from. strace is kind of slow — you cannot really use it on every process for everything — but traceloop is fast, because we don't actually read the system call events; only when something crashes, or when the user asks for it, do we fetch the last system calls. And finally, strace is reliable: it's synchronous, so it cannot lose events. All the system calls made by your application will be printed by the strace command; it will not forget any.
traceloop is not like that. It's possible that some events are lost, because it uses a ring buffer, and if you miss some events, that's just what happens. And sometimes you might not get all the parameters of the system calls, as opposed to the ptrace approach; with the BPF method, in some cases you won't have all the information. But the traces are still very useful and good enough.

So what traceloop does is attach to the sys_enter tracepoint on Linux. Every time a process enters a system call, this tracepoint is triggered and executes the BPF program. The BPF program decides what to do with that system call: it first looks at which cgroup it is in — which pod, which container — and then, depending on the container, it redirects execution to another BPF program. It uses a BPF map of a special kind called a program array map. Depending on the cgroup, it redirects execution to a different module and logs the system call into a different ring buffer. Those ring buffers are configured to be overwritable — that's not the default for perf ring buffers, which stop writing when they are full — this one just keeps overwriting continuously. And then, when the user asks for it, they can read the last few system calls.

So I can do a demo of traceloop on the command line. traceloop is a CLI tool — is this big enough? It can run as a daemon, or on the command line like this. Here, I tell it to trace whatever process is in the sshd cgroup. So let's try this. Okay, now it starts to trace. I will generate some traffic on sshd — I mean, I want to make it do some system calls. Okay, so now it did some system calls. As you see, there is no trace output here, because traceloop doesn't actually read the ring buffer yet. Only when I ask for it — here, with Ctrl-C — do I get the last system calls from sshd. So I see it does system calls like recvmsg, select, and so on.
The last few system calls are printed here, and I can debug my application. Okay, so that was traceloop on the command line. Now I want to adapt it to Kubernetes.

So what do we want for Kubernetes? We don't really want the user to SSH to a node. Usually the user doesn't care about PIDs or cgroups; they care about Kubernetes pods, or Kubernetes labels. So you want to be able to select the thing to debug by pod or by labels. And the user experience should be something close to kubectl, the command-line tool for Kubernetes, without having to SSH to the node.

It turns out there are already Kubernetes tools doing that. On the left side, I have some BPF tools: bpftrace, BCC, and traceloop. On the right side, some Kubernetes-level tools that use the tools from the left. For example, kubectl-trace — a really cool project — can run bpftrace on a Kubernetes cluster. And Inspektor Gadget uses BCC and traceloop. Basically, it works like this: on your laptop, you use kubectl gadget, a client-side plugin for kubectl, and it issues API calls to Kubernetes. It does not do any SSH or anything like that; it only goes through the API server and requests some BCC scripts or BPF programs to be executed on the node.

So now it's time for another demo: traceloop with Inspektor Gadget. Here I have some Kubernetes pods running. I'll prepare a command which starts a new pod with a small shell script. So I run my pod. It computes a multiplication here. But I didn't write my shell script correctly, so I don't see the result. I still have a way to debug it, even if I delete the pod that I just used. (I have 10 minutes left.) My pod is gone, because I just deleted it, but with Inspektor Gadget's traceloop list command I can see the list of the last few traces. And I should be able to find the right one — this one, 28 seconds ago.
I run Inspektor Gadget's traceloop show on it, and I should see the last system calls that were recorded in the ring buffer. So I can see, for example, that the bc process and the shell process were doing the multiplication and printing the output, so I can debug my application here.

Now I will show some stopgaps in traceloop — things I implemented that are not the perfect way — and I will explain why. Initially, I built Inspektor Gadget for Kinvolk's distributions: Flatcar Container Linux and our Kubernetes distribution. But then I wanted to make it work on other, I mean, older Linux kernels. For example, minikube or GKE use 4.14, which doesn't have the BPF features I wanted: it doesn't have the BPF helper function I wanted, and it doesn't have cgroup v2 enabled by default. And there is no proper way in Kubernetes to use OCI hooks. But traceloop still works on all those Kubernetes distributions, with some hacks, or workarounds, or stopgaps.

So, instead of using bpf_get_current_cgroup_id — the BPF helper function that is not available on older kernels — I use a custom way to get the IDs of the Linux namespaces, and I use those to identify which container I am looking at. I don't have OCI hooks, so I cannot add a new BPF module for each pod at the start of the container. And I cannot really use the Kubernetes API to discover a new container, because that would be too late: by the time I get the notification from the Kubernetes API or the Docker API, the container is already running. It's already making system calls, so I would not catch the very first ones. And that was an important use case for me — being able to trace the very first system calls — because maybe the container crashes in the very first second, and I want to be able to debug that.
So instead, I have a pool of BPF modules that are preloaded, and I dynamically link them to new cgroups as they appear. At that point I don't yet know which Kubernetes labels, which pod, or which container they belong to, but I can reconcile that a bit later, when I get the notification.

So far I have talked about traceloop, which is one gadget of Inspektor Gadget. There are other gadgets for different use cases — some for debugging your applications. Some ideas are not finished yet, work in progress. For example, I want to be able to see what network connections my pod is making, to help a developer write network policies. It's like doing security as an afterthought: sometimes we develop an application and think about security later — oh, we should maybe add some network policies, or some Pod Security Policies, or something like that. But by the time the application is already developed, we have forgotten the architecture, so it's useful to discover what it's doing and suggest network policies to the developer. That's not finished work, but it's something I would like to have.

I will just do a demo of execsnoop and opensnoop. These are tools that come from BCC — I just took the BCC scripts. I have the same Kubernetes cluster, and here I run Inspektor Gadget's execsnoop, and I specify the label of the pod I want to trace. And here, on my three-node cluster, I will see every new process that gets executed, with execsnoop. Okay. I have the same thing with opensnoop: it traces every time a file is opened. Since I run some shell scripts, they use libc and so on, and they open different files. Okay.

To be able to do that, I use something I call the Gadget Tracer Manager. This is the component that handles filtering.
Usually I don't want to get information from all the pods all the time; I select either by Kubernetes label, or by a specific Kubernetes namespace, or a pod name, or a node, or the container index when there are several containers in the same pod. So those are the different ways to filter things. Filtering by label is quite important to me, because usually, when you deploy a pod using a Deployment on Kubernetes, you don't know the name of the pod in advance — Kubernetes creates the name with a random suffix. So if I want to trace the very first system call, I need to be able to filter in other ways. So I use labels.

Pods can come and go — I don't know their names in advance — and tracers can come and go too, and I need a system to link them together: for example, this tracer will get information from these two pods, but not from the third one, and so on. For that, the Gadget Tracer Manager is just a daemon with a gRPC API and methods to add or remove containers and tracers. Here I use the prestart OCI hook: every time there is a new container, I tell the Tracer Manager there is a new container, and likewise when one stops. And when I run Inspektor Gadget with some gadget, it tells the Gadget Tracer Manager that there is a new tracer, or that one has stopped. What the Tracer Manager does is update BPF maps: there is one map per tracer, and this map contains the list of cgroups that the tracer should trace — the list of containers, basically. Then, when I run a BCC script like execsnoop, I specify which BPF map to look at to know which cgroups to trace — basically, which pods or container labels I want to trace.

If you want to contribute, I just created a couple of labels on GitHub: there are some issues with a good-first-issue label. Those are things that are a bit easier, where help is welcome.
And there is this new Inspektor Gadget Slack channel that you can join, and we can talk there as well. Thank you. So now I can take questions.

So Alban, you mentioned your Lokomotive distribution. What do you need that for? Is there anything you can do with it that you cannot do with anything else?

So it doesn't have anything magic — it's just normal Linux technology. traceloop works on other, older kernels as well. But for the other gadgets, I need the latest BPF helper functions, such as bpf_get_current_cgroup_id. And I did some hacking on runc to be able to add or remove OCI hooks. There is work in progress in CRI-O and containerd, I think, to make that configurable, but at the moment it's done in a hacky way until we get that upstream. I also use cgroup v2, enabled by default, so each container is in a different cgroup v2 cgroup, which might not be the case on other Kubernetes distributions. But all of these are Linux technologies, so you can enable them elsewhere as well. There is nothing really specific to the Lokomotive Kubernetes distribution.

Okay, we're out of time. Thank you. Thanks.