So, my name is Alban. I'm in Berlin. I work at Kinvolk. I love Linux and eBPF. At Kinvolk, we care about Kubernetes and low-level Linux development, things like eBPF as well. I will not introduce eBPF in too much depth. How many of you know about eBPF? Okay, almost everybody. So I will just introduce it quickly, then show a few demos, and then explain how things work.

In a few words, eBPF is small programs that run inside the Linux kernel and that can be used for security, for tracing, or for networking. In this talk, I will focus only on tracing: how to trace your Kubernetes cluster using eBPF. I will talk about concepts like kprobes, uprobes, USDT, and tracepoints. Using eBPF is safe compared to running other things in the Linux kernel, like kernel modules; it will not crash your kernel. It's safe because there are restrictions on what you can do: in the kernel, there is a verifier that checks that your eBPF program will not run indefinitely, will not do unauthorized memory accesses, and so on. And it's fast as well: eBPF is a bytecode, but it gets compiled to native code, so it runs about as fast as a normal function call in the kernel. If you want to learn more about eBPF, a good pointer is the BCC project, the BPF Compiler Collection. And if you use Go, you can look at gobpf as well.

So, I will start with the first demo. This one is about bpftrace. bpftrace is a tool that runs on one node. You start it from the command line and you can type a small one-liner eBPF program: it has its own language that you type on the command line, and the program is compiled and run on that node. So it is not cluster-aware or anything. Let me bring up the demo. Is the font big enough? So, here I have, okay, this one is not exactly a one-liner, but it's a small bpftrace program. What it does: it uses a uprobe, a return probe, to trace all the bash processes. Every time you type a command in bash, it captures the readline function and prints the return value of readline. So, that's the command. If I start this program, so far nothing happens. But if I go to another terminal and type some commands... yeah, okay, those were some commands. And here, I see the results: every command I typed there was captured by bpftrace. It does that for all the processes on the system that run bash: it puts a uprobe on a specific function in the bash binary and captures the result.
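For the record, the canonical bpftrace one-liner for this bash demo looks like the following; this is a close sketch, the exact program on screen may have differed slightly:

    # put a return probe (uretprobe) on readline() in the bash binary;
    # whenever any bash process returns from readline(), print the typed line
    bpftrace -e 'uretprobe:/bin/bash:readline { printf("%s\n", str(retval)); }'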
So, the idea of kubectl trace is to do the same kind of thing, but at the cluster level rather than on a single node. Now we'll do the second demo, on kubectl trace. For that, I have a slightly more complex setup: a Kubernetes installation. I use Flatcar Linux, the Linux distribution that we build at Kinvolk, as the base, with a Kinvolk Kubernetes distribution and a special image that has the latest Linux kernel, because I need some BPF functions that only exist in recent kernels. And I will do the demo with a microservices application that was developed by two companies, Container Solutions and Weaveworks. Let me show you what this application looks like. It is a web shop where you can buy socks, or pretend to buy socks, and click on the different articles you can buy. And it is running on Kubernetes. So, let me stop that. I have a Kubernetes cluster with a namespace for this application, and about a dozen different containers.

And here, I want to trace what happens in one specific container, the front-end. Before this talk, I prepared a few environment variables: the pod I want to trace is running on this node, under the cgroup listed here. Now I will copy-paste the command I prepared before. This one. What it does: I use kubectl trace and specify on which node it should run. It attaches to a kernel function with a kprobe; the kernel function is do_sys_open, so every time the open system call is invoked, the BPF program is executed. The BPF program does something like a printf: it prints the program name and the first argument of the open system call. I have a Node.js application, so the name of the program is node, and it is opening some files, but nothing important so far.

Now, I don't want to trace all the processes on the system, only the processes that run in one specific pod. So I added a filter here; that's something you can do with bpftrace. I will show you. Let me stop that. I only print the trace when the current cgroup is the one of the container. So, I start this again, and if I go to Firefox and refresh the page, you see many more files being opened: the node process running in the container has opened a lot of files. So, that's it for this demo. To sum up, with kubectl trace, you can deploy a pod running bpftrace and inspect other pods with BPF. (This slide was a backup recording in case the network was not working, but the network is working here, so it shows the same thing.)

So, how does it work? kubectl trace is a client-side plugin to kubectl: it doesn't run on the Kubernetes cluster, it runs on my laptop. When I type kubectl trace, kubectl executes the plugin. The plugin does not SSH to the server or anything like that; it only uses Kubernetes primitives. It creates native Kubernetes resources like a ConfigMap, a Job, a pod, and so on. The first thing it does is create a ConfigMap with the content of the program. Then it creates a Job that gets scheduled on the worker node. The Job runs a trace-runner pod that runs bpftrace, and that is what installs the BPF program. Let me zoom in on that pod to see what it does. The process fetches the program code from the ConfigMap. It gets the kernel headers from the host and compiles the program into BPF bytecode; it uses LLVM for that. Then it loads the BPF program into the kernel with the bpf() system call, which gives us a file descriptor representing the program. Finally, it attaches the program to specific hooks, to kprobes or uprobes and so on, via the tracefs file system. A lot of this is done with libbcc.
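To make the tracefs step more concrete, this is roughly how a kprobe event is registered through tracefs, similar to what happens under the hood; a sketch, run as root on the node:

    # create a kprobe event named my_open on the kernel function do_sys_open
    echo 'p:my_open do_sys_open' >> /sys/kernel/debug/tracing/kprobe_events
    # list the registered kprobe events; a BPF program can then be attached to the event
    cat /sys/kernel/debug/tracing/kprobe_events
    # remove the event again
    echo '-:my_open' >> /sys/kernel/debug/tracing/kprobe_events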
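And putting the pieces together, the command I pasted in the demo looked roughly like this; it's a sketch, with the node name and the cgroup path as placeholders standing in for my environment variables:

    # trace open() system calls, but only for processes in one container's cgroup
    kubectl trace run ip-10-0-0-1.example.internal -e '
      kprobe:do_sys_open
      /cgroup == cgroupid("/sys/fs/cgroup/unified/kubepods/pod-example/container-example")/
      {
        printf("%s: %s\n", comm, str(arg1));  // program name and the file being opened
      }'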
To be able to do all that on a Kubernetes cluster, we need a lot of privileges. For example, we need CAP_SYS_ADMIN, because many BPF operations require this capability. We need access to some host volumes, for example the tracefs mount, and a volume to get access to the kernel headers. So, in Kubernetes, we use pod options like privileged: true, the host volumes, and so on. If your Kubernetes cluster is configured to use pod security policies, you need to be careful to configure them correctly: you need a service account, cluster roles, and so on. That's the usual RBAC machinery, role-based access control.

With kubectl trace, you can do different kinds of tracing, in the kernel or in userland. In the kernel, you can use tracepoints, which are statically defined in the Linux source code, or kprobes, which are more dynamic: you can take any kernel function and put a probe on it. In user programs, you can use USDT or uprobes.

Tracepoints work like this: they are statically defined, so at the beginning of a function there may be a tracepoint that is executed or not, depending on whether a trace is attached at the moment. When it is, it executes the BPF program, which emits some events. Kprobes are very similar, but the code is patched at runtime, so you can do that on any function: the first instruction is replaced by a jump, some registers are saved, the BPF program is called, and then execution resumes after the original instruction has been executed. USDT is user-level statically defined tracing; I give an example here. How do you know whether one of your programs defines such probes? You can use the readelf command with -n and you will see the probes listed. And uprobes: that's what the earlier demo used to read the command line from the bash processes. So, those are the four main mechanisms.

When preparing this demo, my biggest challenge was how to do this filter. If I show it again: how do I do this filter? It was not implemented at the beginning, so I implemented it in bpftrace and in kubectl trace. To select the pod, the issue is that BPF programs are installed in the kernel globally. They are not installed for a specific PID or cgroup or container; they run globally. So I need a way, inside the BPF program, to check what the current context is. I looked at the list of BPF helper functions, and some of them can help us. This one gets the comm, the short name of the program. That's not really so useful, because multiple programs can have the same name, and it is easily changed. The most interesting one is the cgroup ID helper, bpf_get_current_cgroup_id(), which is available in recent kernel versions. Looking at its documentation, it returns a 64-bit integer giving the cgroup ID. Based on that, I did the implementation in bpftrace, in the language specific to bpftrace: a builtin, cgroup, that returns the current cgroup ID using this BPF helper function, and then a function, cgroupid(), that does the translation: you give it the path of a cgroup and it returns the ID.

Another issue: in the kernel, we have two versions of cgroups, cgroup v1 and cgroup v2. Both can exist at the same time if you enable both, but that's a bit complicated to manage. The BPF helper function only knows about cgroup v2, and the problem is that Kubernetes normally uses only cgroup v1. So I needed to make some changes there. First, I configured systemd on the host to use both v1 and v2 with this parameter. Then there is a configuration in Docker that says: don't manage cgroups yourself, ask systemd to do it; and because systemd is configured for both v1 and v2, it will do that. There are similar options you can give to the kubelet and to containerd. After doing that, the containers run with cgroup v1 and v2 at the same time, and that's how I can filter on the pod with its cgroup v2 ID.
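For reference, a sketch of the kind of configuration this involves; the exact flag names depend on your component versions, so treat this as an example to verify rather than an exact recipe:

    # systemd hybrid mode (cgroup v1 and v2 mounted at the same time) is selected
    # on the kernel command line, e.g. via systemd.unified_cgroup_hierarchy
    # Docker: delegate cgroup management to systemd instead of doing it itself
    dockerd --exec-opt native.cgroupdriver=systemd
    # kubelet: use the matching cgroup driver
    kubelet --cgroup-driver=systemd ...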
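And coming back to USDT for a moment: checking whether a binary defines USDT probes with readelf looks like this; the binary path is just an example of a program built with such probes:

    # USDT probes appear as "stapsdt" notes in the ELF notes section
    readelf -n /usr/sbin/mysqld | grep -A2 stapsdt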
So, to finish this talk, I will give three short ideas about things I would like to see in this project in the future, that maybe I could implement, or you could. First, improve the user interface: the way I did it here, I had to give the node name and the full cgroup path in environment variables. Ideally, there would be a way for kubectl trace to get that automatically. That kind of automation should not be too complicated to implement.

The second idea is aggregation. Usually, when we do a deployment on Kubernetes, we have several replicas, different pods, potentially running on different nodes, and it would be good to gather the traces from those different pods and aggregate the results.

And a third idea, to finish: at the moment, kubectl trace needs a lot of privileges on the cluster, but sometimes it's not so good that every user has complete access to the cluster. It would be nice if a user could inspect their own pods without being an admin on the cluster. That's an idea, but I have no clear picture of how to implement it; it would require a lot of thinking, I think.

Okay. Thank you. Are there any questions?

We've got time for a couple of questions.

Great talk, thanks. Do you need the kernel headers to run bpftrace?

Yes. Well, I think it depends on what kind of probe you use. If you use kprobes, yes, you need the kernel headers to know what you are inspecting in the kernel. With tracepoints, I'm not sure; I think it should be possible without, but I have not tested it.

Doesn't look like it. Thank you.