Good morning everybody. I will talk about using BPF iterators to gain insight into Kubernetes. My name is Alban Crequy. I am a Principal Software Engineer at Microsoft, and I joined Microsoft with the acquisition of Kinvolk. I'm working on a team focused on Kubernetes and eBPF, and one of the projects we work on is Inspektor Gadget. It's a kubectl plugin that uses eBPF to get insight into Kubernetes, and we recently made use of BPF iterators to write new gadgets in Inspektor Gadget.

So in this talk I will go through the timeline from classic BPF until today with BPF iterators. I will explain a bit what they are and what they can do, show some demos using the examples in the Linux sources, and explain what's missing to make this useful for Kubernetes. Then I will show how it works in Inspektor Gadget with the collector gadgets, briefly mention how it interacts with namespaces, how Inspektor Gadget enriches the data with container metadata, and how it does filtering. At the end I will do another demo and explain what's next for BPF iterators in Inspektor Gadget.

So BPF is not something new. It started in 1992, when classic BPF was used by tcpdump for packet capture. Then in 2013 came extended BPF, which brought BPF maps to exchange data between BPF programs and user space, and a lot of new BPF program types, so it's no longer only about packet capture but also about tracing and security, with different mechanisms. Last year, with Linux 5.8, BPF iterators got into the kernel, and a few months ago the cilium/ebpf Go library added support for BPF iterators. So the last pieces are in place to be able to use them properly.

As I mentioned, BPF has a lot of different kinds of programs: for networking, for tracing, for security. They can be attached to different kernel hooks like LSM hooks, kprobes and so on. The new kind is BPF iterators, which can be used to iterate over different kinds of kernel objects. For example, we can iterate over every task, meaning every thread on the Linux system, to list them; or over TCP sockets, or UDP sockets, and so on.

This is an example of a BPF iterator that iterates over the threads on the Linux system. As you can see, it's a short program, 25 lines of code. Lines 8 and 9 declare that it's an iterator over tasks. Then there is this new BPF helper function called bpf_seq_printf, which prints the output of the BPF program. The program is executed once for each thread on the system and prints one line per thread.

Okay, so now it's time for a demo. Let's see in the terminal. Here you can see I am in the Linux sources. We will first go to the location of the BPF selftests, where other useful example BPF programs are: tools/testing/selftests/bpf. Here you can see the BPF programs; I will look at all the files called bpf_iter_*, which are all the examples, and I compile them. One of them is bpf_iter_task.o; this is the BPF program I showed you before, the one that iterates over tasks. Now I can load this BPF program using bpftool, and pin it on the BPF filesystem: it takes bpf_iter_task.o as input and pins it under /sys/fs/bpf, which I set up just for this demo, with the same name, bpf_iter_task. So now there is this file, which you can see here, and I can execute the iterator just by using cat on this file.
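As a reference, the task iterator program from the slide looks roughly like this. This is a minimal sketch modeled on tools/testing/selftests/bpf/progs/bpf_iter_task.c, assuming a vmlinux.h generated from the kernel's BTF; the selftest itself uses its own local headers, so the details differ slightly:

```c
// Minimal sketch of a task iterator, modeled on the selftest
// tools/testing/selftests/bpf/progs/bpf_iter_task.c.
#include "vmlinux.h"          // assumed: generated from kernel BTF
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>  // BPF_SEQ_PRINTF wraps the bpf_seq_printf helper

char _license[] SEC("license") = "GPL";

// The "iter/task" section name declares this as a task iterator program.
SEC("iter/task")
int dump_task(struct bpf_iter__task *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct task_struct *task = ctx->task;

	// The program is also invoked once at the end with task == NULL.
	if (!task)
		return 0;

	// Print a header on the first invocation only.
	if (ctx->meta->seq_num == 0)
		BPF_SEQ_PRINTF(seq, "    tgid      pid\n");

	// Invoked once per thread; one line of output each.
	BPF_SEQ_PRINTF(seq, "%8d %8d\n", task->tgid, task->pid);
	return 0;
}
```

To reproduce the demo, pinning and reading it should look something like `bpftool iter pin bpf_iter_task.o /sys/fs/bpf/bpf_iter_task` followed by `cat /sys/fs/bpf/bpf_iter_task`.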
Running cat executes the BPF program once per thread and prints one line for each of them. Here it prints the process ID and the thread ID for everything on this system.

Okay, so that's useful already, but now I will show you how I can add more data to this. I created a copy of this program: from bpf_iter_task I made a copy called kubecon, where I just added a couple of lines: two fields, one for the command name and one to print the mount namespace ID. When I attach this, I can execute it the same way: I use bpftool in the same way, but with my copy, and then I run it just like before. So now I can add any data I can find in the kernel through the task struct; here I print the command name and the mount namespace ID.

From this, I will show you how it differs from the classic ps command. ps can also list processes with additional information, but it does that by looking at different files in /proc, and it does so in a less efficient way, because it needs to open a lot of files in /proc. If we actually count the system calls needed to run this ps command, there are more than 5000 of them. If I compare that with just using the BPF iterator, you only need to open the pinned iterator and read from it, which is a lot fewer system calls.

Now I will show another iterator, over TCP sockets. Same as before, I use bpftool to pin it on the BPF filesystem, and then the cat command to run it. Here you can see it prints one line for each TCP socket, with the IPs printed in hexadecimal. That's a bit difficult to read, but here it is. What I want to show you is that when we run a BPF iterator, it matters which namespace we are in. Here I used cat directly in the current network namespace, but if I run it from a different network namespace, you will see that the result is different. I use the unshare command to create a new network namespace, and here you can see the result is empty, because in this new network namespace there are no sockets. And the same goes for tasks: if I create a new PID namespace and run it there, there is only one process, the cat process itself, and that's what it shows here. So it really matters which namespace you are in to be able to see the right information.

Okay, so let's go back to the slides. In summary, I showed you where to find the examples in the Linux sources; how to use bpftool and cat to load and test BPF iterators; how to list processes and how to add more metadata, here the command name and the mount namespace ID; how to list TCP sockets; and the difference in how many system calls are used compared with the traditional commands.

Okay, and if we want to use this in Kubernetes, we have to add more things on top of that. When we use Kubernetes, we care more about high-level constructs like pods, Kubernetes namespaces or labels, and we care less about a specific PID or specific IPs. So we want to enrich the data we get so far with additional container metadata, to know, for example, which Kubernetes pod a process belongs to, or which Kubernetes service an IP belongs to. Next, we want to be able to filter the information: not, as before, taking all the information on one specific node, but filtering by Kubernetes namespace, for example, or by labels.
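For reference, the modified copy of the task iterator from the demo looks roughly like this: the same iterator, extended to also print the command name and the mount namespace ID. This is a hedged sketch assuming the same includes and license declaration as the earlier sketch; the direct pointer walk through task->nsproxy works because iterator programs are BTF-enabled tracing programs:

```c
// Sketch of the extended task iterator: adds the command name (comm)
// and the mount namespace ID to each output line.
SEC("iter/task")
int dump_task_extra(struct bpf_iter__task *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct task_struct *task = ctx->task;
	u64 mntns_id;

	if (!task)
		return 0;

	// The mount namespace ID is the inode number of the namespace,
	// the same value you see with: readlink /proc/<pid>/ns/mnt
	mntns_id = task->nsproxy->mnt_ns->ns.inum;

	BPF_SEQ_PRINTF(seq, "%8d %8d %-16s %10llu\n",
		       task->tgid, task->pid, task->comm, mntns_id);
	return 0;
}
```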
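And the TCP socket iterator from the demo looks roughly like this, condensed from the selftest tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c; as in the demo output, addresses and ports are printed as raw hexadecimal:

```c
// Sketch of a TCP socket iterator: one line per IPv4 socket, addresses
// in raw hex (network byte order), local port in host byte order.
SEC("iter/tcp")
int dump_tcp(struct bpf_iter__tcp *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct sock_common *sk_common = ctx->sk_common;

	if (!sk_common)
		return 0;

	// Only handle IPv4 sockets in this sketch.
	if (sk_common->skc_family != 2 /* AF_INET */)
		return 0;

	BPF_SEQ_PRINTF(seq, "%08X:%04X %08X:%04X\n",
		       sk_common->skc_rcv_saddr, sk_common->skc_num,
		       sk_common->skc_daddr, sk_common->skc_dport);
	return 0;
}
```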
So now I will demo this in the terminal again. Let me show the Kubernetes cluster first, where I deployed an example application in the sock-shop namespace. Here I will use kubectl gadget and call this new gadget, the process collector. I can specify which namespace I want to look at, and here I can see the processes that belong to a specific Kubernetes namespace. I see, for example, that in this rabbitmq pod there are a certain number of processes. I can add a filter using labels, for example rabbitmq, and here it filters only on that expression. And I can do the same with another collector, the socket collector. And here, demo effect, it doesn't work... and here it is fixed, and I run the command again. Here you can see the socket collector with the list of TCP sockets and UDP sockets in the specific Kubernetes namespace, and I can filter by label in the same way.

So you have seen the process collector and socket collector gadgets. They get this information from BPF iterators, with code very similar to the examples from the Linux kernel. Then they add more data about the specific Kubernetes namespace, Kubernetes pods and so on, and they filter the results to only print information that belongs to the right containers.

About what I explained before with namespaces: it matters which namespace you collect the information from. For the process collector, since PID namespaces are hierarchical, we run the BPF iterator only once, in the host PID namespace, and then in the BPF program we filter the information to only print output for the containers that matter to us. But for the socket collector, since network namespaces are not hierarchical, we need to run the BPF iterator once for each network namespace we care about, so in practice once per pod. Here is an illustration of PID namespaces: that's the triangle, with the processes in yellow. You can see the inner PID namespaces are nested inside the host PID namespace. But network namespaces are not hierarchical, so they are isolated from each other, like this.

How does the container metadata enrichment work? We use a BPF hash map where the keys are the mount namespace IDs, and the value contains a list of strings: the container ID (the 64 hexadecimal digits we get from, for example, Docker), the Kubernetes namespace, and the Kubernetes pod name and container name. This BPF hash map is kept up to date by the gadget tracer manager. Then, when we execute the BPF program of the iterator, the program just reads this map to know what to print.

Here is, for example, how it's done in this BPF iterator. First we get the task struct for the thread we want to print, then we look at its mount namespace ID, then we do a lookup in the containers map with the mount namespace ID as the key, and then we can print the information we care about.

I can also show you this BPF map and how it works under the hood. Let's go back to the terminal. Now I will execute this command to enter the gadget pod from Inspektor Gadget and look at the BPF filesystem. We have this containers BPF map, and you can see we can dump its contents. But first I want to get the mount namespace ID for the rabbitmq pod. So here I enter the rabbitmq pod, look at the mount namespace ID and get this number. From here, if I want to, I can dump the map, and I see a list of keys and values. If I look at the key with the correct mount namespace ID, I can see the different strings here in hexadecimal, with the pod name, container names and so on.
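Under the hood, the BPF side of this enrichment looks roughly like the sketch below: a hash map keyed by mount namespace ID whose values carry the container metadata, and the iterator doing the lookup, which doubles as the filter. The struct layout, sizes and names here are illustrative assumptions for this sketch, not Inspektor Gadget's exact definitions:

```c
// Illustrative container metadata, filled in from userspace by the
// gadget tracer manager (the layout is an assumption for this sketch).
struct container {
	char container_id[128];
	char kubernetes_namespace[128];
	char kubernetes_pod[128];
	char kubernetes_container[128];
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, u64);               // mount namespace ID (ns inode number)
	__type(value, struct container);
} containers SEC(".maps");

SEC("iter/task")
int dump_pod_processes(struct bpf_iter__task *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct task_struct *task = ctx->task;
	struct container *c;
	u64 mntns_id;

	if (!task)
		return 0;

	// 1. Get the mount namespace ID of this thread.
	mntns_id = task->nsproxy->mnt_ns->ns.inum;

	// 2. Look it up in the containers map. A miss means the thread is
	//    not in a container we track, so this is also the filter.
	c = bpf_map_lookup_elem(&containers, &mntns_id);
	if (!c)
		return 0;

	// 3. Print the enriched, filtered output.
	BPF_SEQ_PRINTF(seq, "%s %s %s %8d %-16s\n",
		       c->kubernetes_namespace, c->kubernetes_pod,
		       c->kubernetes_container, task->tgid, task->comm);
	return 0;
}
```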
So that was a short insight into how you get the information from there.

What's next with BPF iterators? What I would like to have in the two collector gadgets is the ability to get the stack traces of the processes, and there are examples in the Linux sources showing how to do that. Also the list of memory pages, for which there is an example as well, and the list of files open on the system for each pod. Then, in the socket collector, instead of printing the IP addresses directly, I would like to resolve them, to know whether they belong to a specific Kubernetes pod or service. And then we could write plugins for Headlamp, which is a Kubernetes web UI, and see the results of Inspektor Gadget there. That's what I would like to have implemented. Additionally, I would like Inspektor Gadget to be more modular, so that the data is available not only through the kubectl plugin but also to other projects, so that for example the Go package can be reused directly and be useful for other projects. Thank you.