One of the reasons I'm giving this talk is that in my current day-to-day I have to deal with Kubernetes clusters, Linux machines, and in general whatever happens in those clusters, and it's sometimes difficult to understand things. We build a time series database, so one of the things we need to know is, for example, how many bytes have been written to a given file today. And it turns out that extracting that kind of information is not straightforward. So we started exploring how you can get that information out of the kernel, and at some point we discovered there's this thing called eBPF. I didn't know about eBPF a year and a half ago myself, but I had the opportunity to explore it a bit, so I put together this talk to tell everyone else about it. eBPF stands for extended BPF. BPF is basically a kernel feature that lets you get information from the machine you're running on using a syscall. Historically it stands for Berkeley Packet Filter, so everyone assumes it's just about networking, but it turns out it's not, at least not on Linux, because it has been extended to do many more things. In this talk I'm presenting BPF as a tracing framework; that's the use case I focused on, because I wanted to extract information from my machines, but it can actually be used for other things too. I'll mention those other uses so you know about them, but they're not what I'm showing here. BPF by itself is not about extracting information or instrumenting the kernel; it's something you use to drive other frameworks that are already in the kernel. The kernel has some tracing backends — static tracepoints, kprobes, uprobes, as we'll see — and you basically use BPF to access those tracing frameworks.
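To make the motivating example concrete: the bytes-written question from the beginning becomes a one-liner once you have a frontend like bpftrace (which comes up later in the talk). This is a sketch, not the exact command from the talk; vfs_write's third argument (arg2) is the requested byte count:

```
# Sum bytes passed to vfs_write, keyed by process name
bpftrace -e 'kprobe:vfs_write { @bytes[comm] = sum(arg2); }'
```

Press Ctrl-C and bpftrace prints the per-process totals accumulated in the @bytes map, all aggregated in kernel space.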
BPF itself is not a tracing framework; it's the tool you use to go into the kernel and drive the tracing frameworks, right? Instead of going there with a kernel module or a custom syscall, you go there using BPF. One of the tracing frameworks hosted by the kernel is static tracepoints. Static tracepoints are trace points that are fixed in the kernel, already there for you to use. They are defined, as everyone can see, under this folder: by printing a file you can see all the static tracepoints defined on your machine. Those have been defined by the kernel developers, so they're already there, with their arguments and return values, and you can extract information from them; we'll see how later. What's very interesting, on the other hand, is being able to define dynamic tracing points, which can be in the kernel or in your user-space programs, namely kprobes and uprobes. Kprobes are the backend that lets you extract information from the kernel: you don't attach a kprobe to a binary, you attach it to the kernel itself, so your kprobe can be attached to a kernel function. For example, if you want to see, as in the example I made before, how many bytes are written to a file, you attach to vfs_write. vfs_write is the function that gets called in the kernel every time a file is written, and everything is a file, so that's it. Same for vfs_read. With uprobes, on the other hand, you attach to your own program's functions. We'll see that in a program I have called caturday that shows cats: it's a Go program with a main package, and it has a function that returns an integer variable, the counter of the requests this program has received. That lives in user space, and you can attach to it with a uprobe. And then there's XDP.
This talk is not about XDP, but I want to tell you what it is. XDP is the eXpress Data Path, a framework built on top of eBPF that lets you do packet mangling and manage your packets, so you can build things like firewalls. But it allows more than just rejecting packets in kernel space: you can reject packets directly at the network card, so that you don't even allocate the socket buffer (sk_buff) struct for the packet, which gives you a great performance improvement because the kernel doesn't even process the packet. This whole BPF thing is about doing the processing of your information in kernel space instead of user space. There's a mechanism in BPF called maps. There's a wide set of map types you can use that basically let you aggregate information at the kernel level so you don't overwhelm your user space: the kernel can easily deal with, say, thousands of packets that you want to analyze, but it's better not to send all of them to user space if you just want to know how many bytes have been written in 10 seconds. You can do that aggregation in kernel space. The lifecycle of a BPF program is quite simple. When you first see it, it seems a bit crazy, but the idea is that you have a program written for a specific instruction set, the BPF assembly, which runs on the BPF virtual machine. The BPF virtual machine is implemented in the kernel, and it has a thing called the static verifier, a process in the kernel that ensures you don't kernel panic the machine. So as opposed to writing a kernel module, which of course gives you more opportunity for total customization of the kernel, a BPF program cannot kernel panic your machine, and in general the verifier doesn't allow you to do bad things.
And this is both a nice thing and a bad thing, because sometimes you just want to do things and it doesn't let you, but it's nice that there's something watching over us. Then the bpf() syscall comes in. You have this BPF bytecode, which you can produce with, for example, Clang using its BPF backend — clang -target bpf — so you can write a C program that gets compiled to a BPF program, and then with the bpf() syscall you load it. All of this seems crazy, but there are higher-level tools that do it for you. By passing the right arguments to the bpf() syscall you load the program; the static verifier reads the program and ensures everything is okay, and then the BPF VM understands what kind of program this is: a kprobe, a uprobe, a static tracepoint, whatever. And if the program registers itself to a map, that is what lets you get information back in user space; it's basically your communication medium to get data out. For example, conntrack is a thing, right? You could implement connection tracking with BPF, and to get the information back for your own conntrack you would use a map that gives you a stream of the connections being opened. And the asterisk on the slide says that your programs must not be Turing-complete: basically you cannot write loops, because the static verifier doesn't allow that. But if your loop is fixed — for example you have a list of servers, or a list of IPs for a network interface you have to check, a loop over a set of elements you already know at compile time — you can unroll the loop with a #pragma unroll in your compiler, for example. Or you can just write the thing down 10 times. And what about today? I mean, you've seen that this thing is pretty crazy. Some of you know it already.
I started giving this talk one year ago, and, well, maybe it's because this is FOSDEM, so the people who come here are more on the cutting edge of this kind of thing, but in general almost no one knows what the applications are today. One of the most common is tcpdump. Who uses or has used tcpdump? Everyone, basically. Not you there, but I see you. tcpdump is a program that dumps out the packets that your machine, or a specific network interface you point it at, is receiving. The tcpdump invocation I'm using here gets all the packets that are IP, and TCP, and on port 80, so this will be HTTP traffic. And if you pass the -d parameter to tcpdump, it dumps out the BPF program it loads to do that. So tcpdump actually uses BPF programs even if you never see them. BPF is a mainstream thing now: even if you are not yet using it in your own programs, it's already there for everyone. What it dumps out here is the portion of the BPF instruction set that this filter is using, and if you look at it, it's basically a reflection of the protocol specs, right? So these two instructions mean: is this an Ethernet frame carrying an IPv4 packet? I don't read this stuff automatically — if I hadn't written it down here, I wouldn't know, but I did the math at home and annotated it. Then: is the source port, at offset x + 14, equal to 80, so 0x50 in hexadecimal? And the same thing for the destination. Since I didn't specify src port 80 or dst port 80, just port 80, it assumes either side can be 80, and it adds two checks, one for source and one for destination. Who uses Kubernetes, containers, Docker, whatever? Raise your hands.
The main isolation technique that container runtimes use to let your containers safely share the kernel with the host machine is seccomp. What does seccomp do? If you go to the Docker repo — it's the Moby repo now, actually — you will see that it has a predefined seccomp profile, a set of rules that, for example, doesn't give you CAP_SYS_ADMIN, so you don't become a network administrator, these kinds of things. And it turns out that seccomp has a BPF subsystem that you can use to define your own rules. Plain seccomp just lets you define rules like: block this syscall, block this syscall, block this syscall. But with the BPF subsystem you can actually write your own logic: if the process ID is odd, block the syscall, these kinds of things. The use case I just described is probably a bit silly, but it's something you can do. In this demo I compiled this program here. The program does a first print there, and then it installs a filter with this function I defined here. As you can see, I'm using a DSL to write the BPF assembly directly. You don't really need to do that; I didn't really know what I was doing, and it took me ages to end up with that thing. I could have written this in C, compiled it with Clang, and copied the result here — that's basically the process I used — but I really wanted to be cool and write it in assembly. So that's what I did. Then I load that BPF program into seccomp with those instructions. This filter is installed on __NR_write for this architecture, so I'm blocking all writes to everything, and I return EPERM when you try to write. So every write issued after this seccomp registration is going to be denied. I then started the program under strace, just to see.
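In plain terms, the decision logic that filter implements can be modeled like this. To be clear, this is a toy simulation of the policy, not the real seccomp(2)/prctl(2) interface; the syscall number and the SECCOMP_RET_* action names are the real Linux ones, but everything else is made up for illustration:

```go
package main

import "fmt"

// Toy model of the two-rule seccomp-BPF filter from the demo:
// return EPERM for write(2), allow everything else.
// This does NOT install a real filter; it just shows the policy
// the in-kernel BPF program applies to every syscall.
const nrWrite = 1 // __NR_write on x86_64

func filter(nr int) string {
	if nr == nrWrite {
		return "SECCOMP_RET_ERRNO(EPERM)" // deny with errno EPERM
	}
	return "SECCOMP_RET_ALLOW" // let the syscall through
}

func main() {
	// 0 = read, 1 = write, 2 = open on x86_64
	for _, nr := range []int{0, 1, 2} {
		fmt.Printf("syscall %d -> %s\n", nr, filter(nr))
	}
}
```

The real filter is a small cBPF program that loads the syscall number from the seccomp_data struct, compares it against __NR_write, and returns one of those action codes.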
And the first write call went okay, and all the subsequent ones just returned EPERM, and then the program exited. What are more practical examples? Tracing files opened by file name — I've written a bunch of these. Go runtime events. XDP for firewalls and packet rewriting. Tracing commands run in a bash shell. Writing a keylogger, or whatever you want to write. One of the use cases I use a lot: since I work on InfluxDB, tracing queries done against the database. I don't work on the core of InfluxDB, so I don't really know where to look, and I don't really want to mangle their code — the core engineers wouldn't let me — but I still want that information, so I just load a BPF program to see what's happening. Yes, it's crazy, and nobody wants to write all of that by hand. That statement is probably true at another conference; here it's not that false. So, higher-level APIs — maybe not strictly necessary for FOSDEM people, but let's look at them anyway. An interesting project is IOVisor's gobpf. If you're a Go programmer, you might want to load your BPF programs from Go. gobpf is basically a set of bindings that lets you compile and load BPF programs: you provide the C program, as I do here with this bashreadline.c file — I'll get to that program in a second — or you can give gobpf the program already compiled. The difference is that if you ask gobpf to compile the program, you need the compiler on the target machine, while if you provide it precompiled, you just ship a single Go binary to the target machine. But the real innovation in this repo is that it lets you subscribe to the maps I told you about before using Go channels, so you can read the maps concurrently. Let's go very quickly through this program. In this bashreadline.c here, what it's doing is some setup that the static verifier requires for your program to work, and then I get the current process ID.
I register it in this event that I created here, which comes from this struct that is also defined in my Go program. So when I send data through the map, I have to maintain the struct in both the BPF program and the Go program. Then I register this BPF_PERF_OUTPUT as a map. BPF_PERF_OUTPUT is the map you want when you need a generic channel to send things out: if you don't want a specific map type, this one is generic, always available, and works very well for debugging use cases. So BPF_PERF_OUTPUT here, with the same readline_events name here and, in this case, here. I register to this table, and every time the readline function is called in bash, I send an event. readline is the function in bash that is called whenever a command is issued: you start a bash shell anywhere on the machine, you type a command, readline receives it. It's a function, so I grab its argument and send it to the map; the map is received in Go here in this table, this table has a channel connected to it, and then you just pull the information out of the channel with the for loop here and print it out. So if I load this program on a Linux machine, I just see all the bash commands being issued. Another interesting IOVisor project is bpftrace. bpftrace is a higher-level language that has been written for the purpose of doing BPF tracing, and it's a bit easier because you don't have to deal with all of this. gobpf is still very good because it's very extensible — you can write your own Go program and have it interact with the BPF program — but if you just need a specialized tool, say you just want to extract all the writes and don't need to do anything else with them, you just want to see them, it's easier to use bpftrace.
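For comparison, the whole bashreadline flow just described — C program, perf output map, Go channel consumer — collapses into a single bpftrace invocation. This is the well-known bpftrace one-liner for the same job, shown here as an illustration rather than the exact command from the slides:

```
# Print every command typed in any bash shell on the machine:
# attach a uretprobe to bash's readline() and print its return value
bpftrace -e 'uretprobe:/bin/bash:readline { printf("%-6d %s\n", pid, str(retval)); }'
```

The uretprobe fires when readline() returns, so retval is the line the user just typed, and str() copies it out of the traced process.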
I keep mentioning IOVisor because it's the Linux Foundation BPF project; it's basically the main contributor to BPF right now. Other very notable contributors are me — I'm joking — and Cilium, who are building their SDN for Kubernetes, but who also basically wrote the documentation for BPF. If you search for BPF documentation on DuckDuckGo, or whatever search engine you use, you'll basically find them — well, that's what I found; I didn't find much other documentation. Other contributors are the folks at Kinvolk, we have one of them here, and all the others; I'll come back to that later. But IOVisor is the main contributor, and Brendan Gregg, who works at Netflix on performance, created this diagram showing all the points where you can interact with BPF using bpftrace. You can easily see what I kept talking about, the VFS: if you're interested in the file system, you can go as deep as you want, down to the device drivers, or the sockets; same thing for the scheduler and the virtual memory; you can hook syscalls and system libraries, all the way up to the applications. And you can connect to all of these using the tracing backends I mentioned: tracepoints, hardware events, profiling intervals, kprobes, uprobes, USDT. USDT is very interesting; I didn't add slides on it because I didn't want to make this too long, but it basically allows you to define static trace points in your own programs. One of the most notable projects using this is Node.js. Node.js already has USDTs defined so that, for example, you can get all the arguments to Node functions through USDT, because a Node program is not compiled to an object file — so how do you do these kinds of things with an interpreted language? Using USDT. And what about Kubernetes? Yeah, finally. It turns out there's not really anything for Kubernetes.
There are a bunch of projects around, but not many that are actively maintained by a group in a way that makes them broadly used in the community. With my friend Leonardo there, we had the idea to use bpftrace against Kubernetes clusters, and we created this plugin called kubectl-trace. Then Brendan Gregg noticed it, and we contributed it to IOVisor, so now it's iovisor/kubectl-trace. I'll have a demo shortly. A little disclaimer: this is not my laptop, which is very funny. I left my laptop at home, so this is Leonardo's laptop; if you see Leonardo around, it's not my fault, and if it doesn't work, it's Leonardo's fault. So, yes, of course, you can all punch Leonardo in the face later. You can find kubectl-trace under IOVisor. The philosophy behind bpftrace and kubectl-trace is that they are UNIX tools following the UNIX philosophy: you just run them, see the results, and they're gone. It's not the kind of tool you leave running forever or hook other things into. It's just: hey, I want to see all the files written, I want to see all the connections; you see them and they're gone. The usage is kubectl trace run, your bpftrace program, and then, since we're in Kubernetes, your node or your pod. So you can attach a program to a specific pod or to a node, and you can attach a TTY if you need it. An interesting use case is seeing, for example, the distribution of reads on a file. You're reading a file, right? But you never know how you're reading it. The first time I asked myself this question, my reaction was: what do you mean, how I read it? I just read it. But you don't just read a file: you can read it in different ways, in chunks of, say, one gigabyte or in chunks of two kilobytes, and that gives you different performance.
So if you're writing a database like we do, and you're reading or writing files a lot, this makes a huge difference: writing a cache to a file in chunks of one gigabyte versus one kilobyte. And this tool gives you insight into that — bpftrace, and by extension kubectl-trace on clusters. And since with kubectl-trace or bpftrace you can define your output format, you can write CSV, for example, and then pipe it — as I said, it follows the Unix philosophy — to another program, like VZData, that lets you plot your results in a different way than kubectl-trace does. bpftrace has functions to aggregate your results, like histograms; you can see the maps as sums, these kinds of things. But if you want to aggregate them yourself, you can do that in user space. And now it's demo time. Leo, you think it will work? Who knows? I tried my best to get comfortable with this machine. So I have this Kubernetes application called caturday that shows cats. Seems fine. It's basically a deployment with three containers running this caturday binary, which has an HTTP server that shows you cats in the browser, and cats in the terminal with the raw endpoint. So I just run it, and I see my cats. They're already running, and there's a service where I can reach the cats, over there. Then I have to use bash, because you'd think anything else would be crazy. This caturday program is implemented in this main Go file, and, yes, it has cats. What I'm interested in is getting out the value of this counterValue. counterValue is called — we'll see it with the cats now — every time someone hits an endpoint and cats are shown, whether it's the normal endpoint or the raw endpoint, so we see it one, two, three times. It's an atomic counter of how many times the cats have been shown.
And I just want to be able to extract that information without having to — well, you could monitor this by adding a Prometheus endpoint, or adding an InfluxDB client and sending the information out, or logging to a file and shipping it somewhere — or you can extract the information using eBPF, and that's what I want to do. So I have the cats here, I have a pod, and what I want to do is this one-liner here — it's big enough to read. I want to do kubectl trace run with a uretprobe. A uprobe doesn't give you the return value of the function; a uretprobe gives you the return value. kprobe and kretprobe, same thing. So I attach this uretprobe to my binary. I don't really know where the binary is in the container, so I just use its process ID and get the binary from /proc/<pid>/exe. I want to print the counter, so I get the value from this main.counterValue. And the pod is this one — one of the three, this one — and I just use it here. I start this thing. Once it has started, I can do kubectl trace get. Oh, that's poor. Oh, maybe it's better to start it here; I didn't have the variable for Kubernetes set. And, okay, it started, saying: hey, I'm waiting for you to send a Ctrl-C, because that signal is used to tell bpftrace to dump out the maps. So I can do kubectl trace get — not my keyboard — in the namespace. The experience is very similar to the normal kubectl commands. I see that I have this kubectl-trace job here and that it's attached. Now I just want to make a call to that cats endpoint that is in that namespace. And, okay, this is the HTML endpoint, and I just want the raw endpoint. So you see that I have this counter, with three cats here, showing four, five. And I have this number here, too. Every time I call the cats, it gives me the counter increasing, extracted by my bpftrace program, started from my laptop — and the cluster could be anywhere. So it's a very good helper to extract information this way.
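Reconstructed from memory, the one-liner looks roughly like this. Treat the details as a sketch: the pod name is a placeholder, and $container_pid is the helper that, to my knowledge, kubectl-trace provides when targeting a pod so the probe can find the container's binary through /proc:

```
# Attach a uretprobe to main.counterValue inside the pod's binary
# and print the counter every time the function returns
kubectl trace run pod/caturday-xxxxx -e '
  uretprobe:/proc/$container_pid/exe:"main.counterValue" {
    printf("%d\n", retval);
  }'
```

The probe path goes through /proc/<pid>/exe exactly because, as said above, you don't need to know where the binary lives inside the container image.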
Other one-liners I prepared are like this one, on the same node. In this case, on another node, I want to see how many times the sys_enter_* wildcard syscall tracepoints have been called. I'm creating a map here called @[probe] — the brackets create a map — and the count() function counts how many times each probe has been hit. Same thing: I wait for it to run, attaching some 300 probes, because the syscall tracepoints are that many and they fire a lot, and then I just print the results. So sys_enter_socket: 435 times in one second. Right? The next one is counting vfs_writes. This is what I said before: every time a file is written, it goes through the VFS in the kernel. vfs_write here — well, this is not C, but I can include Linux headers in bpftrace using the same notation, and I can cast to structs using the same notation, which makes it easier to extract the information. So I extract the file information from the file descriptor, and then I create a map keyed by file name with the sum of bytes that have been written on this node, and I attach it. Then I just go browse the Internet and do some things, just to create some traffic — the Internet works on this machine, and this machine is under control, and this is not my GitHub, because it's Leonardo's GitHub. And I hit Ctrl-C here, and I see that there's some stuff going on. I see TCP v6, because the FOSDEM network is IPv6, so it's not TCP v4. And I see that Firefox is doing a lot of things with SQLite and everything. Another interesting thing is seeing the same data as a histogram. So let's browse around again, open caturday, whatever, and Ctrl-C. In this case I'm seeing not just the bytes written, but also the distribution of the bytes written per chunk. So for this cookies SQLite file, I see that writes are between 16 and 32 bytes, or between 32 and 64 kilobytes.
While if I go to TCP, it becomes clear that the TCP stack has been implemented correctly, because it ranges from 16 bytes to 8 kilobytes, right? So if you're implementing your own TCP, you can debug it with this. Who doesn't do that? And it turns out that resources for eBPF are starting to pop up on the Internet, but there isn't really a book for it yet. So David Calavera and Jessie are writing one, and it's coming out from O'Reilly sometime this year or next year. It's not public yet, but the book is coming. And as I said, there have been contributions from IOVisor, Cilium, and other entities. Linux security modules use this BPF thing too, and there's a slide with references for you, so if you want, take a photo and check the references yourself — and that's everything. I bought a nice domain, it's called bpf.sh. Nice. If anyone wants an alias, just write me an email, I'll be happy to give it to you. It's just because this community wants more people looking at BPF, and having the email address puts you in a position where you have to do that, right? I'm joking. And if there are any questions — the mic is shut down. Do you want to use mine? I think I can take a quick question. Do you have any questions? You would have to speak out loud from the floor here; unfortunately, the microphone is... I would have to repeat the question for the recording, if I understand it. I don't hear it. You can come up if you want, I don't mind sharing my thing here. I don't hear you — probably my ears aren't working. But it's not working. Tell me. Well, now you know the question. My question is: when you run kubectl trace, you target the node, right? Is it possible to filter for specific containers or for specific pods? Yeah, I did that in the demo. Yeah, sorry, maybe it wasn't clear; I can show you. So the question was whether it's possible to target a specific pod, because yes, it is.
Like in this case here, I was using pod/name-of-the-pod instead of the node. It turns out this support is not very polished yet, because we have to make some changes to bpftrace to support PID namespaces. It works, but we want to make it better. And there's actually support for cgroup v2 in bpftrace that you can use with kubectl-trace, but cgroup v2 is not the default in Kubernetes, so I didn't make an example of it. Alban there is giving a talk this afternoon that shows the cgroup support very well. So, yeah, we found a microphone. Nice. Can you hear me? Any other question? Thank you for your presentation, really nice. I'm really interested in this, but I'm also scared. Is there any chance I can shoot myself in the foot? Say I want to test this on a server, and then I do something stupid, I run out of bandwidth or CPU and I cannot connect to the box anymore. Is it possible for me to...? It's possible with BPF in general. For example, the only case where that happened to me, where I basically locked myself out of a machine, was with XDP, because I was mangling packets, so I just dropped all the packets — including my SSH packets. You can do the same thing with iptables: it can happen that you create a loop with packets going back to the same interface, and you lock yourself out. But BPF programs are not persistent in that way — you'd have to load them again — so you just restart the machine. And bpftrace doesn't allow you to do that kind of thing, so not with bpftrace, but with BPF in general, yes. Next question. We have two. Great presentation. I think these are very useful tools — I'm mostly a Kubernetes administrator, so I will use some of your tips — but my question is: have you used Sysdig, and could you compare this solution with Sysdig? Well, yes, I use Sysdig too. Sysdig runs as a kernel module.
That's one of the problems: personally, I don't like to run kernel modules on my Kubernetes machines, which is why I started exploring this. I think Sysdig does the presentation of the data it extracts very well, but I think that data can also be extracted with BPF programs, so it would be cool if there was a BPF tool that extracts the same information as Sysdig — this microphone is driving me crazy — but with that nice interface. That's what we need. That's the same reason I did this: I just wanted to have something that lets me do stuff without having to SSH into my machines and run a program there, also because with Kubernetes I don't really know where things are running. So, yes, I like it very much. I'll pass the microphone back. Thanks for the question. We have a question over here. Thank you for the presentation. Seccomp is still using classic BPF, not eBPF. Do you know if and when eBPF support will be extended to seccomp? That's a tough question. I've been reading on the LKML recently that it's coming. I don't think it's coming this year; honestly, I don't think the progress is measurable in a way that you can say, hey, it's arriving in January. That's all I know — I'm not involved in that. Any further question? OK, thanks for the question, and thanks for clarifying that classic BPF point. So, can you aggregate data from around the cluster, or do you need to target a specific node or a specific container?
Not yet. With kubectl-trace, not yet, because bpftrace has been created specifically for single nodes, so you cannot. But it can be done: we are working on having bpftrace expose a different output format so you can aggregate from multiple nodes. That's in my personal plans, at least, because I'll probably be doing it, and if anyone wants to contribute, I'll be super happy. The work is mostly in bpftrace, because kubectl-trace can already run the program on multiple machines, so it works right now with streams, but it doesn't work with aggregation, because aggregations are done on each machine directly. Hi, thanks for the nice presentation. A more general BPF question, but maybe it also applies to Kubernetes: how safe is it to run a tracing program on a high-load production server to do real analysis of what's happening? Is it going to affect performance, and if yes, are there any numbers one can expect? Of course it's probably difficult to answer, depending on what the program is doing; the question is how safe it is and whether you can actually run this in production. Thanks for the question — it's a thing I missed saying, so thanks. What you usually do to know what your programs are doing is use a debugger, or use strace. Those tools put interrupts, or read DWARF symbols, or whatever, into your program, so they need to interact with your program in some way, while BPF doesn't. BPF goes to the kernel; the kernel already receives all the information from the program, and you just attach yourself on the side and get the information the kernel already has. So the performance impact is not on the program itself; it's more in the kernel, because the kernel needs to process your code, but it's very low. You would never attach strace to PID 1, for example, while you can do that safely with a BPF program.
Thanks for the question — that was a thing I had missed saying. We have another question over here... any other question? Okay, then thanks for the talk. Thank you.