Hello, everyone. My name is Alfonso Acosta. I'm a software engineer at Weaveworks, and today I'm going to be talking about high-performance Linux monitoring with eBPF. But before I start, I would like to get a feeling for what kind of audience I'm delivering this talk to. Who knows what tcpdump is? Raise your hands. Everyone, just like I expected. Who knows what BPF is? Almost everyone. Oh, we have a high level here. Who knows how to code in C? Everyone as well. Okay, great. And who knew what Weaveworks is before coming here today? Okay, all right. So as I said, I'm Alfonso Acosta, a software engineer at Weaveworks. We're a startup whose ultimate goal is to simplify the development and operation of microservice-oriented applications, which are typically containerized. We have a software-as-a-service product called Weave Cloud, and among the services in our cloud offering there's something called Weave Scope, which, among other things, visualizes network communication between containers. In order to do that, we need to track TCP connections in real time. So in this talk, I'm going to introduce BPF, or what we today call classic BPF. Then I'm going to expand on eBPF and what it brings to us, in particular for our Weaveworks use case. And finally, I'm going to talk about how we've incorporated eBPF into Weave Scope to do high-performance monitoring of TCP connections. Even though it's a short talk, I would like this to be interactive, so please, if at any point you have a question, ask it right away; they'll be handing over a microphone. To start with, I'm going to ask you a question: can anybody tell me what this does? Everybody has used tcpdump, right? So what does it do? It dumps HTTP traffic. And furthermore, can you tell me how this works on Linux? Anybody? Yeah? Can you speak into the microphone, please?
Well, it should try to match traffic. I don't know. All right, but yeah, somebody's going to elaborate on that; I'll give you the details. tcpdump opens an AF_PACKET socket and gets raw data from it. Then it pushes down a BPF program, a small program that says: look at this offset, check the protocol number, then go to the next offset. Okay, cool. Yeah, that's how it works; he knows all the details. Basically, instead of forwarding every single packet from kernel space to user space and filtering there, the filtering happens in the kernel, through something called BPF. Otherwise it would be super inefficient, because you would need to forward every single packet to user space and filter it there. In 1992, the Berkeley Software Distribution, BSD Unix, introduced something in a paper which is really, really interesting and which I encourage you to read: the Berkeley Packet Filter. Its goal is to filter packets in the kernel using a virtual machine: a bytecode program, executed by that virtual machine, runs on every packet of a given network interface you choose and decides whether the packet passes the filter or not. That way you don't need to pass every single packet to user space and decide the filtering there, so it's really efficient. Just for the sake of showing an example, let's do that here. If I do this, I'm executing tcpdump. And... I don't have a network connection, so I'm not going to be able to show you that. Anyway, I'll do that later. In practice, this is how it works: tcpdump uses a library called libpcap, which works on Unix and also on Windows, though on Windows it doesn't use BPF. You pass it a pcap filter expression which says, hey, I want packets which are protocol TCP and destination port 80. libpcap compiles that expression into BPF bytecode.
It's injected into the kernel, and the kernel starts running the virtual machine on every single packet; based on the return value of that bytecode, it filters the packet and passes it back to tcpdump, and tcpdump dissects it and shows it to you. And this is actually the assembly language of BPF. It's a limited virtual machine, but still a virtual machine. They were requesting a bigger font, so let's make it bigger. Is it big enough now? Almost. Is it big enough? All right. Maybe you didn't know this, but if you pass the -d flag to tcpdump, it will output the assembly language of the BPF filter which is passed to the kernel. And this is it. We're not going to go through the code, but basically: the virtual machine has different addressing modes, and it has a scratch memory region, but the main memory region, if you compare it to a normal CPU model, is mapped to the packet on every single execution of the VM. I created a couple of sample programs which do exactly what that tcpdump execution we saw does, but coding the BPF filter by hand, so that we can see how it works. Let's look, for example, at... actually, I have it locally. Yes. What we have here is: we load the byte at the Ethernet header length plus 9, which gives us the protocol field of the IP packet. Then we compare it against the protocol we want, which is TCP. If it's TCP, we continue checking; if not, we go to the reject section, and so on and so forth. And the same kind of program works for OS X. In that case we use slightly more sophisticated addressing modes, but it's basically the same filter.
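For readers following along, here is a sketch of what hand-coding such a filter looks like. This is not the exact program from the talk: it assembles a minimal "IPv4 and TCP" classic-BPF filter by packing instructions in the 8-byte `sock_filter` layout from `linux/filter.h` (16-bit opcode, two 8-bit jump offsets, 32-bit constant). The opcode values are the standard classic-BPF encodings; attaching the result would be done via `setsockopt(SOL_SOCKET, SO_ATTACH_FILTER, ...)` on a raw socket, which needs root and is left out here.

```python
import struct

# Classic BPF opcodes (values from linux/filter.h)
LDH_ABS = 0x28   # BPF_LD  | BPF_H   | BPF_ABS: load half-word at absolute offset
LDB_ABS = 0x30   # BPF_LD  | BPF_B   | BPF_ABS: load byte at absolute offset
JEQ_K   = 0x15   # BPF_JMP | BPF_JEQ | BPF_K:   jump if accumulator == constant
RET_K   = 0x06   # BPF_RET | BPF_K:             return constant (bytes kept; 0 = drop)

# "ip and tcp" over Ethernet: EtherType at offset 12, IP protocol at 14 + 9 = 23
FILTER = [
    (LDH_ABS, 0, 0, 12),      # ldh [12]      ; load EtherType
    (JEQ_K,   0, 3, 0x0800),  # jeq #0x800    ; IPv4? if not, jump to reject
    (LDB_ABS, 0, 0, 23),      # ldb [23]      ; load IP protocol field
    (JEQ_K,   0, 1, 6),       # jeq #6        ; TCP? if not, jump to reject
    (RET_K,   0, 0, 0xFFFF),  # ret #65535    ; accept (keep up to 64 KiB)
    (RET_K,   0, 0, 0),       # ret #0        ; reject
]

def assemble(prog):
    # struct sock_filter { __u16 code; __u8 jt; __u8 jf; __u32 k; }
    return b"".join(struct.pack("=HBBI", *insn) for insn in prog)
```

Compare this with the `tcpdump -d` output mentioned above: the shape (load, compare, jump, accept/reject) is the same.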
Actually, out of curiosity, just as an anecdote, I don't know if you can see it here, but the VM has a really, really specialized addressing mode: it gets one byte from the packet, takes the lower 4 bits, and multiplies them by 4. Can anybody tell me what the purpose of that is, knowing what you know about IP, since everybody knew about tcpdump and networking protocols? Any guesses? Let's let him answer. Yeah, maybe what? Okay. I think somebody in the first row knows the answer. Uh-huh. Right, that's the right answer. A well-known field in the IP header is the header length (IHL), and it's expressed in 32-bit words. So if you point this addressing mode at the offset of that field, it takes the lower 4 bits, the 4 bits the header length is expressed in, and multiplies them by 4, which gives you the length of the IP header in bytes. Just so you know how specialized this is: it's completely specific to network filtering. Okay. So now we know a little bit about what BPF is, and how, maybe unknowingly, you were using it when filtering packets and investigating what was happening on your network interface. But now let's talk about eBPF. eBPF stands for extended Berkeley Packet Filter, and actually, since it was introduced, people refer to it simply as BPF, and to what we saw before, the classic tcpdump use, as cBPF or classic BPF. eBPF comes with a much richer virtual machine: it has ten 64-bit registers, whereas in classic BPF we only have an accumulator and an index register. Thanks to that more powerful virtual machine, which maps more closely to real CPUs, it's easier to have a JIT, a just-in-time compiler, to native CPU code. And most importantly, nowadays in the Linux kernel there's no classic BPF virtual machine interpreter anymore: classic BPF is transpiled, you can call it that way, to eBPF. So nowadays, even with tcpdump, you will be using eBPF, again maybe unknowingly.
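To make the anecdote concrete, here is the arithmetic that the specialized `4*([k]&0xf)` addressing mode performs, written out as a tiny Python helper (the function name is made up for illustration):

```python
def ipv4_header_len(first_byte):
    """Bytes occupied by an IPv4 header, given the header's first byte.

    The low 4 bits of that byte are the IHL field, counted in 32-bit
    words, so multiplying by 4 yields bytes -- exactly what classic
    BPF's 4*([k]&0xf) addressing mode computes in one instruction.
    """
    return (first_byte & 0x0F) * 4
```

For the common case of no IP options the first header byte is 0x45 (version 4, IHL 5), giving the familiar 20-byte header; the maximum, IHL 15, gives 60 bytes.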
But the most important feature of eBPF is not that it has a more powerful virtual machine; it's that it gives us extra capabilities beyond network filtering. One of them, which is what I'm going to be talking about today, is dynamic tracing. And it offers other features, like maps and events, to let you communicate with your eBPF program in a more efficient manner, lowering even further the traffic between user space and kernel space. Also, it's safe, in the sense that there's static analysis of your eBPF program in the kernel, by what they call the verifier, so that your eBPF programs cannot crash the kernel: the memory being accessed is checked, and loops are not allowed, which is a pretty big limiting factor, but it ensures the kernel won't crash when it executes an eBPF program. And eBPF, apart from having a more powerful virtual machine, makes use of existing kernel technologies. One of them is kprobes, which operate in a similar way to what a debugger does in user space: basically, a kprobe injects some code at almost any point in the kernel. It replaces an instruction with a jump, which jumps to your probe, executes whatever you want to execute, inspecting things in the kernel and whatnot, then restores the context and continues execution. Before eBPF, and actually still today, because kprobes can be used independently of eBPF, you would typically use them from a kernel module, since you need to work with them in kernel space. That means they're unsafe, in the sense that you can crash the kernel with them. And they're architecture dependent, because the code of the kprobe needs to be written for whatever CPU instruction set the kernel is running on. And they're pretty fragile, because you inject a kprobe at a symbol plus an offset, and different kernel versions will have different symbols, and data structures will be slightly different, so you need to be super careful about that.
A patch, or workaround, to try to make up for that drawback is to use something called tracepoints, which are fixed injection points in the kernel. But I think that goes against the flexibility of doing dynamic tracing: using tracepoints is basically static tracing, because they're fixed. So when using kprobes with eBPF, which allows us to do dynamic tracing, instead of injecting native CPU code from a kernel module, which can break and crash the kernel, what we do is inject a piece of eBPF code, executed by the eBPF virtual machine. It's safe in the sense that it won't crash the kernel, assuming the static analysis applied to that eBPF bytecode is correct. Of course, it's not safe from a security perspective: you may be able to reveal details about the kernel which shouldn't be available to every user. But the nice thing is that it's architecture independent: if you write your eBPF program on, let's say, Intel 64-bit, it should run on ARM as well, because it's executed by a virtual machine. Unfortunately, it will still be fragile. Why? Because kprobes are kprobes: you still inject at a kernel symbol plus an offset, which means that even though the eBPF bytecode is architecture independent, it very much depends on the kernel version and the layout of its data structures. eBPF also comes with an extra feature: maps. We mentioned before how we use BPF filters to transfer only the packets we are interested in from kernel to user space. eBPF introduces maps, which let your eBPF program build a summary of what's happening in the kernel and insert it into a map. For instance, say you want a histogram of network latency: you only transfer it from time to time to user space, to be printed, for instance. That way, you don't need to transfer every single event and build the histogram in user space.
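To illustrate the kind of summary a map enables: an eBPF program can bucket each observed latency into a power-of-two histogram held in a kernel map, and user space only reads the finished histogram. The real aggregation happens inside the eBPF program; this Python sketch just mimics the bucketing in user space to show the idea:

```python
from collections import Counter

def log2_bucket(value):
    # Index of the highest set bit: values 512..1023 land in bucket 10,
    # mirroring the power-of-two bucketing common in eBPF latency histograms.
    return value.bit_length()

def summarize(latencies_us):
    # In real eBPF this Counter would be a kernel-resident map that user
    # space reads periodically, instead of receiving every single event.
    return dict(Counter(log2_bucket(v) for v in latencies_us))
```

The saving is in what crosses the kernel/user-space boundary: a handful of bucket counts, rather than one event per packet or per connection.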
eBPF also makes use of an existing kernel feature, perf events, which lets your user-space program be informed about things happening in the kernel without needing to do any polling. And eBPF comes with a compiler toolkit called BCC, which exposes Python bindings and simplifies the development of eBPF programs by quite a lot. And here's where... I'm just going to spend a couple of minutes talking about how we're using eBPF at Weaveworks. In Weave Scope, we need to track all TCP connections in real time, meaning that if a process A connects to a process B, we represent it with an edge; if that connection disappears, we remove the edge. And we need to do that for all the processes and containers being monitored in the cluster. Before eBPF, we were doing this by polling the /proc filesystem, which is super racy and CPU intensive: you need to periodically walk through /proc, again and again, and that is super, super expensive. Plus, the /proc filesystem is not made to provide you with that information: it's spread across different files, so you can't read them atomically, and so on. So you cannot catch short-lived connections. There's conntrack, on the other hand, which will tell you about connections but won't tell you the PIDs of the processes involved, so you cannot draw the graph we need to draw. We started with a BCC-based tracker, but that gave us quite a few problems, because it has runtime dependencies: it comes with an LLVM backend and it depends on the kernel headers. And we also had the fragility problem with kprobes, which depend on the kernel version, with data structures changing all the time. So how we solved this: in cooperation with Kinvolk, who are organizing this conference, we created gobpf, Go bindings for eBPF. The runtime dependency on Python was a problem for us, and Scope is coded in Go, so this was a really good fit. And we implemented an offset-guessing TCP tracker.
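To see why polling /proc is painful, here is a sketch of what parsing a single /proc/net/tcp entry involves (the format is described in proc(5); on little-endian machines the IPv4 addresses appear as byte-reversed hex). Every poll has to re-read and re-parse every line, cross-reference PIDs via other files, and any connection that opens and closes between two polls is simply missed:

```python
# A representative /proc/net/tcp line (trailing columns abbreviated):
# local address 127.0.0.1:8080, remote 0.0.0.0:0, state 0x0A (LISTEN)
SAMPLE = ("   0: 0100007F:1F90 00000000:0000 0A 00000000:00000000 "
          "00:00000000 00000000  1000        0 12345 1")

def decode_endpoint(addr_port):
    # Addresses are hex in host byte order; reverse bytes on little-endian
    addr_hex, port_hex = addr_port.split(":")
    ip = ".".join(str(b) for b in reversed(bytes.fromhex(addr_hex)))
    return ip, int(port_hex, 16)

def parse_proc_tcp_line(line):
    fields = line.split()
    # fields[1] is the local endpoint, fields[2] the remote one
    return decode_endpoint(fields[1]), decode_endpoint(fields[2])
```

And this is only one file: mapping each socket back to a PID means also walking every /proc/<pid>/fd directory, which is where the cost and the races really add up.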
Basically, what we do is, as an initial phase in Scope, we make connections to known ports on endpoints we control, so we know what the fields of the socket data structure in the kernel should look like, and we evaluate the information we get at different offsets. That way we adjust dynamically to the layout of the socket structure in the kernel, without depending on headers and without depending on kernel versions. It's a bit of a complicated guessing game, but it lets us be version independent. And this is where we sit in the ecosystem in terms of using eBPF: it's much, much simpler to use, but of course it gives us far fewer features. And that was me. We have time for maybe one question, yeah. Any questions? Yeah, one question: is gobpf a compiler? gobpf, I believe, invokes the compiler. I think Alban is here; he knows more about it and can answer that question for you. Okay, so you can choose whether to use a compiler or provide the bytecode yourself. In our case, we're providing bytecode, because we don't want to depend on the compiler at runtime. Any other questions? Okay, thank you.
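As a footnote on the offset-guessing technique described above: since Scope controls one endpoint of the probe connection, it knows exactly which values (ports, addresses) must appear somewhere in the kernel's socket struct, and scanning a snapshot of that struct for those known values reveals the field offsets without any kernel headers. This Python sketch, with a made-up byte blob standing in for `struct sock`, shows only the scan itself, not the eBPF side that captures the snapshot:

```python
def guess_offset(struct_snapshot, known_value):
    """Return the first offset where a value we planted ourselves appears."""
    n = len(known_value)
    for off in range(len(struct_snapshot) - n + 1):
        if struct_snapshot[off:off + n] == known_value:
            return off
    return None
```

In practice several fields are guessed this way, and an offset is only trusted once it shows up consistently across multiple probe connections, since a value could coincidentally appear elsewhere in the struct.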