Well, hi everyone. Welcome to my talk. I'm Aditi, a software engineer at Isovalent, and this talk features joint work with Martynas. A bit of context on why debugging Kubernetes networking is relevant for us: Martynas and I work on Cilium, an open source CNI powered by eBPF, and we work with one of the most sophisticated sets of users to help them network and secure their Kubernetes clusters. So let's get started. In this talk, I'm going to share our debugging experiences and present a new tool that we have developed using eBPF.

Let's take a common Kubernetes cluster where a request is coming in to a Kubernetes node and being delivered to a service pod. The request traffic first gets processed at the NIC, then routed through the Linux kernel networking stack, and then it traverses from the host network namespace to the pod network namespace via a veth device pair. Whenever there are network connectivity issues, we start with the control plane: we check that all the configurations are correct, our services are deployed, the service pod is up and running, and we always make sure that it's not the DNS. But today I want to draw attention to a key component that is often treated as a black box, and that's Linux kernel networking.

Debugging Linux kernel networking is hard, so let's zoom in to see why that is. This is a detailed flow diagram of Linux kernel networking internals, and as you can see, there is an overwhelming number of packet processing functions. When you send a single packet, it takes many, many hops until it reaches its destination. Sure, there are packet counters and stats that you can observe, but these are not enough to get to the root cause, because kernel state can't be observed in real time in an easy manner. We have this internal joke that tracking a packet in the Linux kernel is like finding Waldo. One can easily feel lost falling down the rabbit hole of Linux kernel networking.

Next we are going to look at some of our go-to tools. One of our favorite tools is tcpdump, and you always reach for it whenever there are network connectivity issues. I agree that it's a good tool to start your debug routine with. But the problem with tcpdump is that it only gives you high-level information, because the tcpdump tap points are at the periphery of the Linux kernel networking stack. In real life, the situation is much more complex: packets can be dropped in the NIC, or in the stack, or in the pod network namespace, and so on and so forth.

Next up, traditional logging methods. Whenever developers don't have access to a good debugger, they rely on logging-based debugging. The kernel exposes the printk function, which you can use to add debug statements in kernel code. But that requires recompiling the kernel and, in many cases, a reboot, so it's not a viable option in production environments. And if you're not careful with your debug statements, they can cause kernel panics. In short, it makes for very slow debugging cycles, because it requires many, many iterations. Let's say you've ruled out DNS as the issue and you now want to trace packets going to the kube-apiserver: you can't easily do that.
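To make that concrete, this is the kind of one-off debug statement you would have to patch into the kernel source itself; a purely illustrative sketch (the placement and message are made up for this example), where every tweak means another rebuild and reboot:

    /* Hypothetical one-line patch inside net/ipv4/ip_input.c, in
     * ip_local_deliver(): log every locally delivered packet.
     * Changing this line means recompiling, and likely rebooting,
     * the kernel. */
    printk(KERN_DEBUG "ip_local_deliver: skb=%p len=%u dev=%s\n",
           skb, skb->len, skb->dev ? skb->dev->name : "?");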
So moving on to some generic tracing tools; a couple of examples are perf and bpftrace. perf is a performance measurement tool that also lets you trace calls to some of the Linux kernel network functions. If you have used perf before, then you can identify the snippet that I've added on the slide. In this case, I'm using perf record to record all the calls to the kfree_skb function, which is invoked whenever a packet is dropped in the kernel, and perf script is then used to dump the traces collected by perf record. As you can see, the output is very limited, with very limited filtering capabilities. If you're trying to debug a DNS issue, you can't specify, "hey, I want to trace all the packets going to port 53." There is no information about, for example, source or destination IP addresses or ports. And as I mentioned before, whenever you use perf record you have to specify which function you're tracing calls to, and in many cases you don't have that information when you're starting out with debugging an issue.

So we approached this debugging problem in a very different manner. Not all network connectivity issues involve packet drops. If we want to introspect kernel network state in a fine-grained manner, is there a way to get a list of all packet processing functions? Next, we want callbacks whenever these functions are executed, so that we can analyze kernel state in more detail. But more importantly, we want to filter these callbacks to only the traffic that's relevant for us. This is the wishlist that Martynas and I came up with, and it brings me to the second part of my presentation, where we are going to find answers to these questions using eBPF.

eBPF has captured the imagination of many in recent years, and companies like Facebook, Netflix, and Isovalent are using eBPF to solve a spectrum of use cases. To set the context for our new tool, let's look at what eBPF is. eBPF is an in-kernel virtual machine that safely executes native code in response to certain events. It's highly programmable, and it's performant because the code is executed directly in the kernel. For this reason, it's been touted as the JavaScript of the kernel. Let's unpack this with an end-to-end workflow next.

Here I have a simple eBPF program where I've attached a kprobe to one of the network functions in the Linux kernel. Kprobes are a debug mechanism in the Linux kernel that lets your eBPF program execute whenever a given kernel instruction is executed. So when you attach your eBPF program to a Linux kernel function via a kprobe, the program runs whenever that function is called. In this case, as you can see, I'm getting a callback, or rather I hope I'll get a callback, whenever the ip_local_deliver function is executed. The skb (struct sk_buff) is the Linux kernel's representation of a packet, so in this program I'm going to parse that skb and dump some of its fields. Let's look at how we can compile it next.
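For readers of the transcript, the program on the slide looks roughly like this. This is a minimal libbpf-style sketch, not the exact slide code: the map and struct names are mine, and it assumes a vmlinux.h generated from the running kernel's BTF:

    /* Minimal kprobe program sketch: notify user space whenever
     * ip_local_deliver() processes a packet. Names are illustrative. */
    #include "vmlinux.h"            /* kernel types, generated from BTF */
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>
    #include <bpf/bpf_core_read.h>

    /* Perf event map: the channel the kernel side uses to relay
     * notifications to user space. */
    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
    } events SEC(".maps");

    struct event {
        __u64 skb_addr;   /* kernel address of the struct sk_buff */
        __u32 len;        /* skb->len */
    };

    SEC("kprobe/ip_local_deliver")
    int BPF_KPROBE(on_ip_local_deliver, struct sk_buff *skb)
    {
        struct event ev = {};

        ev.skb_addr = (__u64)skb;
        ev.len = BPF_CORE_READ(skb, len);  /* verifier-checked field read */

        /* Write the notification to the map just before returning. */
        bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                              &ev, sizeof(ev));
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";

Compiled with clang targeting BPF, this is the shape of program the next step feeds to the loader.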
This program is compiled using Clang, which outputs an ELF binary containing eBPF bytecode. That binary is next fed to an eBPF loader. The loader parses the ELF file and sets the context for the program, which includes setting its type; in this case the program type is BPF_PROG_TYPE_KPROBE, and this type decides what kernel state my program has access to. Next, the loader triggers loading and verification of the program. On the previous slide, I mentioned that the kernel ensures that BPF programs run safely, and the eBPF verifier is tasked with ensuring exactly that. The verifier checks that your program doesn't have null pointer dereferences, that it's not trying to access any out-of-bounds memory, and that it terminates. If you've done a fair amount of BPF programming, then I'm sure you have interesting stories to tell about fighting the verifier. Once the verifier approves your program, it's JIT-compiled to native code.

I mentioned earlier that eBPF follows an event-driven model. So what does that mean in this case? In my program, I've attached a kprobe to the ip_local_deliver function, which delivers packets to a local destination. Let's say a packet arrives on the eth0 interface of your Kubernetes node and is supposed to be delivered to a local pod. The packet will be processed by ip_local_deliver, which reassembles IP fragments and hands the packet off to the next layer, the transport layer. At that point it generates an event, and my BPF program is executed. And if you look at the program, just before it returns, it writes a notification to a BPF map. A BPF map is a shared data structure between kernel and user space, and that's how the kernel relays this information to user space.

So great. Just as we traced callbacks to ip_local_deliver, can we get a list of all the kernel's packet processing functions and just keep a hard-coded list? Not exactly, because kernel function signatures can change across kernel versions, which is why the kprobe API is considered unstable. So how do we reliably get debug information about kernel function signatures? That's where BTF comes into play. BTF is short for BPF Type Format, and it's a debug format that stores information like function signatures, the data structures defined in the kernel, and so on. The cool thing about BTF is that it's super compact, so in recent kernels it's packaged by default: the kernel exposes its own debug information, function signatures and data structure types, via a sysfs interface (/sys/kernel/btf/vmlinux).

To get a sense of what BTF information looks like, I've dumped the BTF for a simple function that accesses the sk_buff structure. On the right, you can see that it gives you information about the various types. Starting at the top, it tells you that there is a pointer to the structure sk_buff, which members the sk_buff structure has, and what their respective offsets are. But one important type that I want to highlight here is FUNC_PROTO. A FUNC_PROTO defines a function signature: what kind of arguments a function accepts, and what its return type is.

With this, let's revisit our wishlist. We said we wanted a reliable way to get the list of all the packet processing functions in the kernel, and we can easily get that information from BTF. Next, we can attach kprobes to all of them; and just as kprobes let your BPF program execute when a function is called, kretprobes let your BPF program execute when the function the kretprobe is attached to returns. And more importantly, we can easily filter the callbacks with the help of BPF maps.
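As a rough illustration of that last wishlist item, here is how a filter map can gate the callbacks inside the kprobe program. This is a sketch under the same libbpf/CO-RE assumptions as the earlier snippet, not pwru's actual source; the struct and map names are mine:

    /* A single-entry map that user space programs with the traffic
     * it cares about (here just protocol + destination IPv4). */
    struct filter {
        __u32 daddr;   /* destination IPv4 address, network byte order */
        __u8  proto;   /* IPPROTO_TCP, IPPROTO_UDP, ... */
    };

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct filter);
    } filter_map SEC(".maps");

    static __always_inline bool skb_matches(struct sk_buff *skb)
    {
        __u32 zero = 0;
        struct filter *f = bpf_map_lookup_elem(&filter_map, &zero);
        if (!f)
            return true;   /* nothing programmed: trace everything */

        /* Locate the IP header at skb->head + skb->network_header. */
        void *head = (void *)BPF_CORE_READ(skb, head);
        __u16 nh_off = BPF_CORE_READ(skb, network_header);
        struct iphdr ip;
        if (bpf_probe_read_kernel(&ip, sizeof(ip), head + nh_off) < 0)
            return false;

        return ip.protocol == f->proto && ip.daddr == f->daddr;
    }

    /* In each kprobe handler: if (!skb_matches(skb)) return 0; */

Port filtering works the same way by parsing the transport header. The point is that the filtering happens in the kernel, so uninteresting packets never generate an event at all.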
So with this background, I'm really excited to present pwru. pwru is an eBPF-based Linux kernel debugger, and it's short for "packet, where are you?" Credit for the conceptualization of this tool goes to Martynas. Next, we are going to take a walkthrough of pwru's internals.

pwru consists of a user space agent written in Go, which interfaces with the user to collect filter information. Let's say I'm trying to debug traffic going to the kube-apiserver pod. I run pwru with filter parameters that set the protocol to TCP, the destination IP to the kube-apiserver pod IP, and the destination port to 443. pwru then programs this information into the filter map, which is a BPF map. It then uses the cilium/ebpf loader to collect the BTF information for the underlying kernel: in this case it reads the sysfs interface, searches all the kernel functions, then iterates over that list and filters it down to the functions that accept an skb as one of their parameters. Just a reminder: the skb is the packet representation in the Linux kernel. It then goes ahead and attaches our eBPF programs to all of these functions.

So in this particular case, I'm interested in getting callbacks from the Linux kernel whenever my packet goes through any of these network functions. In the previous example, we saw how my BPF program was executed whenever ip_local_deliver was called. So E0, for example, is the first event, generated when my packet is processed by ip_local_deliver. ip_local_deliver then hands the packet off to layer 4, TCP in this case, and that generates E1. The one thing I want to highlight here is that my BPF program filters all these callbacks based on the filter information it reads from the BPF map.

All right. At the beginning of the presentation, I mentioned that Linux kernel networking is treated as a black box because of the complexity involved. With the help of pwru, I hope I can convince you to give it a try and see how easy it is to debug Linux kernel networking issues. In the lower half of the clip, I'm running pwru with a set of filters: I'm filtering traffic destined to 1.1.1.1, destination port 80, with the protocol set to TCP. I've added an iptables rule here which drops packets to this destination and port, and whenever I run a curl request, pwru generates output traces and I know: OK, this is where my packet is getting dropped.

In the next few slides, we are going to look at real-world examples and how pwru helped us debug those scenarios. The first example: I have a Kubernetes cluster where I'm running a multi-homing setup. What does that mean? In this setup, a pod is assigned IP addresses from two different IP subnets. Let's do a request walkthrough from the right of the screen to the left, where a pod is trying to reach the kube-apiserver pod in my cluster. The pod makes a curl request where the source IP address is set to the pod IP address and the destination is the kubernetes service cluster IP. This request hits the iptables rule installed by kube-proxy, and the rule states that all traffic destined to the kubernetes service cluster IP needs to be DNAT-ed to the kube-apiserver pod IP address.
After the rule, we can see that the destination IP address has been translated from the cluster IP to the IP address of the kube-apiserver pod. This request reaches the destination Kubernetes node, is routed through the eth1 interface, and is then delivered to the kube-apiserver pod. Now the kube-apiserver pod sends a reply back. The source IP address is correctly set to the IP address of the kube-apiserver, and the destination IP address is the IP address that was received in the request packet. But this packet is routed through eth0, not eth1, and it's no surprise that the reply never reaches the source.

So let's see how pwru helps us debug this issue. Here, I'm running pwru with a bunch of filters: I'm filtering traffic going to destination port 443, which is where the kube-apiserver pod is listening, and I've set the protocol to TCP. I invite you to look at the output of pwru here; this is the trace of all the kernel functions that my traffic of interest is going through. nf_hook_slow is the kernel function that's executed to implement iptables rules, and we can see that the source IP address in this case is still set to the 192.168.x.x address, and not an IP address from the subnet that's used to connect to the kube-apiserver. That's why this packet gets dropped when it's sent out of the eth1 interface.

Next up, we frequently have MTU misconfiguration issues in our clusters. I have a simple setup here: a single-node Kubernetes cluster, where my UDP pod is getting traffic from an external entity. The size of the UDP packet is pretty large, and the sender has set the don't-fragment bit on this request. The request traverses the eth1 interface on the Kubernetes node, then the veth device pair, and is then delivered to the UDP pod. Based on the MTU information, we can see that there is a mismatch. So let's see what happens to the traffic. Again, I'm running pwru, but this time I've set the protocol to UDP and the destination port to 443, and I have a new flag here, output-meta, which prints a bunch of sk_buff members. As you can see, there is a mismatch between the packet length and the MTU configured on the ifindex; ifindex 18 refers to eth1, and 7 refers to the veth device. As a result, this packet will be dropped.
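Printing arbitrary sk_buff members like this safely is possible because recent kernels expose the bpf_snprintf_btf helper, which formats a kernel structure as a string using the kernel's own BTF description of the type (it comes up again in the acknowledgments). A minimal sketch of how a program can use it; the buffer size and function name are my choices, not pwru's:

    /* Dump a struct sk_buff into a human-readable string using the
     * kernel's BTF description of the type (available since ~5.10). */
    static char buf[2048];   /* global scratch buffer; fine for a sketch */

    static __always_inline void dump_skb(struct sk_buff *skb)
    {
        struct btf_ptr p = {
            .ptr = skb,
            .type_id = bpf_core_type_id_kernel(struct sk_buff),
        };

        /* Renders something like "(struct sk_buff){ .len = 1500, ... }" */
        bpf_snprintf_btf(buf, sizeof(buf), &p, sizeof(p), 0);
    }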
Lastly, there is a security mechanism in the Linux kernel called the reverse path filter. It essentially ensures that the source address set on a received packet is routable via the same interface the packet came in on; this is to ensure that there are no packets with spoofed source IP addresses. In this case, I have an incoming request coming to my pod. Let's say this is my service pod, and the request is DNAT-ed to my pod, so the destination is set to the pod IP address. When this request is received on the eth1 interface of my Kubernetes node, there is an rp_filter configured on this interface by default, and it's set to strict mode. This request will be dropped, because if you look at the IP route table here, it says that all the traffic destined to the 10.0.0.0/24 subnet needs to be received on the veth0 interface, but in this case we are receiving it on eth1. So let's see what information pwru gives us here. This time, I've set the destination IP filter to my pod IP, and we can see that the packet is being dropped. Let's see why that is. Just before the packet is dropped, we can see that there is a callback from the fib_validate_source function. This kernel function executes the reverse path filter logic: it checks what the source IP address on the packet is, and whether it was received on the interface it's expected on.

So here are the pwru highlights. pwru is an eBPF-based open source tool to debug kernel networking, and it does so by abstracting away kernel networking details. More importantly, pwru exposes advanced filtering capabilities, and it surfaces packet-level metadata so that you can introspect the kernel in a fine-grained manner. And it's portable across kernel versions.

Some acknowledgments. pwru builds on eBPF kernel functionality and on kprobes, so thanks to the BPF kernel developers and to Steven Rostedt for kprobes. We are using the bpf_snprintf_btf function exposed by the kernel so that we can print sk_buff members in a safe manner. pwru uses the cilium/ebpf Go library as its eBPF loader to load our eBPF programs. And lastly, we recently discovered a tool called ipftrace2, which has some similar functionality to pwru and was released before pwru, so we are acknowledging it here.

So yeah, thanks for your time, and I'll open it up for questions.