Hi, my name's Liz Rice. You may know me through my role with the CNCF, where I chair the Technical Oversight Committee, and I'm also Chief Open Source Officer at Isovalent, a company that's very involved in eBPF technology and the original creator of the Cilium project. I've done some talks before introducing what eBPF is and how to write your first eBPF programs, and today I want to take that down the path of networking and introduce you to some of the ways you can use eBPF in networking functionality.

You probably know that eBPF allows us to run custom code in the kernel. We write programs, load them into the kernel, and have them triggered in response to events, and there's a huge range of different events that can cause an eBPF program to run. You'll probably have seen examples involving system calls: for example, a kprobe, which fires at the entry point of a kernel function, such as the function that runs when a particular system call is made by a user space application. But system calls are really just the beginning. There are many, many other places in a Linux machine where we can attach eBPF programs.

The typical hello world program I would normally write for an illustration attaches to a system call like execve. execve is used to run a new executable: every time you type a program name into your shell and hit return, execve runs that program. We might write a hello world example that attaches to the entry point, the kprobe, of that execve function, and it would generate some tracing every time someone ran a new program. That tracing could include some contextual information about the process that triggered that particular system call.

But kprobes are really just one of many different types of eBPF program. There's a long list of eBPF program types that we can attach to different types of events, so eBPF is not just about tracing system calls. System calls are the interface between user space and the kernel, but there's a huge wealth of other places where we can attach eBPF programs. One illustration of that is the well-known diagram of the BCC tools, a whole set of eBPF-based observability tools; I think Brendan Gregg was probably responsible for that diagram. The point of showing it is to illustrate the huge range of different parts of the system that can be instrumented using eBPF. Today we're mostly going to concentrate on the green box in the middle, which is really about networking-related activity.

Another way to look at the range of events we can hook into is with perf list. If I bring up my screen, we can ignore the code at the top for now and just run perf list, and you'll see a huge number of different types of events that we can instrument with perf. So I hope that gives you a sense that there's a huge surface to which we can attach eBPF programs, including a large set of network-related events. I won't have time to cover all of them, but let's introduce ourselves to a selection of networking events and how we might use them to build network-related functionality using eBPF.

The first place to start is with kprobes. There are lots and lots of kernel functions that relate to networking.
We can trigger an eBPF program on the entry to, or exit from, pretty much any function in the kernel, attaching with a kprobe for the entry or a kretprobe for the exit. As an example, we can use the kernel function that gets called when a TCP connection is set up. This is going to be the sort of hello world example for networking.

I'm going to be using the BCC framework, which is a good place to start if you're familiar with Python. It makes it very easy to write your user space code, it runs the compilation step for you to compile the BPF programs that are going to run in the kernel, and it makes it easy to attach those programs to events, as you'll see. My user space code starts by reading and compiling the eBPF functions I've written in this C file, network.c; we'll look at that in a moment. It then attaches a function called tcp_connect to the event associated with the entry to tcp_v4_connect. I've got a little bit of debugging that tells us when my eBPF programs are loaded and we're ready to go, and then my user space code prints out whatever tracing has been generated by my eBPF code.

Let's take a look. The function here is very simple: my tcp_connect program just traces out a message (there's a sketch of it below). Let's also take a quick look at the interfaces in my container, which has a loopback interface and eth0; the eth0 address is 172.17.0.2. So let's run this code, which is very straightforward, if I can type. There we go, and that tells us it's ready. The first thing to notice is that there could be other TCP connections being made on this machine, and my eBPF program is going to see all of them. If I run a curl request to the IP address we saw a moment ago, I'm not expecting it to succeed, because I don't think there's anything listening for HTTP requests inside that container, but we can use it to show that it triggers my tracing: it triggers my eBPF program to run.

So that's kprobes. Now let's move on to some more networking-specific program types, and the first one we'll look at is the socket filter. I'm going to attach to a raw socket using this socket filter program type. A raw socket sees traffic coming in and out very close to the network connection, so it's closer to the network interface than the rest of the networking stack. Socket filtering was designed for observability, and it's worth noting this: we can look at packets and make decisions about whether we want to pass a copy up to user space for observability purposes. The idea is that if you were running some kind of tracing utility in user space, you'd probably want to trace a subset of packets; maybe you'd want to trace TCP packets but not other sorts of networking activity.

So let's try an example where we attach to a raw socket with a socket filter program type and filter which packets get copied up to user space. Here's my C code (a sketch of this one also appears below). The eBPF program receives a socket buffer and steps through it, looking first at the Ethernet header to see whether it's an IP packet. If it's not an IP packet, it just returns zero. Then it looks at the IP header to see what type of packet this is, and here I'm going to trace out whether it's ICMP, commonly known as ping, or TCP traffic. If it's TCP traffic, I return minus one, and that tells the kernel I want to send a copy of this packet to user space.
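For reference, here's a minimal sketch of what a tcp_connect program like that might look like in BCC-style C. It's a hypothetical reconstruction rather than the exact demo code; on the Python side, BCC would attach it with something like b.attach_kprobe(event="tcp_v4_connect", fn_name="tcp_connect").

```c
// Hypothetical reconstruction of the kprobe hello-world described above.
// bpf_trace_printk writes a line to the kernel trace pipe, which the
// BCC user space program reads back and prints.
#include <uapi/linux/ptrace.h>

int tcp_connect(struct pt_regs *ctx)
{
    // Runs every time any process on this host enters tcp_v4_connect()
    bpf_trace_printk("tcp_v4_connect called\n");
    return 0;
}
```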
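And here's a sketch of the socket filter program just described, using BCC's protocol-parsing helpers from bcc/proto.h. Again, this is an illustrative reconstruction (the function name is mine, not necessarily the one in the demo); the user space side would load it as a BPF.SOCKET_FILTER program and attach it with BCC's attach_raw_socket.

```c
// Socket filter sketch: return -1 to send a copy of the packet to the
// raw socket in user space, or 0 to skip it. Illustrative reconstruction.
#include <uapi/linux/bpf.h>
#include <bcc/proto.h>

int socket_filter(struct __sk_buff *skb)
{
    u8 *cursor = 0;

    struct ethernet_t *ethernet = cursor_advance(cursor, sizeof(*ethernet));
    if (ethernet->type != 0x0800)      // not an IPv4 packet
        return 0;

    struct ip_t *ip = cursor_advance(cursor, sizeof(*ip));
    if (ip->nextp == 0x01)             // IPPROTO_ICMP, i.e. ping
        bpf_trace_printk("ICMP packet\n");
    if (ip->nextp == 0x06) {           // IPPROTO_TCP
        bpf_trace_printk("TCP packet\n");
        return -1;                     // copy this one up to user space
    }
    return 0;
}
```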
I'm going to need to write some code on my user space side to listen for this traffic. So first of all, let's stop listening to TCP connect events. Instead, we're going to load that socket filter eBPF program and attach it to a raw socket, and then we also create a socket where we can receive these copies of the network packets. And then, instead of printing out the tracing here, I'm going to read those packets; let me comment that out and fix the indentation. As you might recall, we're going to pass TCP traffic up to user space, but not ping. So let's try loading this. That means if I try to curl to that address, we see data being sent up to user space. But if I ping the same address (let's just make the window slightly wider so it all fits on one line), we can see that the ping is happening successfully, but we didn't see a copy of it in the tracing output.

Okay, so that's an example of socket filtering. We can only really use this for observability purposes; it's filtering what gets sent to user space, so we couldn't write kernel-based networking functionality using it.

Now let's look at XDP, which stands for eXpress Data Path. Daniel Borkmann, a kernel maintainer who works at Isovalent, told me a great story about how the idea for XDP came about. It was essentially: wouldn't it be great if we could run eBPF programs not on the CPU, but on the networking hardware itself? When an inbound packet arrives on the network interface card, if we could inspect that packet, and maybe decide to drop it or send it back out of the same interface, right there on the network interface card, it would never have to consume any resources on the CPU at all. So if you have a network interface card or a network driver that supports XDP, you can load BPF programs onto that hardware itself, as a hardware offload, and when a packet arrives on that network interface it triggers the eBPF program, which can make decisions about what to do with that packet: whether or not to pass it up to the networking stack.

Not all network cards and network drivers support XDP, so there is also support within the kernel for running eBPF programs as soon as network packets arrive in the kernel, which gives us the same kind of functionality. There's still some performance gain, because the packet hasn't had to traverse the networking stack, but it's not quite the hardware offload that was originally intended. Importantly, if we're using containers, we can still use XDP, because the kernel implementation of XDP can be attached to virtual network interfaces as well as physical ones. But XDP is only about inbound packets. If we think about it in the context of a network card, it really makes sense to process inbound packets as early as possible, before we spend CPU resources on them; but if the CPU has already handled a packet and it's on its way out, then the CPU has already done whatever was necessary.

So let's take a look at an example of an XDP program. My user space code is pretty straightforward: we load a function that I've called xdp and attach it to XDP on the interface eth0, the same interface we used for the socket filter, defined up here. I also want to go back to printing out tracing, and we're not dumping the packet data anymore. Let's take a look at that XDP function.
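Here's roughly what that XDP function might look like. It's a hypothetical sketch: the header parsing is left inline rather than factored into the helper used in the demo, and it assumes an IPv4 header with no options.

```c
// XDP sketch: runs as early as possible on the receive path. Here we just
// trace ICMP echo (ping) requests and let every packet continue up the stack.
#include <uapi/linux/bpf.h>
#include <uapi/linux/if_ether.h>
#include <uapi/linux/ip.h>
#include <uapi/linux/icmp.h>
#include <uapi/linux/in.h>

int xdp(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return XDP_PASS;   // too short
    if (eth->h_proto != htons(ETH_P_IP)) return XDP_PASS;

    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end) return XDP_PASS;
    if (iph->protocol != IPPROTO_ICMP) return XDP_PASS;

    struct icmphdr *icmp = (void *)(iph + 1);
    if ((void *)(icmp + 1) > data_end) return XDP_PASS;

    if (icmp->type == ICMP_ECHO)
        bpf_trace_printk("XDP: ICMP echo (ping) request\n");

    return XDP_PASS;   // always pass the packet on up the networking stack
}
```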
For readability, I've hidden some of the code inside a helper function that checks whether this is an ICMP ping request, but it's very similar to what we just looked at in the socket filter case. We take the packet, look at the Ethernet header and then at the IP header to see whether or not this is an ICMP ping request, and if it is, we trace that out. Okay, so let's build that, load the functions into the kernel, and then run ping again; in fact, I'm going to run it with just one ping request because it'll be easier to read. Oh, there we go. We see the ping request coming in through XDP, so think of that as the packet arriving at the network interface, and then it gets to the raw socket, where we see first the request, destined for the address ending in .2, and then we actually see the response going back out through the socket filter, because the socket filter sees traffic in both directions.

All right, another program type we can attach is related to traffic control, or TC. This is a facility within the kernel where we can classify network packets and run actions on them. We attach these filters to what are called queuing disciplines, and we can do this on both ingress and egress; my example is going to use ingress. The idea of traffic control is that you might use it to prioritize certain types of traffic or do traffic shaping, and you could also use it to filter certain types of traffic. There are lots of different types of filters that you can add, and an eBPF program is one type of filter that you can attach to these queuing disciplines. So my example is going to add a filter that attaches an eBPF program to the ingress of a queuing discipline.

Okay, so my program is again doing something very similar: we trace out the fact that we got a packet, and then we look through the packet, using the same helper as before, to check whether it's a ping request. And in this case, I'm going to drop it. The TC_ACT_SHOT return code says this packet is "shot": we don't want to pass it on up the stack. So when a ping request reaches this TC ingress part of the stack, it's going to get dropped (there's a sketch of this filter below). Now I need to load this. It's a little bit more involved here, because we need to create the queuing discipline and add the filter, and, before I forget, I need to make sure it gets unloaded at the end. So I'm going to start with the function we just looked at, and this code here adds the queuing discipline if it doesn't already exist and then attaches that eBPF program as a filter.

Let's give this a whirl and run a ping. We can see the ping arrived at XDP, it arrived at the socket filter, it arrived at TC ingress, and then we will have dropped it there, so we don't see the response going back out through the socket filter. This looks like some other packet that arrived on that same interface; my guess is that it's probably some kind of ARP traffic, and we could add some code to check whether that's true. Let's run another ping just to demonstrate. Yeah, those pings are not being responded to: 100% packet loss, so we're successfully dropping traffic using TC.
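Here's a sketch of a tc ingress filter along the lines of the one that's dropping those pings. It's an illustrative reconstruction (the function name is mine), and it makes the same no-IP-options assumption as the XDP sketch.

```c
// tc ingress sketch: drop ICMP echo requests, pass everything else.
#include <uapi/linux/bpf.h>
#include <uapi/linux/pkt_cls.h>
#include <uapi/linux/if_ether.h>
#include <uapi/linux/ip.h>
#include <uapi/linux/icmp.h>
#include <uapi/linux/in.h>

int tc_drop_ping(struct __sk_buff *skb)
{
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return TC_ACT_OK;
    if (eth->h_proto != htons(ETH_P_IP)) return TC_ACT_OK;

    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end) return TC_ACT_OK;
    if (iph->protocol != IPPROTO_ICMP) return TC_ACT_OK;

    struct icmphdr *icmp = (void *)(iph + 1);
    if ((void *)(icmp + 1) > data_end) return TC_ACT_OK;

    if (icmp->type == ICMP_ECHO) {
        bpf_trace_printk("tc ingress: dropping ping request\n");
        return TC_ACT_SHOT;   // drop it: it never goes up the stack
    }
    return TC_ACT_OK;         // let everything else carry on as normal
}
```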
But we don't just have to drop packets. We can also modify them, and we can even make decisions about sending them out through different interfaces. So in the next example, rather than dropping that ping request, I'm going to convert the request into a response. That means that instead of the ping request having to go all the way up the networking stack to user space to get a response, which then comes back down through the networking stack, I'm going to create that response within the TC ingress filter.

I have a function to do this called tc_pingpong, which we're going to load instead of the function we just saw that dropped packets. And ping-pong is here. If it's not a ping request, we return TC_ACT_OK, which says just do whatever you were planning to do with this packet. If it is a ping request, we swap the source and destination addresses, both the MAC addresses and the IP addresses, because we want the packet to go back to whoever sent it to us. We modify the type of the message, to say this is a response rather than a request; hidden within that helper, it's also updating the ICMP checksum in the header. And then we have to do something maybe a little bit special. What we're going to do is say: throw away the original packet, because we don't want to forward the original request on up the networking stack. So we're going to respond with TC_ACT_SHOT, which is what we used to drop a packet. But before we do that, we clone a copy of the packet, the socket buffer, and we tell the kernel that we want to send that socket buffer back out through the same interface index that it arrived on.

So let's give this a try. That's loaded, and we're hoping to see successful ping responses this time. And we did; that was pretty quick. We saw the packets responded to, 0% packet loss. And what we could see there is that the request comes in through XDP, it goes through the socket filter, it gets to TC, which looks at the message and gives us a little bit more information about it, and then we can see the response, destined for the address ending in .1, being sent back out through the socket filter.

Now, you might say that's not super convincing: how do we know the packet didn't go up through the full networking stack? One way we can check is to use that perf utility to trace the networking-related events that happen as those ping messages traverse the networking stack. So instead of just running ping directly, I'm going to run it under perf trace. I'm interested in all the network events, and I'm going to send the output to a file called, let's say, with_tc. Then I want to run ping, sending one request to the same address as before. That gives us exactly the same tracing we saw before, but if we look at that with_tc file, we can see a number of different networking events that were generated as the ping packet was received and the response was generated and sent out. I ran the same thing earlier without TC, and this is the result. The trace without TC is a little bit longer than the one with it: it starts with the same calls, but with TC we're missing some of the network interface receive calls. That's telling us that something further up the stack that would normally receive the packet never needs to be called, because the packet never gets that far.

So hopefully that has given you an illustration of how we can use eBPF networking functionality to bypass parts of the networking stack and achieve performance gains. And this is one of the reasons why eBPF is so powerful for cloud-native networking.
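Before we turn to Kubernetes, here's a rough sketch of that tc_pingpong program. It's an illustrative reconstruction rather than the demo code itself: the parsing and its no-IP-options assumption are the same as before, and the inline checksum adjustment (accounting for the ICMP type byte changing from 8 to 0) is one simple way to do what the demo hides inside a helper. bpf_clone_redirect and TC_ACT_SHOT are the real kernel mechanisms described above.

```c
// tc ingress sketch: turn an ICMP echo request into an echo reply and send
// it straight back out of the interface it arrived on. Illustrative only.
#include <uapi/linux/bpf.h>
#include <uapi/linux/pkt_cls.h>
#include <uapi/linux/if_ether.h>
#include <uapi/linux/ip.h>
#include <uapi/linux/icmp.h>
#include <uapi/linux/in.h>

int tc_pingpong(struct __sk_buff *skb)
{
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return TC_ACT_OK;
    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end) return TC_ACT_OK;
    struct icmphdr *icmp = (void *)(iph + 1);
    if ((void *)(icmp + 1) > data_end) return TC_ACT_OK;

    if (eth->h_proto != htons(ETH_P_IP) ||
        iph->protocol != IPPROTO_ICMP ||
        icmp->type != ICMP_ECHO)
        return TC_ACT_OK;              // not a ping request: leave it alone

    // Send it back where it came from: swap source and destination MACs...
    unsigned char mac[ETH_ALEN];
    __builtin_memcpy(mac, eth->h_source, ETH_ALEN);
    __builtin_memcpy(eth->h_source, eth->h_dest, ETH_ALEN);
    __builtin_memcpy(eth->h_dest, mac, ETH_ALEN);

    // ...and swap source and destination IP addresses.
    __be32 addr = iph->saddr;
    iph->saddr = iph->daddr;
    iph->daddr = addr;

    // Turn the request into a reply, and patch the ICMP checksum for the
    // type field changing from 8 (echo request) to 0 (echo reply).
    icmp->type = ICMP_ECHOREPLY;
    __u32 csum = (__u32)ntohs(icmp->checksum) + 0x0800;
    csum = (csum & 0xffff) + (csum >> 16);   // fold the carry back in
    icmp->checksum = htons((__u16)csum);

    // Clone the modified packet back out of the interface it arrived on,
    // then drop the original so it never travels up the networking stack.
    bpf_clone_redirect(skb, skb->ifindex, 0);
    return TC_ACT_SHOT;
}
```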
In Kubernetes, we run our application code in pods. Those pods typically have their own networking namespace, separate from the host's networking namespace, and in each of them there's a separate TCP/IP stack. So imagine a packet destined for our application running in a pod. It arrives at the physical interface on the host, traverses the whole IP stack on the host, and then gets passed out through the virtual ethernet connection to the pod, where it goes through the pod's networking stack. With eBPF, we can bypass a lot of the host networking stack, because we can see the address that packet is destined for, and we know that it belongs to one of the pods, because as a Kubernetes CNI we were responsible for setting up those endpoints. So we know the packet is destined for the pod, and we can send it straight to the appropriate virtual ethernet interface. That makes networking significantly faster, and there's a really great blog post about CNI benchmarking on the Cilium website that you might want to check out for more details.

Another huge advantage of eBPF is that it instruments the kernel, so we don't have to instrument individual applications; we don't have to change applications or make any configuration changes. Contrast this with the sidecar model, which is typically used for a lot of observability, security, and networking purposes. A sidecar needs to be injected as a container into each pod that needs instrumenting, and to get that sidecar container into the pod there has to be a YAML definition for it. You typically inject that YAML automatically through some kind of process, maybe admission control or maybe your CI/CD system, but that YAML has to be defined. Contrast that with an eBPF-enabled tool, where the eBPF program is loaded into the kernel and gets access to the relevant events regardless of the pod or the application. We don't need to modify our pods at all in order to use eBPF for observability purposes, for security purposes, or even for networking purposes that we might otherwise push into a sidecar.

If we think about networking in particular, that creates a lot of really cool capabilities. The fact that we can inspect packets gives us some really powerful, really performant observability tools, and that information can be mapped to Kubernetes identity, so we can have really detailed mapping between which network packets are flowing and which Kubernetes pods and services they relate to. We can get some really in-depth security forensics from this information. We can use the ability to drop packets, or to modify them, to implement network policies. We can do encryption in eBPF. These are all really great benefits for security. And the fact that we can change the destination of a packet allows us to create all sorts of interesting network functionality: load balancing, routing, services, and, particularly interesting I think, the idea of an eBPF-enabled service mesh. If that's something you're interested in, do come and speak to us about Cilium. eBPF is enabling the next generation of service mesh, because we don't have to instrument the pods that make up those services in order to have these service mesh capabilities.

So I hope that's given you some insight into why I'm so excited about eBPF and how we can use it, particularly for networking-related functionality. If you're interested in writing your own eBPF programs, my GitHub repo has some starting points.
If you're interested in writing eBPF programs, I would also thoroughly recommend getting involved in a project such as Cilium. We have lots of issues marked as good first issues, so do come and check those out. I hope you're going to have lots of questions for me; I'm pre-recording this now, but I will be online and looking forward to speaking with you.