Hi, my name is Tomas Hruby, I work for Tigera, and I'm here to talk about Project Calico and its eBPF data plane. I'm presenting work that has been done jointly by Shaun Crampton, Shri Amar Devon, Mazda Knossop and myself over the past couple of years. In the rest of the talk I'd like to give you a brief overview of Project Calico, then talk about the eBPF data plane and how it differs, and about the tools: how we can troubleshoot policies, and how to get statistics and logging out of the data plane. This talk is not meant to be a pitch for any observability features; it's really meant to reflect the feedback we got from the community about how we can improve the project. Lots of Calico users are switching from iptables to eBPF, but that switch is not always as smooth as they would hope, because the environment has changed dramatically.

Project Calico probably doesn't need much of an introduction: it is a CNI for Kubernetes that also works in other environments. It is responsible for networking and provides a network policy model. That model is richer than what Kubernetes natively supports, but Calico fully implements the Kubernetes network policy model as well. Calico is essentially a control plane: it collects information from the Kubernetes environment, merges it with the user-provided policies, and compiles and passes the result to a data plane that executes those policies. Calico is a multi-platform project and supports multiple data planes. It naturally runs on iptables; it supports the eBPF data plane, which is what I'm going to talk about today; there are other data planes like the VPP one contributed by Cisco; and Calico also runs on Windows, where nowadays we can leverage some eBPF too. The data planes differ depending on the user's needs. With iptables, pretty much everything is provided by the kernel and Calico just programs the rules. With eBPF, the kernel provides the environment, but the data plane itself is provided by Project Calico: the code ships with Calico and is injected into the kernel.

eBPF has been here for a while now; we released the eBPF data plane about two and a half years ago, and it's great. It provides better scalability of services and lower latency within the nodes, and it comes with an integrated kube-proxy replacement, which results in fewer conflicts now that multiple components no longer have to fight over iptables. The cool thing is that we can add new features. For example, if a client accesses your cluster through a node port, we can preserve the client's IP all the way to the destination pod, which helps users write more sensible and more robust network policies; something that wasn't possible with iptables. It allows us to implement DSR (direct server return), which helps with performance. And in general, eBPF gives us a lot of flexibility, so we can tweak existing Kubernetes features or, when users need a specific feature that mainstream Kubernetes doesn't support, extend them.
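For context, switching an existing cluster to the eBPF data plane is a configuration change rather than a reinstall. The sketch below assumes an operator-managed Calico install; the ConfigMap name and fields follow what the Calico docs describe, but check them against your release, and the API server address is a placeholder for your cluster.

```
# Since kube-proxy gets bypassed, tell Calico how to reach the API server
# directly (placeholder values for your cluster).
kubectl create configmap -n tigera-operator kubernetes-services-endpoint \
  --from-literal=KUBERNETES_SERVICE_HOST=<api-server-ip> \
  --from-literal=KUBERNETES_SERVICE_PORT=6443

# Flip the data plane from iptables to eBPF.
kubectl patch installation.operator.tigera.io default --type merge \
  -p '{"spec":{"calicoNetwork":{"linuxDataplane":"BPF"}}}'
```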
But let's first start with iptables and the good old days. iptables is a super well understood data plane, but it doesn't necessarily scale that well and it's not very extensible. If you wanted to extend iptables, you would have to provide a kernel module, and good luck persuading your users to load your kernel module in their environment. The good thing is that all the rules live in iptables, so when you dump them, the result is more or less human readable. The services implemented by kube-proxy and the policies implemented by Calico, or any other networking project, are in plain sight: you can dump them and, with some effort, read and understand them. That is what people were used to. Linux does the rest of the heavy lifting: it does the routing, it handles the protocols, the edge cases, and so on and so forth. It is super robust and well tested.

Then there is the eBPF data plane, where things are different. We no longer have the rules in a single place in iptables; the data plane is implemented by attaching eBPF programs, typically to the TC hooks on the ingress and egress of devices. Those programs are free to forward a packet to another device directly, completely bypassing iptables. The programs may or may not consult the routing within the system, or they can decide on their own, so that changes how things used to work. In general, even though eBPF has been here for a while and is somewhat understood, it's definitely not human readable. If you install eBPF programs, your user can dump a program and perhaps disassemble it, but not really read it, even though I guess some people in this room may actually do that. The node's network stack is largely bypassed, and the kube-proxy that comes with Kubernetes is now useless, because it implements the services in iptables and in the host network stack, and the packets don't hit it anymore. The policies are implemented in eBPF code that may refer to eBPF maps, but again, unlike in the iptables world where you could dump the rules and read them, that's not really straightforward.

So from the user's perspective it's fantastic: we've got something that is scalable, that is faster, that has new features. But hold on. The feedback we got was along the lines of: how can I verify that the data plane really implements the policies we asked it to? We have users who say: I don't see connections in my conntrack. It is difficult to verify that the services are programmed correctly. All of that was in plain sight in the iptables world. Some users even have third-party tools that rely on the Linux conntrack and so on, and in this environment those suddenly break. So this talk is meant to bridge the gap between the users and the new data plane: what the users were used to and how the environment has changed. That should help users adopt it more broadly and contribute to the project and to the community as well.

Since its inception, Calico has come with a small debug utility called calico-bpf. It's now more comfortably accessible directly through the calico-node binaries within your cluster, and it provides a bunch of commands that allow you to introspect or modify the data plane. I'm going to walk you through what is available and how you can use it. Typically, when a service or some connectivity doesn't work in your cluster, the first question you want to answer is: are the services implemented correctly? In the Calico eBPF data plane, services are essentially implemented as two BPF maps: the first map is a set of frontends that point into the second map, which holds the sets of backends.
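To see why a dedicated tool helps, here is roughly what inspecting those maps looks like with generic tooling. This is a sketch: bpftool and these subcommands are real, but the pin path and map name shown are from memory and may differ between Calico versions.

```
# Generic eBPF introspection, run on the node or inside the calico-node pod.
bpftool prog show    # lists loaded programs, including Calico's TC programs
bpftool map list     # lists the maps, including the NAT frontend/backend maps

# Dump a NAT map directly (pin path and map name are illustrative).
# The output is raw hex key/value pairs: correct, but painful to decode by hand.
bpftool map dump pinned /sys/fs/bpf/tc/globals/cali_v4_nat_fe
```

Calico's own tooling decodes these structures for you.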
You can use the calico-bpf nat dump command to list this in a human-readable form. It prints which service IP is there, what protocol it uses, what port, how many backends there are, and how many of them are local. On the indented lines you get the list of backends the service points to, with the service ID and the backend ID. On this slide you can see that certain backends belong to multiple services with the same ID. The reason is quite simple: a service can have multiple IPs, like the cluster IP, an external IP, a node port, and so on and so forth.

The next thing you would like to check is whether your understanding of what is in the cluster matches the understanding of the data plane itself. So another thing we can dump is routes. These are not quite routes in the "ip route" sense of things; each entry describes what a certain CIDR within the cluster means. On the left-hand side you see the CIDR, and on the right-hand side the explanation of what that CIDR means: whether it contains workloads (pods), whether it is remote or local, if it's local, at which interface it is attached to the host, and if it's reached via a tunnel or an overlay, what the next hop is, plus a bunch of other things that are important for your understanding.

If you've verified that and it matches your expectations, the next thing is to look at the connections themselves. The eBPF data plane, as mentioned before, implements services on its own, because kube-proxy is bypassed in this world. eBPF needs to do all the NATing, and for that it needs to maintain connection tracking so it can do the reverse translation; it also tracks connections so that it knows what was allowed into the system and what was not. In general we have a single entry per connection that maintains all the core properties, and when there is an address translation, we have a second entry that refers to the one maintaining most of the properties. This renders the kernel conntrack pretty much irrelevant, and that is one of the things users are most confused by: there are now two conntracks. When you use the calico-bpf tool to dump the conntrack, you see a line per entry (unfortunately it's wrapped here, because it won't fit on the slide). In the simple case, where there is no NAT, the entry just shows you the connection tuple, which is the key, and the type of the entry; here it says it's a simple one. It also shows other properties of the connection: here it's a TCP connection, it has been opened and gracefully closed, and you can see how long ago those things happened, which is important for garbage-collecting these connections. In the more complex case, when a service is in play, you also see an extra entry keyed by the pre-DNAT tuple. So you see that the pod connected to the service, and that entry carries a reverse key pointing to the post-DNAT entry, which maintains the information I described on the previous slide. The difference is that it also carries the information about the network address translation and a couple of other things. So, yeah, if that looks good, you can carry on.
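As a concrete illustration, this is roughly how you would run those dumps from one of the calico-node pods. The pod name is a placeholder, and the namespace depends on your install method (kube-system for manifest installs, calico-system for operator installs).

```
# Are the services programmed correctly? Dump the NAT frontend/backend maps.
kubectl exec -n calico-system calico-node-abcde -- calico-node -bpf nat dump

# Does the data plane's view of the cluster CIDRs match yours?
kubectl exec -n calico-system calico-node-abcde -- calico-node -bpf routes dump

# Inspect the eBPF connection tracking table, the stand-in for
# "conntrack -L" in this world.
kubectl exec -n calico-system calico-node-abcde -- calico-node -bpf conntrack dump
```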
I've mentioned a couple of times that the Linux conntrack is not important anymore, that it's irrelevant. That's not quite entirely true. There are scenarios where doing things outside of eBPF is just simpler than doing them in eBPF. A typical scenario is masquerade: Calico allows you to tag an IP pool as NAT-outgoing, so traffic leaving your cluster carries the cluster's IP, or one of the cluster's IPs. It is possible to do this in eBPF, but it's not entirely straightforward, because it's inherently difficult to allocate a non-conflicting source port. So why not let those packets go through iptables and the host network stack, and let the kernel do what it does pretty well? We just instruct the connections to go that way and let Linux track them. Nevertheless, such a connection is also tracked in the eBPF connection tracking; we must do that, because we have to tell what should happen to the packets of that connection. So even for these connections, the eBPF conntrack is the one to consult.

So far we were talking about connectivity in your cluster, but if you're using Project Calico, you're probably most interested in network policies. That has also changed dramatically. Before, Calico compiled the information from the cluster together with the policies and injected it into iptables, essentially into a single point. Now that information is split up, and the policy, or a part of it, is attached to individual endpoints and individual devices within the system. These policies are implemented as a program, and that program may refer to eBPF maps, but from the perspective of a user it is essentially a binary blob. You can dump the policy, you can disassemble it, and if you have intimate knowledge of the data structures you may actually get something out of it. That's a complaint we got, not because users would need to debug it; it's more that in the zero-trust security world, users want to trust only what they can see.

So in a recent release we've added a policy dump. You can specify an interface and a hook (ingress or egress, or in some cases also the XDP hook), and it dumps what sort of policy is in place at that point in the system. In that dump you see which policy is being implemented, which rule of the policy specifically, and a more human-readable form of how the rules are actually implemented. For those really interested, there is also a dump of the assembly code, in a more human-readable form with some extra annotations, which lets you inspect which state the policy looks at, how that state is inspected, and how the policy is executed. It allows you to walk through the code and see what happens when an IP matches or doesn't match, and how the flow through the policy goes.

I've mentioned the eBPF maps a couple of times, and they mostly come into play when a policy uses a selector that can match very many IPs and CIDRs, which would be too complex to put in code, so we put them in what we call an IP set. We borrowed the name from the iptables world, because it is essentially a set of IPs. The policy dump shows you which selector is implemented by which IP set, and calico-bpf allows you to dump the IP set and verify whether it matches your expectations.
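In practice, those two dumps might look like the sketch below. The policy dump subcommand's exact syntax has varied between releases, so the interface and hook arguments here are approximations; check the tool's help on your version.

```
# Dump the policy program attached to a workload interface at the ingress
# hook ("eth0" and the hook name are illustrative; see
# "calico-node -bpf policy -h" for the exact syntax on your release).
kubectl exec -n calico-system calico-node-abcde -- \
  calico-node -bpf policy dump eth0 ingress

# Dump the IP sets backing selector-based rules, to verify that a selector
# expanded to the IPs you expect.
kubectl exec -n calico-system calico-node-abcde -- calico-node -bpf ipsets dump
```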
The next thing we've added recently is a set of counters. These counters are per interface and allow you to see the obvious things: how many packets went through the interface, how many were accepted by a policy, how many were dropped by a policy. But they also allow you to see other things. Packets may get dropped because of errors, because of unexpected events, and those counters should be zero. So if something is not working in your system and you see that these counters are not zero, that's a smoking gun and you can go and investigate. One of the most important counters that's been added goes back to the policy dumps: every time a packet hits a policy, or a policy rule more specifically, we increment a counter on that rule. So when you dump the policy (leaving aside the assembly dump we discussed before), it also tells you how many packets went that way, and you can observe whether some of the rules are actually being hit or not.

The ultimate option when you're diving into the data plane, debugging your system or debugging the project, is to turn on logging. Since the inception, we've had the bpfLogLevel option, which you can set to debug, but it turns on logging on every endpoint for every packet. You get super fine-grained insight into what's happening in the data plane, but it's also super slow, because it all goes into the trace pipe; there are various ways to read the trace pipe, but it's all synchronized and slow. It's a firehose. So in the upcoming release we are adding a filtering option. Essentially we use pcap/tcpdump-style filter expressions, which people should be familiar with, and they allow us to specify on which interface and at which hook certain packets should be logged and the rest not. The matched packets take the slow path while the rest take the fast path, so the impact should be minimal, and it should be possible to use this in production environments, not just in staging. We leverage the power of libpcap here, which can compile tcpdump expressions into BPF code. Unfortunately, that's the ancient classic BPF, which has some similarities with eBPF but has to be translated into eBPF first. That's actually what happens when you use tcpdump: the code is injected into the kernel, and the kernel translates it into eBPF first. Those filters are prepended only when we are in debug mode, and both paths have to execute the same policies, so that debugging in production doesn't compromise the system. And believe me, it is much easier to do it this way than to use "-j TRACE" or similar tools in iptables. Because, as I said before, we had a firehose, we really wanted fine-grained logging here, so we've added options: you can specify a list of devices and expressions, and we also have some special values, so you can run the filter for all the devices in your system, or only for host endpoints or workload endpoints. And Calico has always supported configuration that is either global or local to a node, so you can select very specifically what you are interested in within your system.
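For illustration, enabling the filtered logging might look roughly like this. The bpfLogLevel field is the long-standing Felix option mentioned above; bpfLogFilters is my recollection of the new filtering option the talk describes, so verify both against the FelixConfiguration reference for your release.

```
# Sketch: turn on BPF debug logging, but only for DNS traffic on eth0
# (field names and filter syntax are assumptions to verify on your release).
kubectl patch felixconfiguration default --type merge -p '{
  "spec": {
    "bpfLogLevel": "Debug",
    "bpfLogFilters": { "eth0": "udp port 53" }
  }
}'

# The log lines end up in the kernel trace pipe; read them on the node:
cat /sys/kernel/debug/tracing/trace_pipe
```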
So let me walk you through some examples of the logging. Here on the first line, on the left-hand side, you can see that each line is prefixed with the name of the device the log comes from, plus an indication of the hook: in this case an I, which is ingress, but you would also see E for egress and X for XDP. Not surprisingly, we first have to parse the packet, so you get some information about what sort of packet is flowing through the system. Then we do a conntrack lookup; in this case we have a miss, therefore this is the first packet of a connection, and in that case we have to see whether it matches a certain service. Here it doesn't match any service: we get a NAT miss, therefore it's a plain connection, and we have to pass it through policy. We have a host endpoint without a policy, so we accept the packet. We have to create a conntrack entry, and the last thing is to decide where the packet goes. In this case we use a FIB lookup, so we're essentially consulting the routing tables of the system; we get a hit and we can forward the packet to its destination. From this point on, it bypasses iptables completely and saves the execution time.

In the more complex case of services and network address translation, you would see lines like these. In the first batch we try to match the packet against a service, and unlike in the example before, we get a hit in the first table: we get a service with a certain ID and we pick an ordinal for the backend. Then we go and read the backend entry and report which backend we selected. Later on, when we create the conntrack entry, we also have to create the second entry I mentioned before. And as we flow through the rest of the program, we also have to actually execute the DNAT, so we patch the destination IP and the destination port into the packet, update the checksums and all this stuff, and then we carry on with what you saw on the previous slide.

So with that, I'd like to conclude. eBPF is a great tool, it's super flexible and powerful, I love working with it, and kudos to everyone who has contributed to the eBPF subsystem in Linux. But it also disrupts how users were used to working with the system, and how they used to think about the flow of packets through the system; we need a good understanding of that, good tooling, and we actually have to do quite a bit of teaching. So with that: just use eBPF, enjoy it, it's great, and I'm happy to take any questions if there are any.

Moderator: Thank you, Tomas. I can see a question back there. I can see a couple of questions.

Q: I was just curious: it seems like you dealt with writing to the trace pipe, that firehose, by filtering. But did you also consider using perf events or an event buffer, a pub/sub type mechanism, instead of filtering into the trace pipe?

A: I think those are two slightly different things. Perf events, for one, are lossy, so we may not actually catch everything we would like to get. We use perf events for other things, but for this case, this is really printk.

Q: When you say that it disturbs the flow of packets as we knew it, what's the meaning of that? Is it just because we are hooking into the functions in the kernel, or for something else?

A: Could you repeat the beginning of the question?

Q: When you say that it disturbs the flow of packets, what does it mean?

A: Yeah, great question. It really means that a packet may go from one device to another device without following the rest of the path it would otherwise take through the kernel. So if you hook in at TC and the packet is then forwarded to another device, it just skips a lot of the host network stack; it simply takes a different path.

Q: Actually, I have a follow-on question, kind of like that. If I were to land on an arbitrary box...

A: Could you take your mask off so I can...

Q: Sure. I have a follow-on question, kind of like that. If I were to land on an arbitrary box and not know whether eBPF was passing my packets around in places, is there any way a user could actually see that, since it's all kernel land?

A: Yeah, I think that's where the feedback from the community came from in the first place.
So we had users who were like: hey, I have a conntrack entry which is kind of half open, or I have a connection and I don't see it in the conntrack at all. That's where the confusion came from, and that's what we are trying to address here.