Welcome to the last day of KubeCon. Are you all having a good time so far? Yeah. OK, so let's get started. In this session, we will cover a new and innovative way to use eBPF to efficiently enforce network policies for host processes. Before we start, a bit about myself: my name is Vinay Kulkarni, and I work for eBay Cloud, where I'm helping build identity-based network segmentation for eBay. The agenda for today: we start with an overview of what network namespaces are and understand the difference between host processes and processes that are running in Kubernetes pods. We will then cover the use case of efficiently securing host network process communication that motivated this project, and look at how host processes are secured today. In order to show what's new in this talk, we need to set some background and context. For that, we'll briefly look at the structure of Cilium identities and Cilium network policies, and see how Cilium uses eBPF to enforce network policies for regular Kubernetes pod traffic. We then look at how we may assign identities to host processes. Then we wave that eBPF magic wand in order to identify host processes inside the kernel, in the network traffic data path. Then I will show you a demo of this feature, and after that I will touch upon eBay's trust fabric solution, which we currently use for securing network traffic in our defense-in-depth strategy. And then I will open up the floor for questions. Most of you may be familiar with Kubernetes namespaces. They provide us with a way to isolate cluster resources and divide them amongst users. Similarly, network namespaces provide us with a way to isolate and share the kernel network stack amongst multiple applications that are running on a node. On the very left here, you can see a blue box. It's a regular host process.
It gets to use the localhost and eth0 interfaces, the IP addresses, and the routing and firewall rules that are available in the host network namespace. To the right of that, you have pod one with a single container in it. It gets its own interfaces, IP addresses, and routing rules, which are isolated from the host. And pod two here has two containers. They get to share their own localhost and eth0 interfaces and their IP addresses and routing rules, but they are isolated from pod one and from the host. So to summarize, the key difference between host processes and Kubernetes pod processes is that all host processes share the network interfaces, IP addresses, et cetera, in the root or host network namespace, while all containerized processes within a Kubernetes pod get their own copy of interfaces, IP addresses, and routing rules that is not shared with any other namespace. This means that each Kubernetes pod acts like its own virtual machine as far as networking behavior is concerned. The consequences of this are that pod-to-host communication happens via the eth0 interface in the pod, pod-to-pod communication typically transits the root network namespace, and pod-to-external communication transits the eth0 interface in the root namespace. Now let's look at our use case. Consider this scenario. You have two nodes: a master node that runs the Kubernetes API server and the scheduler, and a worker node, node one, that runs the kubelet, which listens on TCP port 10250. The API server talks to the kubelet on port 10250 in order to perform tasks such as exec'ing into a container or getting container logs. This is a legitimate and expected communication pattern between the API server and the kubelet. Now let's bring in the scheduler. The scheduler is not expected to talk to the kubelet directly, at least.
But let's say there is a vulnerability in the scheduler that allows a hacker to take control of it and impersonate the API server. This hacker now has access to the customer workloads. Any request from the scheduler to the kubelet should be denied. But the kubelet cannot distinguish between traffic coming from the API server and the scheduler impersonating the API server, because they have the same source IP and stolen credentials. So we need another layer of security here. How are host processes secured today? Well, it is accomplished at layer 7 through cryptographic mechanisms using transport layer security, or TLS. Identities are established using certificates, and communication is secured by establishing a shared secret key, or session key, to authenticate and encrypt the data transmitted between the parties. Now, layer 7 security is sufficient. So then what's the catch? Well, cryptographic functions are not cheap. There is the computational overhead of encryption and decryption. There is increased latency. And then there are increased bandwidth needs. This makes it an easy target for DoS attacks. The defense-in-depth strategy is to have an additional layer of security, or as many layers of security as is feasible. In this case, we're looking at having another layer of security that drops malicious traffic early. That is what eBPF brings to the table. Now let's look at how Cilium efficiently secures Kubernetes pod traffic. What Cilium does is attach eBPF programs to the pod network interfaces as well as the host network interface. Long story short, these BPF programs perform sender identification and policy enforcement. The BPF programs that handle traffic egressing the pod insert the sender identity into the encapsulation headers that we use. The BPF programs at the ingress look up the policy for the incoming sender and determine whether the traffic should be allowed or denied; that is the enforcement of the policy.
In this example, we have the bank teller pod tagged with the teller identity for traffic going out, and the BPF programs at the ingress, which is the bank database pod, check to see if this is allowed and then enforce the policy. In this case, it lets the traffic through. The same BPF programs would deny traffic coming from the bank camera or any other source that's not allowed by policy. The drop here happens at the lowest layers of the network stack, well before any computational overhead of cryptographic mechanisms is incurred. Now, we talked about encapsulation; what's the story on the wire? Well, when pods communicate in our scenario, where we are using Geneve encapsulation headers, or if we choose to use VXLAN, the BPF programs on the sender side store the identity of the sender in the VNI field, the virtual network identifier field. It's a 24-bit field. This also means that there is a 24-bit constraint on the value of the identity. What we also need is a way to express the policy. How is this policy expressed? Cilium offers a custom resource named CiliumNetworkPolicy that allows us to specify the network policy for our traffic. In this example, we have two pods: the bank teller pod with label role=teller, and the bank database pod, the bank DB pod, with label role=database. Now, a CiliumNetworkPolicy that only allows the teller to access the database is shown here. What it says is: apply this policy to pods with label role=database, which would be the bank DB pod, and allow incoming traffic from pods with label role=teller, which would be this pod here, the bank teller pod. You're hearing the word labels a lot; as you might guess, they play a key role here. Now, here is another example. Let's say we have a bank auditor pod. It needs access to all database pods in all namespaces.
Such a policy is expressed using a cluster-wide network policy that applies to pods with label role=database. In this case, it says: apply this to all database pods with label role=database, and the action to take is to allow traffic that comes from pods with label role=auditor. Now, there is another piece to this puzzle, and that is the CiliumIdentity custom resource. This custom resource allows Cilium to associate labels with integer identity values. These integer identity values facilitate efficient policy lookup in the BPF programs. In this example, the identity value 12345 is associated with the label role=teller. Generation and management of identities for host processes could be a talk of its own, because it's a complex topic. It's highly subjective, and there is no one right way to assign identities to host processes; it can vary from one use case to another. However, I have to discuss it briefly here, specific to our use case, in order to give better context for the talk. So what happens in our use case? We use a unique ID for like-kind processes. What do I mean by that? Let's see. One example is that all Kubernetes API servers in a cluster would carry the same identity, say 123, and all kubelets in a single cluster would bear the same identity, say 456. All Jaeger agent instances across multiple clusters could bear the same identity value of 789, but you may choose to do it on a per-cluster basis. You see what I mean. The common constraint, though, is that in this setup with encapsulation, we have to live with a 24-bit identity, the 24-bit number that can go into the VNI field of the Geneve or VXLAN headers.
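For reference, a minimal sketch of the policies and identity object just described might look like this in YAML; the field names follow the Cilium custom resource schemas, while the policy names and identity value are illustrative:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-teller-to-db
spec:
  endpointSelector:
    matchLabels:
      role: database       # applies to the bank DB pod
  ingress:
    - fromEndpoints:
        - matchLabels:
            role: teller   # only the bank teller may connect
---
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-auditor-to-all-dbs
spec:
  endpointSelector:
    matchLabels:
      role: database       # all database pods, in all namespaces
  ingress:
    - fromEndpoints:
        - matchLabels:
            role: auditor
---
apiVersion: cilium.io/v2
kind: CiliumIdentity
metadata:
  name: "12345"            # the integer identity value
security-labels:
  k8s:role: teller         # the labels this identity represents
```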
This does give us 16.7 million unique application identities, which is sufficient for our purposes; but if for some reason that becomes a limitation, Geneve has the capability of carrying options that can be used to carry identity values greater than 24 bits. So the real crux of the problem is that host processes do not have labels that can be mapped to an identity, nor do they belong to a distinct network namespace that can be associated with an identity. So the million dollar question is: how do we identify traffic coming from host processes in the kernel, and assign a network identity to such traffic? I'm going to digress a bit here to talk about the original idea that is responsible for solving this problem here today. This is not my first rodeo. The idea dates back a couple of years, to the 2022 KubeCon in Detroit. I presented a use case of quickly resizing a particular pod, that is, increasing its memory allocation, when that pod issued a make command to build code, which is a CPU- and memory-intensive task. In that instance, I used an eBPF program to identify when the specific container I was interested in ran the make command, by looking up the cgroup ID of the container task. For that, I used the BPF helper function bpf_get_current_cgroup_id in the context of the exec system call. Knowing the cgroup ID allowed me to tell my container running the make command apart from other containers or other processes on the node running the same make command. This allowed me to trigger an in-place pod resize for my container only when my container issued the make command. That idea was well received. So why is this relevant? Guess what, the answer again lies in the cgroup ID. Now, we wave that eBPF magic wand once again, only this time we do it in the Cilium network traffic data path. It starts with the host process sending a packet.
In the kernel, the packet shows up as a socket buffer structure, or SKB. We use the BPF helper function bpf_skb_cgroup_id to look up the cgroup ID of the sending host process. We then use that cgroup ID to look up the 24-bit host process identity from the host process net ID BPF map, which is preconfigured by the control plane. We then insert this 24-bit identity value into the VNI field, the virtual network identifier field, in the Geneve or VXLAN header, and transmit the encapsulated packet. The last step is at the receiver: we look up the ingress network policy to determine if the sender is allowed or denied to talk to the target, and then we make an allow or drop decision on the incoming traffic. Now, let's understand our demo setup. Continuing on the theme of the banking business, for this demo we have a simple two-node cluster. There are two nodes: the master, and a node named node one. On node one, we have the bank database pod. And on the master, we have three pods: the bank teller pod, the bank auditor pod, and a thief. The bank teller needs to be able to talk to the bank database pod in order to do operations such as account credits and debits. The bank auditor pod also needs to be able to access the bank database pod; in addition, it needs access to the host network for monitoring. The thief is a bad actor who has social-engineered their way into getting host network access. Cilium and other CNI solutions have been great at enforcing network policies for regular pods, a.k.a. the bank teller pod and the bank database pod, but not the auditor or the thief pods, until now. In this demo, we will see how we can allow the bank teller and the bank auditor to access the bank database while keeping the thief out. I have a video recording of this demo. The smart thing to do would be to just play the video, but let's have some fun. Let's do this demo live; the video can be plan B. I'm going to apologize if it's too small for people in the back.
I've done my best to expand this. Let's start by looking at some pod specs. I have three pod spec files, YAML files, that we want to look at, for the bank teller, the auditor, and the thief, and then a policy file. We'll look at these as we go along. First, let's look at the bank teller pod. This file contains specs for two pods: the bank teller, which has label role=teller and which we schedule on the master node, and the bank DB pod, which has label role=database and which we schedule on node one. Let's create this. It's been created. Now let's look at the auditor pod. This has label role=auditor, and we schedule it on the master node. It has hostNetwork: true, which means it uses the host network namespace. Let's create this pod. And lastly, let's look at the thief. The thief is also scheduled on the master node, and it also has hostNetwork: true, which means it's using the host network namespace. All right, let's verify that the pods are up and running. We have our two-node cluster here, with the master node bearing the IP address 192.168.105.10. Let's look at the pods. As you can see, the bank auditor and the thief pod have the IP address 192.168.105.10, which is the IP of the master node. That's because they're using the host network. And the bank teller and bank DB pods are managed by Cilium, so Cilium assigns them the IP addresses 10.0.0.41 and 10.0.1.141. Now, without any policies in place, all pods can talk to all other pods. In this test, we will try to ping the bank database pod from the other three pods. Before I do that, let's watch the traffic coming in and going out of the bank DB pod. For that, I'm going to fire up a packet capture in Wireshark, filtering on the ping packets, with ICMP as a filter, and on the IP address of the bank DB pod, which is 10.0.1.141. Let's restart the capture. All right, now the fun starts. Let's first ping from the teller pod.
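For reference, the auditor pod spec described above might look roughly like this; the image and names are placeholders, and the essential bits are the role label, the node pinning, and hostNetwork: true:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bank-auditor
  labels:
    role: auditor          # matched by the Cilium network policy
spec:
  nodeName: master         # pin to the master node, as in the demo
  hostNetwork: true        # share the host network namespace
  containers:
    - name: auditor
      image: busybox       # placeholder image
      command: ["sleep", "infinity"]
```

The thief's spec would be identical apart from its name and labels, which is exactly why, without identities for host network pods, the database cannot tell them apart.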
With this command, I'm exec'ing into the bank teller, looking up the IP address of the bank database pod, and sending a ping to it, one ping only. As you can see, the ping was successful. The ICMP echo request went out, and it got a reply back from the bank DB pod. The source address is that of the bank teller, and the destination address is that of the bank DB. Look at this field here, the VNI, the virtual network identifier field. It has an ID value of 11638. Where does that come from? For that, let's look at the Cilium endpoints. CEP is my alias for getting the Cilium endpoints; it manages two pods, the teller and the DB pod, and the identity value for the teller is 11638. That's what Cilium puts on the wire. Now let's ping from the auditor. I do the same thing: I exec into the bank auditor pod and send a ping. The ping is successful. If you look here, it's using the IP address 10.0.3.6. Where does that come from? Well, it is the IP address of the cilium_host interface. The cilium_host interface is what Cilium uses to route traffic for host network pods. And it has an ID, the local host ID, a generic value of six. Let's try the thief. The thief is also able to ping. It's using the same source address as the auditor, and it has the same network identity as the auditor. So at the database pod, we cannot tell the difference between the auditor and the thief. Now we need some policies here that would stop the thief from being able to talk to the database. Let's look at what that might be. This YAML file specifies a simple Cilium network policy that says: apply this to pods with label role=database, and allow incoming traffic from pods with label role=teller or role=auditor. Let's create this policy, and let's verify. CNP is my alias for getting the Cilium network policies. And we can see it's got a creation timestamp and a UID, so it looks legit.
And the spec is what we expect. Now let's do our ping test again. These are our nodes; these are the pods. They're still running. Good. Now let's clear the clutter over here and try pinging from the thief. The thief cannot ping anymore. As you can see, the echo request went out, but there was no response. So the thief has been stopped. Let's verify that it still works for the teller. We exec into the teller, and the bank teller is able to ping, so it's working for the teller. And then let's try the auditor. The auditor cannot ping. Why? Let's look at the source ID of the thief and the source ID of the auditor. They have the same identifier, so you cannot distinguish between the auditor and the thief. So we need some eBPF magic here. Thankfully, I have a script that does just that. This little script, do-ebpf-magic 34567, takes the value 34567 and assigns it as the network identity for the bank auditor pod. Let's run the script. What it did is look up the container ID of the bank auditor, and then it found the cgroup ID for that container ID, which is this value here, 16347, and associated the identity value I passed in, 34567, with it. It created a CiliumIdentity object for that. And then it entered this value; this is bpftool output in hex, pardon me, but trust me that this is the cgroup ID mapping to the network identity value. Now that we have this in place, let's try the bank auditor ping again. This time, the ping worked. And if we look closely, what's changed? The ID. The network identifier 34567 is now going on the wire for the bank auditor pod. Let's verify that the thief hasn't been let through as well. Let's ping from the thief. Nope, the thief is still locked out. The auditor can reach the database, but the thief cannot. And just to make sure that I didn't break the bank teller, let's ping from the teller one more time. And the teller can ping.
So with that, we were able to allow the bank teller and the bank auditor to talk to the bank DB pod while keeping the thief out. That, my friends, is the magic of eBPF for you. Thank you. Demos like this could crash at any time. All right. Let's have a quick poll. Can you guess how much code I had to write in the network traffic data path in order to figure out the identity of the process that is sending the packet? Can I please get a show of hands? How many people think it's more than 100 lines of code? No? No way? OK, a few. How many people think it's more than 10, less than 100? OK. You're all very smart. How many people for less than 10? That's the only choice left. There are a few people who, I don't know. OK. Well, let's take a look. This is what it took. That's it. The fact that it is so ridiculously simple to make this happen speaks volumes about the maturity of Cilium for this purpose. Now let's switch gears a little bit. As we saw, identity-based policy enforcement for host processes is quite cool. But we cannot go home just yet. Think about it: what happens if a container jailbreaks and acquires the BPF capability on the host? It can manipulate the maps, and then we may be in trouble. So we still need a defense-in-depth strategy that's grounded in zero trust principles. And to that end, we have been using the trust fabric solution within eBay for a couple of years. It has evolved over the last five, six years; it started around the time SPIFFE and SPIRE started. It's been working well at eBay scale, and it's kept us legally compliant with the payment card industry (PCI) regulations, the security requirements that they ask for. At its core, this solution leverages JSON web tokens, JWTs. And in terms of performance, it does over 36 million token generations and north of 4 billion token validations each day. We're working to open source this code; I'm pushing for that.
And I hope to present at a future conference, if we get the chance, our segmentation solution that combines layer 7 trust fabric policies with layer 4 Cilium network policies. So, to summarize: securing apps at scale is a challenging problem, and we need an efficient way to enforce policies. Fortunately, eBPF continues to shine and show its power in this space. We broke new ground here today in extending layer 4 identity-based segmentation to host processes. Of course, we plan to contribute this feature to upstream Cilium. We have a bunch of issues to resolve, a couple of cases to figure out; after we are done with that, we should be able to get there. The bottom line is that we need layers of security, or what's called a defense-in-depth strategy. Zero trust architecture is where the industry is headed, and with this, we are aligning ourselves in that direction. And with that, I will conclude this session. Thanks a lot for attending. But before you go, I have one request: please scan this QR code. It will take you to a place where you can leave feedback for me. Please let me know what you found useful or what I could have done better. I'd love to hear back from you. I do look at the feedback, and I try to act on it. So please, please visit that link and leave feedback. Thank you. I can take some questions now; looks like we have some time for that. Hi. Thanks for the talk. The question I have is about the magic script you showed: how does this work in practice, to make it production-ready? In practice, what we would need to do breaks down into two parts. It comes down to the configuration of the identity. One of the things I touched upon there is the CiliumIdentity object; we need to create that, and that's where the control plane plumbing comes in. And we looked at it closely, looked at our application space, and asked: how do we do that?
That's where we were trying to figure out a good way to assign the identity values, given that we have so many applications. We have X number of applications, and then the number of instances; an application like the Kubernetes API server could have five instances in a typical production cluster. Now take that to some other applications: how do we map the identities? That is the harder part of the problem that we want to solve. And the control plane interface is what we are trying to define, which will be very generic. As for what this script does, it assumes that part is plumbed in. If you noticed, I passed that argument value, 34567. The script essentially takes that and pretends that this whole control plane plumbing is in place. That's going to be a lot of code. In the data path, this was one particular case: a communication happening cross-node, from a host network pod on the master to node one. We also need to resolve cases like: what if the host processes are on the same node? What if a host process is talking to the kubelet? What about the kubelet talking to a pod? That is allowed by default, because health checks run on that path, so we need to ensure that we don't break it by doing this. There are several corner cases like these that we need to resolve. It's not a whole lot of work; it's about prioritizing and bandwidth. You may imagine how things change a bit here and there in terms of priorities. This was not a very high priority project to begin with. It was like, OK, you just joined eBay, here is a fun project for you to do to get warmed up. Yeah, that's what it was. I hope that answered your question. I have a couple of questions; the second one is related to the first one. In order to inject the identity into the VNI, I assume that you have to build your eBPF program on the host, right? Yes, it is done by an eBPF program.
OK, what if in a production environment you are not allowed to build anything? Are you talking about permissions? Yeah, having the capability to build the eBPF program, injecting the identity for that particular pod or host process, and attaching the program. Attaching, of course, must be allowed, but perhaps not building the eBPF program on the node. Yeah, what this would look like is an extension of a new feature in Cilium, which already has the eBPF capabilities. The Cilium agent has full access to the system because it's a core infra component, which means it has the BPF system capabilities, such as CAP_BPF. It's allowed to attach and detach eBPF programs and do a bunch of additional things that it needs to do as a critical system daemon. Now, the part about how you specify an identity for a host process: that can be as simple as an annotation. Let's say you want to specify it; in this case, we have a host network pod, the bank auditor. Instead of what I did with the do-ebpf-magic script, you could set an annotation on the pod that says: my network identity is 34567. That is typically how the control plane would implement it. I didn't do it that way in this case because the script was easier to show; annotations are a little bit hidden behind the scenes. I did that in the demo a couple of years ago, using pod annotations where I specified the resources, and it wasn't as readily seen. This is a lot easier to see, so for demo purposes, this is the one. But yeah, I hope it answers your question. That's one way to do it; it's not the only way. But typically, we make liberal use of annotations. It's a very convenient feature in Kubernetes. In production environments, the identity would be looked up by various parameters: OK, this is an application, what namespace is it running in? Application namespaces are typically unique. And then you could have the cluster that varies.
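The annotation-based approach could look something like this; the annotation key here is hypothetical, and the pod name is illustrative, since a real control plane would define its own convention:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bank-auditor
  labels:
    role: auditor
  annotations:
    # Hypothetical key: an admission hook or agent would read this and
    # create the CiliumIdentity plus the cgroup-id -> identity BPF map
    # entry, replacing the manual do-ebpf-magic script from the demo.
    network-identity.example.com/id: "34567"
spec:
  hostNetwork: true
  containers:
    - name: auditor
      image: busybox
      command: ["sleep", "infinity"]
```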
It could be a staging or test cluster, a dev cluster, or a production cluster, and they would have different identities. So some kind of admission control hook would look up some kind of central database and inject that ID, so that the application developer doesn't have to specify it. That mapping has to happen, so the control plane part of the work is significant. And really, if you think about it, Cilium just needs to provide interfaces to make this happen, because our control plane, how we do it, may be very different from how you or a third company would do it. So we cannot become opinionated on that part. Sure, thank you. Sure. After reading the description, I had the impression that the talk would be focused not on host network containers but on host processes. Oh, it's the same thing. I believe in the abstract I did write both host network pods as well as host processes. This can be applied to host processes as well. The caveat is that it would apply to long-running host processes. The difference is in where the cgroup comes from. When you create a pod, containerd or CRI-O, your container runtime, creates the cgroup for that pod. In the case of host processes, an example would be the kube API server or the kubelet, the cgroup is created by systemd. So if you look in the BPF map paths, you'll see where it looks up system.slice, and under that you'll find all the host processes that are running. Now, if it's a very short-lived host process, I don't know; if we have a super fast way of figuring it out in the control plane and then plumbing it in, it could still happen, but that's a case that I've not really considered.
What we were really looking at, the use case here, is that the Kubernetes API server in cluster A should only be able to talk to the kubelets in cluster A, and not cluster B. Thanks for the reply, but my question is: what if I get exposed to some vulnerability after a node update, say through my package repository? I would like a centralized way to set up a firewall for host processes outside of Kubernetes. Right now we're using Ansible for that, and it would be nice to have something like a network policy actually describing the firewall at the nodes. Yeah, this doesn't preclude or prevent that kind of node-scoped firewall from being there; it can continue to work. That is the idea of the defense-in-depth strategy, right? You have multiple layers of security. You kind of assume that a motivated attacker or hacker will find a way to get into your network, so you can't just say, OK, I have my VPN, I'm all set. They could break into that and wreak havoc. What you need to do is limit the blast radius. What would be your advice on grouping our cgroups inside Kubernetes and distinguishing them from the cgroups managed by systemd, outside of Kubernetes? There is no difference; cgroups are a kernel construct. Every single process gets its own. Yeah, but how do you know, because the IDs look random; how do you distinguish them in real time? Oh, by using bpf_skb_cgroup_id; that function call is the easy part. The hard part is, when that lookup happens, is the map entry configured? That configuration can be done easily for long-running processes, which is what we want. As pods have a lifecycle, there's going to be a pre-start hook and a bunch of things that happen; you can hook into one of these and do it for pods. And for systemd processes: the kube API server is really a pod, but the kubelet is a systemd service.
There is a systemd unit file that goes along with it, and systemd is responsible for running it and for its health, keeping it alive. For those kinds of processes, the cgroup is created by systemd, it's managed by systemd, and you can reliably look them up. Hmm, OK, thanks. Thanks a lot. I hope that answers your question. We are out of time, but I'm happy to, I think there's another talk after this, so I'll step away from here and I'll be around in case you have more questions.