All right, hello, welcome everyone. So one of the principles that we've had in Cilium since the beginning is this idea that if we're going to build the networking from scratch for Kubernetes, how are we going to do that to make it as efficient as possible? And BPF is really a key part of this, because it allows us to extend that flexibility all the way down into the kernel, so we can look holistically at the entire networking implementation and figure out how we should implement each networking piece to efficiently handle events that are happening within the system.

So today we're going to talk about network policy, and we're going to focus particularly on this idea of scalable enforcement. When I talk about scalability, I'm talking about doing as little as possible each time something happens within the system. There are different ways you can optimize this: you can reduce the number of events in the system, and that's certainly one focus area of the project. At the same time, you are still going to have a certain number of events, and for each of those events you need to minimize the amount of work you're doing in order to implement an efficient data plane. And of course, in the context of network policy, we need to actually enforce that policy: the user has a certain intent about who should be able to talk to whom, and we need to carry out those instructions to have a secure implementation.

So when we look at a particular cluster, we have various different events happening in that cluster. There may be control plane events, things like state distribution between the different nodes, and between the Kubernetes API server and individual nodes. And at scale we're thinking about hundreds of clusters, thousands of nodes, perhaps tens or hundreds of thousands of workloads, so we want to minimize the amount of work we do to handle the network policy implementation across that sort of scale. Then, from an individual node's perspective, as you scale those workloads, those workloads are talking to one another. If we think about data plane events, we've got perhaps millions of packets per second running through the system. At that level of scale, how are we minimizing the amount of work we do while still scalably enforcing the policy?

As I mentioned, there are different areas you can focus on to make this the most efficient implementation you can. Today we're going to focus particularly on an individual node's perspective. That's influenced partly by the control plane events happening in the system, but it also matters how the data plane events are handled by Cilium. So we're going to focus on three areas today. First, we'll talk about how we apply the policy efficiently: taking a user-facing, human-readable, high-level policy and converting it into a machine-friendly implementation in BPF that can then enforce that policy. Secondly, we'll talk about how we efficiently enforce that policy: as packets come into the system, what are we doing to enforce that network policy at the data plane level? And finally, an important part of scale is having the tooling to understand what happened when something goes wrong, so we'll talk a bit about the debuggability tools that are available in the system.
So with that I'll hand over to Hemanth, who can talk about applying policy efficiently.

Thanks, Joe. So if you want to enforce network policy with Cilium, our users currently have two major options: you can either write Kubernetes network policies or Cilium network policies. If you've ever written a Cilium network policy, some of this might look familiar to you, but primarily the Cilium network policy has three major components. There's the subject part, which lets users define which workloads this network policy should select. Then there's the target part, which lets users specify, for the selected workloads, which traffic related to those workloads this network policy should be enforced on. And finally the action part, which lets users specify what Cilium should do with the traffic that was selected by this policy.

Before we understand how this policy is actually implemented, we need to understand Cilium's identity-based security model. What this really means is that it allows users to express network policy in terms of who is allowed to talk to whom, instead of having to worry about where a workload is deployed or what IP addresses are associated with it. In this example I have three entities, tiefighter, deathstar, and xwing, and these are deployed in Kubernetes as pods. In Cilium, every workload with a unique set of labels gets its own security identity, and because it's a lot more efficient to store and compare numbers, every security identity gets its own integer ID. This is a really powerful concept, because it allows us to have a network policy representation in the kernel that does not get impacted by pod churn: your pods can come and go, but the identity allocated for each of these entities does not really change.

So our goal for this section is to understand how the network policy defined by our users in terms of CRDs is converted into an identity-based allow list in the kernel, which is then used to enforce network policy. Every Cilium agent running in the cluster registers a watcher with the Kubernetes control plane to get updates on CiliumNetworkPolicy or NetworkPolicy objects. So every time a user creates a network policy object, the update is sent to every single node in the cluster. It might sound a little counterintuitive to broadcast updates to every single node, but it's important that we have all the latest policy defined in the cluster already available on the node, so that when a pod is scheduled onto that node, the time required to enforce that policy is as small as possible. When the agent receives an update, whether it's from a Cilium network policy or a Kubernetes network policy, the first thing the agent does is convert it into a standardized internal representation and store that in the policy repository; as an implementation detail, Cilium actually implements both kinds of policy the same way.

Now, if you take a look at a naive policy computation algorithm: for every policy that is known in the repository, and for every endpoint that's running on the node, and for each protocol and direction combination, and for every security identity you might want to allow traffic to, you need to have an entry in a BPF map.
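To make that naive approach concrete, here's a rough Go sketch. It's purely illustrative: the type and function names (Policy, Endpoint, PolicyKey, NaiveRegenerate, and so on) are hypothetical and not the Cilium agent's real internals; the point is just to show how the number of BPF map writes multiplies across policies, endpoints, protocol/direction combinations, and peer identities.

```go
package naive

// Hypothetical types for illustration only; not Cilium's real internal types.
type Identity uint32

type Policy struct {
	PeerIdentities []Identity // identities this policy allows traffic to/from
}

type Endpoint struct {
	ID uint16
}

type PolicyKey struct {
	Peer      Identity
	Port      uint16
	Protocol  uint8
	Direction uint8 // 0 = ingress, 1 = egress
}

// NaiveRegenerate recomputes every endpoint's allow list from scratch on every
// policy update: O(policies x endpoints x directions x protocols x identities)
// BPF map writes per iteration.
func NaiveRegenerate(policies []Policy, endpoints []Endpoint,
	writeMapEntry func(ep Endpoint, key PolicyKey)) {
	for _, pol := range policies {
		for _, ep := range endpoints {
			for _, dir := range []uint8{0, 1} {
				for _, proto := range []uint8{6, 17} { // TCP, UDP
					for _, peer := range pol.PeerIdentities {
						writeMapEntry(ep, PolicyKey{
							Peer:      peer,
							Protocol:  proto,
							Direction: dir,
						})
					}
				}
			}
		}
	}
}
```

With the numbers mentioned next, this inner loop body can easily run hundreds of thousands of times per policy event, which is exactly the cost the caching described below is meant to avoid.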
To illustrate how much volume this can generate: our users today might have thousands of policies in a cluster, every single node might have hundreds of pods running on it, and you might want to allow traffic to tens of thousands of security identities, depending on how many workloads are running in the cluster or how many clusters are meshed together. So you can see that very easily you might have hundreds of thousands of BPF map entry updates that need to happen on every policy computation loop.

So how can we do better here? Maybe we don't have to process every single policy update event, and we can be smart about it. Maybe we can cache and reuse all the intermediate computed state. And maybe we can employ some efficient data structures to do fast lookups and efficient computation. So let's look at a few common scenarios that the agent generally has to deal with, and we'll walk through what the agent does when it receives each of these updates.

The first update is a policy creation event. Let's imagine a scenario where you have a Kubernetes cluster with no workloads on it, and a user creates a network policy that I'll call deathstar-empire-access. What this defines is that we want to allow ingress traffic from any pod that belongs to our empire, that is, pods with the label org=empire, to pods that have the labels org=empire and class=deathstar. So basically we want to allow traffic to the deathstar from any pod that's running in our empire. The first thing the agent does is take the contents of the CNP, convert it into the standardized representation we discussed before, and store it in the policy repository, and every policy gets a revision number; we'll talk a little more about how this revision number is used. The next part is that we want to allow ingress traffic from pods that have the label org=empire. If we want to convert this section into a BPF identity allow list, we need to convert the org=empire labels into their corresponding identities, and every time we convert a set of labels into its identities, it's a fairly expensive operation. So Cilium caches the result of resolving org=empire into its backing identities. This data is stored in something called the selector cache, and at this point we have no workloads running in the cluster, so it selects no identities.

Now let's imagine there's a new pod called deathstar-1 scheduled onto this node, and this pod has the labels org=empire and class=deathstar. Because our cluster has never seen this set of labels before, it gets allocated a security identity, say 1234, and this identity gets stored in something called the identity store. Now that we have a pod with identity 1234, and it also has org=empire, the selector cache entry is updated to include 1234.

The Cilium agent has a dedicated worker for every endpoint, that is, every pod running on the node. The responsibility of this worker is to make sure that the policy representation for that endpoint is up to date with the latest revision number present in the policy repository. What this worker does is get the latest known policy, compare the labels present on the pod with the labels expressed in the network policy, and come up with the relevant set of rules for this specific endpoint. And again, this is also an expensive operation.
So the Cilium agent caches that in something called the policy cache. And we can cache it against the identity of the pod, so that we can reuse this computed policy the next time we see a pod that looks like this. The Cilium agent also does something interesting here: soon after the entry in the policy cache is added, it goes back to the selector cache and updates the references to the entries in the policy cache for every selector that selects org=empire. So finally we have all the data we need, and we can join the data between the selector cache and the policy cache to compute the final identity allow list and write it into this endpoint's BPF map.

Now what happens if you have another pod called deathstar-2, and this pod also has the same set of labels? It inherits the identity 1234, and we already have all the data necessary to compute policy for this endpoint, so we simply join the data and write the contents into this endpoint's BPF map.

So far we've been talking about events happening on just one node, but what happens if a new pod is created on a remote node? Let's say there's another node, node two, and this time there's a pod called tiefighter with the labels org=empire and class=tiefighter. Because we have not seen these labels before, Cilium allocates a new identity for it, 2345, and node one instantly gets notified about this new security identity and updates its own local identity store. This is the state on node one before it received the update about 2345. Now that we have another identity that matches org=empire, the selector cache entry gets updated with 2345 as well, and remember we spoke about how the selector cache has mappings back to the policy cache. So from the selector cache we can quickly jump to the entries in the policy cache that are relevant here, and this allows us to very quickly compute only the new entries that need to be written to the BPF map. There could already be thousands of entries in a BPF map, but all we need to compute is the entries for identity 2345. This gives us really fast incremental policy updates.

Now what happens if there's another pod called tiefighter-2 on a remote node? We already know the labels for this pod, so it inherits the identity 2345, and we have almost nothing to do on node one. All the policy that is necessary for 2345 is already in place, so from a policy computation perspective, there's absolutely no work that needs to be done.

So to recap this section: the Cilium agent caches all the expensive operations during policy computation, we're being smart about when we regenerate the policy (we only generate it once, the first time we see a new identity), and we're using efficient data structures to make sure we can quickly compute incremental policy updates.
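To give a rough feel for the data structures involved, here's a minimal Go sketch of a selector cache that maps each selector to the identities it currently selects, plus back-references to the cached per-endpoint policy entries that use it, so that a newly learned identity only triggers an incremental update. All names and fields here are hypothetical simplifications, not the Cilium agent's actual implementation.

```go
package selectorcache

// Illustrative sketch only: hypothetical, simplified types.

type Identity uint32
type Labels map[string]string

// Selector matches a set of labels (e.g. org=empire).
type Selector struct {
	MatchLabels Labels
}

func (s Selector) Matches(lbls Labels) bool {
	for k, v := range s.MatchLabels {
		if lbls[k] != v {
			return false
		}
	}
	return true
}

// cachedSelector holds the identities a selector currently selects and
// back-references (callbacks) to the cached policy entries that use it,
// so an identity update only touches the affected endpoint policy maps.
type cachedSelector struct {
	selector   Selector
	identities map[Identity]struct{}
	users      []func(added Identity) // notify cached policies of the delta
}

type SelectorCache struct {
	selectors []*cachedSelector
}

// AddIdentity is called when a new security identity (local or learned from
// a remote node) becomes known. Only selectors matching the new labels are
// updated, and only the delta is pushed down to the BPF policy maps.
func (sc *SelectorCache) AddIdentity(id Identity, lbls Labels) {
	for _, cs := range sc.selectors {
		if !cs.selector.Matches(lbls) {
			continue
		}
		if _, ok := cs.identities[id]; ok {
			continue // already known, nothing to do
		}
		cs.identities[id] = struct{}{}
		for _, notify := range cs.users {
			notify(id) // incremental BPF map update for this identity only
		}
	}
}
```

In the tiefighter example above, learning identity 2345 would walk only the selectors that match org=empire and emit map entries for 2345 alone, rather than recomputing the whole allow list.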
So now I'll let Joe talk about how this computed policy is used in the data path to enforce the network policy.

All right, so taking it back to the high level, we're trying to ask: how do we do as little work as possible any time events occur in the system? As we get down to the per-packet level and how the kernel implements the networking behavior, I want to focus on two particular areas here. One is provisioning the BPF programs that actually run every time a packet event occurs. And secondly, I want to go through the different functions that are happening at the data path level and talk about the optimizations we've put in place to make that efficient.

So first, let's look at what happens when a pod gets scheduled onto a node. The kubelet gets the scheduling request and reaches out to the container runtime to create the sandbox for that pod, ready for the application to run. The container runtime calls out to the networking plugin to actually attach that sandbox to the network. At that point the CNI plugin, Cilium here, informs the Cilium agent that it needs to load and attach the BPF programs that will run on those events in the system.

We're going to zoom in a little more on this last part of the diagram. What are we trying to achieve? First, we take all the pod metadata we have for this particular pod: things like IP addresses, the security identity, and other configuration for that endpoint. If we take all of that information together, we can tailor the individual BPF program for the particular endpoint, the particular pod. By doing things like embedding the security identity directly into the BPF instructions, we can make this incredibly efficient. Once we've compiled that BPF program for the particular endpoint, we can load those BPF programs into the kernel, along with the map state, which is the more dynamic state the kernel needs to perform the networking functions, and then we attach that program to the event, like a packet processing event. Finally, once we've done all of this, we can return from our CNI add call, the network is ready to actually handle traffic, and the container runtime can start your application.

One of the things we noticed when we actually started to measure this is that the compile step was taking longer than we'd like. We put metrics in place to measure the different phases, and we identified that the compilation step was taking too long. With this compilation step, what we were trying to achieve is that for every single packet being handled in the system, we minimize the amount of work needed to handle that packet. But that introduced a trade-off where the cost of handling pod churn went up. So we wanted a way to actually achieve both of these things.

The solution we came up with is what we call ELF templating. The idea is that when we compile the BPF programs the very first time on the system, we store references to where the security identities and IP addresses and so on sit within the BPF instructions, and template that. So what happens when a second or later pod gets deployed onto the node is that we can take this ELF and substitute the security identity and so on directly into the individual BPF instructions. We get all the per-packet efficiency that we had before, but we've also drastically decreased the amount of time it takes for that pod initialization to occur.
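As a very rough illustration of the "compile once, substitute per-endpoint constants" idea, here's a hedged Go sketch. The symbol names and the patching scheme are hypothetical and much simpler than Cilium's actual ELF handling; it only shows why instantiating a template per pod is far cheaper than recompiling per pod.

```go
package elftemplate

import "encoding/binary"

// Illustrative sketch only: hypothetical, simplified ELF templating.

// Template holds the compiled BPF object bytes plus the byte offsets
// (recorded at first compile) where per-endpoint constants live.
type Template struct {
	object  []byte           // compiled BPF ELF produced once per node
	offsets map[string][]int // symbol name -> offsets of its 32-bit slots
}

// EndpointConfig is the per-pod metadata baked into the program.
type EndpointConfig struct {
	SecurityIdentity uint32
	IPv4             uint32
}

// Instantiate returns a copy of the template with the endpoint's constants
// written over the placeholder values, avoiding a full recompile per pod.
func (t *Template) Instantiate(cfg EndpointConfig) []byte {
	obj := make([]byte, len(t.object))
	copy(obj, t.object)

	patch := func(symbol string, value uint32) {
		for _, off := range t.offsets[symbol] {
			binary.LittleEndian.PutUint32(obj[off:off+4], value)
		}
	}
	patch("SECURITY_IDENTITY", cfg.SecurityIdentity) // hypothetical symbol names
	patch("ENDPOINT_IPV4", cfg.IPv4)
	return obj
}
```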
So with that in mind, we finally arrive at what happens when an individual packet actually gets sent through the system. In this case, we've got the tiefighter, and it sends a packet trying to reach the deathstar. There are various different network functions that are necessary to implement policy. I'm not going to do a deep dive of the entire data path here, but there are some key areas that are important.

First, if the traffic is destined to the deathstar service, we need to load balance that traffic to an individual backend, because only once we know the backend can we determine the security policy for that particular traffic. Secondly, we go into what we call the connection tracking table. At a high level, when you're defining a network policy, you tend to express it in terms of something like "tiefighter can talk to deathstar." This is inherently connection oriented: you're talking about one particular application talking to another. But at the packet level, you actually have two-way communication; you have reply packets going back to the original tiefighter. So when we want to implement the policy that says tiefighter can talk to deathstar, we actually need to allow the replies back as well. What the connection tracking table does is let us associate the reply packets back to the original connection, so we can express the policy purely in terms of the original direction of traffic.

Next we move on to determining the identity of the peer traffic. There are various different ways we can do this: we can pull this information out of the headers of a tunnel packet, or we can distribute the state between nodes through the control plane. Which one applies depends on whether we're applying this on ingress or egress, whether you're in tunneling mode, and various things like that. And finally, now that we know everything about this traffic, we know who's talking to whom, we can look up the policy map. This is all the state that Hemanth described earlier, and we can efficiently figure out: should we drop the traffic, should we allow it, should we redirect it? After that, we can apply things like authentication and route the traffic to where it's going.

I want to look into a few optimizations we've put in place at individual steps in this process. First up, if any of you were around for Martynas's talk in San Diego, he did a deep dive into how Cilium's kube-proxy replacement works. One of the core ideas there is to move the load balancing decision up to the socket layer, so that when it comes to per-packet processing, we have nothing left to do. There's nothing quicker than doing absolutely nothing, so from a per-packet perspective, that's really fast.

Secondly, we use an LRU-based hash table for the connection tracker. What's interesting about the connection tracker as part of a networking system is that it's tracking all of the active connections running through the system at any given time, and depending on the scale of your system, that table can end up being very, very large. One of the problems we have is: how do we efficiently keep that table accurately tracking the connections that are active right now? That involves things like cleanup, and you do spend CPU cycles trying to clean up this table. By using a hash table with the least-recently-used property, when a new connection arrives we can automatically evict old connections from the table, and then the garbage collection process is much faster.
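To illustrate the least-recently-used idea, here's a minimal userspace sketch of an LRU-capped connection tracking table in Go. In the actual data path this lives in a kernel BPF LRU hash map, so treat the structure and names below as an illustration of the eviction behavior rather than Cilium's implementation.

```go
package conntrack

import "container/list"

// Illustrative sketch only: a simplified userspace LRU connection table.

// Tuple identifies a connection in the original direction.
type Tuple struct {
	SrcIP, DstIP     uint32
	SrcPort, DstPort uint16
	Proto            uint8
}

type Entry struct {
	SourceIdentity uint32 // identity decided on the first packet
}

type pair struct {
	key   Tuple
	entry Entry
}

type Table struct {
	capacity int
	order    *list.List              // front = most recently used
	entries  map[Tuple]*list.Element // each element holds a *pair
}

func New(capacity int) *Table {
	return &Table{capacity: capacity, order: list.New(), entries: map[Tuple]*list.Element{}}
}

// Lookup returns the tracked entry (e.g. for a reply packet) and marks the
// connection as recently used.
func (t *Table) Lookup(k Tuple) (Entry, bool) {
	el, ok := t.entries[k]
	if !ok {
		return Entry{}, false
	}
	t.order.MoveToFront(el)
	return el.Value.(*pair).entry, true
}

// Insert adds a new connection; if the table is full, the least recently
// used connection is evicted on the spot instead of waiting for a separate
// garbage collection scan.
func (t *Table) Insert(k Tuple, e Entry) {
	if el, ok := t.entries[k]; ok {
		el.Value.(*pair).entry = e
		t.order.MoveToFront(el)
		return
	}
	if t.order.Len() >= t.capacity {
		oldest := t.order.Back()
		t.order.Remove(oldest)
		delete(t.entries, oldest.Value.(*pair).key)
	}
	t.entries[k] = t.order.PushFront(&pair{key: k, entry: e})
}
```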
And finally, if we look at the policy level, we want to ensure that when we enforce policy, it has a constant-time performance characteristic. If you have thousands and thousands of policies at the Kubernetes layer, we don't want to map that straight down into the data path, so that with millions of packets flowing through the system per second we're running through thousands of rules to decide whether to allow the traffic or not. All the work that Hemanth talked about is about amortizing that cost and building up this map, so that by the time we need to decide who can talk to whom, all the information needed to make that decision is already prepared in the data path and the lookup can happen in constant time.

So to wrap up the section: we talked about embedding pod metadata, things like the security identity, directly into the instructions; about how we measure and amortize the cost of pod churn; and about the different strategies we've used at the data path level to optimize performance, whether that's triggering on events with a lower frequency, mitigating the CPU usage for maintaining tables, or avoiding linear iteration at the per-packet level, so that we're doing as little work per packet as possible. With all that in place, Hemanth is going to walk through debugging the policy.

So every once in a while, you might want to validate that Cilium is enforcing the intended network policy correctly, or you might have users reaching out asking why they're unable to talk to their destination service, or why they're seeing timeouts to a specific IP address, or tens of other variants of the same question. Luckily, Cilium from the get-go has invested in tooling to inspect and debug every stage of this policy computation process. In this section, I'll walk through some of the tooling that already exists today and how we can look at some of the pieces we discussed in the earlier sections.

The first step: if you want to inspect all the policy that a given Cilium agent knows about, you can exec into the Cilium agent pod, where there's a binary called cilium-dbg, and use the command cilium-dbg policy get. This dumps all the known policies on this specific Cilium agent. We spoke about the selector cache; if you want to inspect the in-memory selector cache, you can use cilium-dbg policy selectors to look at every selector and the identities it maps to, and you can see the identities 1234 and 2345 we spoke about earlier. We don't have a direct way to look at the policy cache, but you can run cilium-dbg endpoint list to get the list of all the endpoints running on the node, and once you have the endpoint ID, you can get the full contents of that specific endpoint. One interesting property here is that the derived-from-rules section shows the policy called deathstar-empire-access, so if you're curious which policy is being applied to a given endpoint, you can always verify it from here. And finally, if you want to look at the computed BPF-based allow list, you can run cilium-dbg bpf policy get. This is a direct representation of the data that's written to the per-endpoint BPF map, and what you see here is that the target identities are already converted back into their backing labels.
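Tying this back to the constant-time point: the per-endpoint policy map is keyed roughly on the peer's security identity plus port, protocol, and direction, so enforcement is a single hash lookup rather than a walk over rules. Here's a hedged Go sketch of that general shape; the real key and value layouts live in Cilium's BPF C code and differ in detail, so the field names below are illustrative.

```go
package policymap

// Illustrative sketch only: the real per-endpoint policy map is a BPF hash
// map defined in C; this mirrors its rough shape to show why a packet
// verdict is a single constant-time lookup.

type PolicyKey struct {
	PeerIdentity uint32 // numeric security identity of the remote peer
	DestPort     uint16 // 0 could act as a wildcard for "any port"
	Protocol     uint8  // e.g. 6 = TCP, 17 = UDP; 0 for "any"
	Ingress      bool   // direction the rule applies to
}

type PolicyEntry struct {
	Allowed   bool
	ProxyPort uint16 // non-zero if traffic should be redirected to a proxy
}

// Verdict is what the data path conceptually does per packet: one map lookup
// keyed on who is talking (identity) and how (port/protocol/direction),
// independent of how many policies exist at the Kubernetes layer.
func Verdict(m map[PolicyKey]PolicyEntry, k PolicyKey) bool {
	if e, ok := m[k]; ok {
		return e.Allowed
	}
	// A fuller implementation would also try wildcard keys (any port or
	// protocol) before falling back to the default deny.
	return false
}
```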
But if you want to look at the raw contents, you can always use bpftool map dump to get the exact raw contents. And cilium-dbg has a lot more subcommands like this. Another important one is cilium-dbg endpoint log, which gives you all the recent logs that have happened in the context of one specific endpoint. There are also a lot of metrics available: you can monitor Cilium's BPF map pressure (Martynas had a demo about this earlier), you can use the policy enforcement status metrics to understand how many endpoints are currently in enforcement mode, and you can also monitor your policy enforcement delay. So I'd highly recommend checking out docs.cilium.io to look at all the command references we have. And if you haven't used it already, a quick shout-out to Hubble: hubble observe is more user facing, so you can see which source identities are talking to which destination identities, and if you're a developer and want more detail on why exactly a packet was dropped, you can get references to the exact line number in the BPF code to understand where the drop is coming from.

All right, so hopefully you've got a bit of a picture of how we make policy enforcement scalable by doing as little as we can whenever events occur in the system. We talked about how to apply the policy efficiently, how to enforce that policy efficiently, and how to debug that system efficiently. If you're interested in discussing more of this, come and have a chat with us, we have a booth, and Hemanth actually has another talk on scalability later this week. And of course, we'll be happy to receive your feedback on this talk. So thank you.