Hello everyone. I'm super excited to see all of you interested in scalability. We will be talking about scaling Kubernetes networking to 1,000, 5,000, and 100,000 nodes. I'm Marcel Jumbo, I work at Isovalent, and together with me we have Dorda Lepczaevic from Google.

So let's start with Cilium. What is Cilium? Probably most of you have heard about Cilium, and the most popular piece is the Cilium CNI, a secure, scalable CNI plug-in that you can use for Kubernetes. Besides the Cilium CNI, we also have Hubble, in case you are interested in network observability, and last but not least Tetragon. Tetragon allows you to secure your container runtime with policies similar to network policies. What all of these projects have in common is that they utilize eBPF.

So, a short introduction to eBPF. You can think of it as writing small programs that can be attached to different events within the Linux kernel. For example, if you are interested in observability, you can export information from the kernel into an eBPF map and then access that data from user space. It also works the other way around. Say you want to implement service load balancing in Kubernetes: you can take a service, take all the backends behind that service, write those IPs into an eBPF map, and then have the eBPF program translate the cluster IP into one of the backends (there's a rough sketch of the user-space side of this a bit further down). And one of the most important properties of eBPF is that it's far more efficient than the alternatives.

We'll be focusing today mostly on the Cilium CNI. A short summary of what it is: first of all, an efficient, scalable Kubernetes CNI that also provides security, both Kubernetes network policies and the more advanced Cilium network policies. As mentioned before, service load balancing: if you are interested in a kube-proxy replacement, you can use Cilium to do the service load balancing instead of kube-proxy. And last but not least, which we'll be covering later on as well, multi-cluster: if you are interested in running multiple clusters and in connectivity between them, you can use Cilium for that too.

But let's start with understanding what scalability even means. Our title mentions 100,000 nodes, but scalability is not just the number of nodes; it's much more than that. When we think about the scalability of Kubernetes, there are many more dimensions that we care about. Nodes is just one of them: how many pods do you have, how many network policies, how many services, and how many backends do those services have? With that in mind, we really need to think about all of those dimensions when testing the networking scalability of Kubernetes.

But then again, let's say we have some rough numbers for what we want to test in terms of scale. What we need to do is understand when our cluster is happy, and what that even means. We need SLIs and SLOs. I've listed a few SLIs and SLOs that we care about when scale testing networking in Kubernetes. One example is pod startup latency: we care how much time it takes for your pod's connectivity to be up, and similarly for the node. We also care about network programming latency.
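To make the eBPF-map-based service load balancing mentioned above a little more concrete, here is a minimal, hypothetical sketch of the user-space half of it in Go: agent-side code writes a service's backends into a pinned eBPF map so that an (omitted) eBPF program in the kernel can translate the cluster IP into one of the backends. The map path and the key/value layouts are made up for illustration and are not Cilium's actual datapath structures.

```go
// Hypothetical user-space side of eBPF service load balancing.
// Assumes a map was already created and pinned by a loader program.
package main

import (
	"encoding/binary"
	"log"
	"net"

	"github.com/cilium/ebpf"
)

// lbKey and lbValue mirror made-up fixed-size structs the eBPF program would
// use; real Cilium datapath structs look different.
type lbKey struct {
	ClusterIP uint32 // service cluster IP
	Port      uint16
	Slot      uint16 // backend slot index
}

type lbValue struct {
	BackendIP uint32 // backend pod IP
	Port      uint16
	Pad       uint16
}

func ipToU32(ip net.IP) uint32 {
	return binary.BigEndian.Uint32(ip.To4())
}

func main() {
	// Illustrative pin path, not a real Cilium path.
	m, err := ebpf.LoadPinnedMap("/sys/fs/bpf/example_lb_backends", nil)
	if err != nil {
		log.Fatalf("open pinned map: %v", err)
	}
	defer m.Close()

	clusterIP := net.ParseIP("10.96.0.10")
	backends := []string{"10.0.1.12", "10.0.2.7"}

	// One entry per backend slot; the eBPF program picks a slot per connection.
	for i, b := range backends {
		k := lbKey{ClusterIP: ipToU32(clusterIP), Port: 53, Slot: uint16(i)}
		v := lbValue{BackendIP: ipToU32(net.ParseIP(b)), Port: 53}
		if err := m.Put(&k, &v); err != nil {
			log.Fatalf("update backend %d: %v", i, err)
		}
	}
}
```

The point of the pattern is that the kernel-side program only does map lookups on the hot path, which is where the efficiency the talk mentions comes from.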
So you can think of it as how much time it takes for the backends behind services to be propagated across the cluster; if you apply network policies, how much time it takes for those policies to be enforced; and last but not least, in-cluster network latency and throughput. These are just a few examples, there is more than that, but these are the ones most related to networking, and some of them we'll cover later on as well. And with that in mind, the stage is yours, Dorda.

So, I'm going to talk about network security at a large scale, particularly how workload security with Kubernetes network policies can scale up to 5,000 nodes and 200,000 pods in a single cluster. I'm going to cover the target scale we want to achieve, how network policies are implemented, what challenges we overcame and how, the performance and metrics, and improvements in progress.

First, as Marcel mentioned, there are a number of scalability dimensions that we care about. In this case we want to be able to support up to 5,000 nodes, 200,000 pods, 100 pod changes per second, up to 10,000 network policies, and 20 network policy changes per second.

We'll start with a little bit about network policies: what they are, how they're used, and an important detail for them, the security identity. A security identity is generated from pod labels and namespace labels; network policies select those same pod and namespace labels, and pod-to-pod communication is then allowed only between pods whose identities are selected.

To illustrate how network policies are implemented, we can look first at the control plane, which is handled by the Cilium agent, a DaemonSet that runs on every node. It watches some Kubernetes API resources, the standard ones such as pods, namespaces, and network policies, plus the Cilium custom resources that are derived from pods and namespaces: Cilium endpoints and Cilium identities. With all of this data it can populate the eBPF maps. The first eBPF map, which is per node, is the ipcache; it maps pod IPs to the security identities I mentioned. Then we have the eBPF policy maps, which are per endpoint, so every pod has one; each one consists of the identities that are allowed to communicate with that pod.

Then there is the enforcement of policies. On the right side we see the data plane when pod A is trying to communicate with pod B, and there is a network policy on the ingress side of pod B. When the packet comes in and network policy enforcement is triggered, it first tries to map the IP of the incoming pod to an identity, which in this case it manages to find. After that it checks whether this identity exists in pod B's policy map; if it does, the traffic is allowed, as it is in this case. If it didn't, the traffic would simply be dropped.
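As a rough model of the enforcement path just described (ipcache lookup first, then the destination endpoint's policy map), here is a simplified Go sketch. In reality this logic runs as eBPF code in the kernel against BPF maps; the type names and identity values below are invented for illustration.

```go
// Simplified model of ingress policy enforcement:
// source IP -> identity (ipcache), then identity lookup in the
// destination pod's policy map.
package main

import "fmt"

type Identity uint32

// IPCache: one per node, pod IP -> security identity.
type IPCache map[string]Identity

// PolicyMap: one per endpoint (pod), identities allowed to reach it.
type PolicyMap map[Identity]struct{}

func allowIngress(ipcache IPCache, dstPolicy PolicyMap, srcIP string) bool {
	id, ok := ipcache[srcIP]
	if !ok {
		return false // unknown source: drop
	}
	_, allowed := dstPolicy[id]
	return allowed
}

func main() {
	ipcache := IPCache{"10.0.1.12": 51234} // pod A's IP -> identity
	podBPolicy := PolicyMap{51234: {}}     // pod B allows identity 51234

	fmt.Println(allowIngress(ipcache, podBPolicy, "10.0.1.12")) // true: forward
	fmt.Println(allowIngress(ipcache, podBPolicy, "10.0.9.9"))  // false: drop
}
```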
So what was the bottleneck when trying to scale network policies to 5,000 nodes and 200,000 pods with 100 pod changes per second? The thing is that in order to program the eBPF maps on every node, every node needs to know the IP to security identity mapping for every pod. This means that if we are changing 100 pods per second, we have to send one event per change to every node, which is 500,000 events per second, and the kube-apiserver, even on the most powerful machines, cannot handle that much. Over 100,000 events per second is already troublesome, so the safe limit in this case would be about 1,000 nodes for 100 pod changes per second.

What we did to overcome this was to implement batching of Cilium endpoints, inspired by Kubernetes EndpointSlice. It's done by slicing the entire pool of endpoints. In the case of Kubernetes EndpointSlice the point is to reduce the size of the Endpoints object; here we are reducing the number of events by batching Cilium endpoints into groups of 50 per slice.

There is a diagram on the right that illustrates how it's done. When a pod is created, at point one we have the Cilium CNI add, and at that point a Cilium endpoint is created in the kube-apiserver by the Cilium agent. The Cilium operator, a Deployment that runs in the cluster, has the job of batching Cilium endpoints into slices; it sees this Cilium endpoint creation, batches it into a slice, and posts that update. Point five is when all of the Cilium agents in the cluster receive this Cilium endpoint slice; all of them can then update the ipcache map that I mentioned on the previous slide, so they all learn about this new pod's IP to identity mapping.

So why does batching endpoints into slices work? The main issue is that the Kubernetes control plane is heavily impacted by having too many events. When there are 500,000 events per second that the kube-apiserver needs to handle, performance drops, and in some cases watches get terminated and the whole cluster is in trouble. Sending fewer, larger requests enables the kube-apiserver to handle the 100 pod changes per second. The Cilium endpoint slice also contains the minimum amount of data for each Cilium endpoint: storing the security identity as a single 64-bit integer instead of a list of strings, which can be a lot, significantly reduces the size of each endpoint. One slice contains 50 endpoints, and a full slice is on average five times smaller than 50 Cilium endpoints.

What about performance? As I said, we've demonstrated that the scale can grow from 1,000 nodes to 5,000 nodes with the same churn rate on the pods. The number of pods can also grow, since there are limits on how many pods can run per node; in general we support 200,000 pods because we recommend about 40 pods per node at this scale. Another thing to note is that Cilium endpoint slice updates are rate limited to 10 per second, because we can still overload the API server and need to rate limit. With that we are sending ten times fewer events: with Cilium endpoints we used to send up to 500,000, and now we send 50,000, and we can support up to 500 pod updates per second. In the worst-case scenario, where you suddenly wanted to update the security identities of all of the pods in your cluster, it would take up to 400 seconds, so a little less than seven minutes.
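Here is a rough sketch of the batching idea behind Cilium endpoint slices, under the assumptions above (slices of up to 50 endpoints, identity stored as a plain integer). The types and the batching policy are simplified; the real operator also handles updates, deletions, priorities, and rate limiting.

```go
// Sketch: batch pending endpoint changes into slices of at most 50,
// so each batch is one watch event per node instead of 50.
package main

import "fmt"

const maxEndpointsPerSlice = 50

type Endpoint struct {
	PodIP    string
	Identity uint64 // identity carried as a single integer, not label strings
}

type EndpointSlice struct {
	Endpoints []Endpoint
}

// batch splits pending endpoints into slices of at most maxEndpointsPerSlice.
func batch(pending []Endpoint) []EndpointSlice {
	var slices []EndpointSlice
	for len(pending) > 0 {
		n := maxEndpointsPerSlice
		if len(pending) < n {
			n = len(pending)
		}
		slices = append(slices, EndpointSlice{Endpoints: pending[:n]})
		pending = pending[n:]
	}
	return slices
}

func main() {
	pending := make([]Endpoint, 120) // 120 pod changes queued up
	slices := batch(pending)
	fmt.Printf("%d endpoints -> %d slice updates\n", len(pending), len(slices))
	// With 5,000 nodes watching, that is 5,000*3 events instead of 5,000*120.
}
```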
Okay, so now let's look at which metrics and SLIs we use to evaluate the improvements we made on the bottleneck, which is propagating Cilium endpoints to all of the nodes so that they see the IP to identity mapping. The Cilium endpoint propagation delay metric exists in the Cilium agent and in this case stands in for the policy enforcement latency, because the policy programming latency on each node is very low, less than five seconds after the endpoint has been propagated, regardless of scale. The diagram we see here is a four-step process where the start time is the Cilium endpoint creation and the end time is the Cilium endpoint being received through a Cilium endpoint slice; that whole delay is what the Cilium endpoint propagation delay metric shows.

What are the other challenges we are currently facing? For network policies to work at a very large scale, it's possible that there will be too many Cilium identities, and there are limits there. A Cilium identity is the security identity for pods, and there is a hard limit of 65,000 security identities per cluster, just by design. The bigger restriction is the per-pod BPF policy map, which holds 16,000 security identities per map. There are things that can push you over 16,000 security identities, for example having unique label sets for most or all of your pods, so even just having 16,000 pods can in a sense break the cluster: new pods will not be able to start because their eBPF policy maps cannot be populated.

There is also another problem, identity duplication, which comes from the distributed management of Cilium identities: Cilium agents trying to create an identity for the same unique label set at the same time will create different ones, and that duplication means there are even more security identities in the cluster. And one ongoing issue is the namespace label change: because security identities depend on namespace labels, if namespace labels are changed, the security identities for all pods in that namespace have to change as well.

So what improvements are we currently working on? The first one is centralized identity management: we are moving identity management from the Cilium agents to the Cilium operator. This resolves identity duplication and reduces pod startup latency, because Cilium identities are created on pod creation instead of on Cilium endpoint creation. It is also a security improvement: the Cilium agent loses the permission to write Cilium identities and no longer has a vulnerability there. And it enables further improvements and optimizations, for example being able to use a security-relevant labels filter even without restarting Cilium agents, which greatly reduces the number of security identities; in some cases we can reduce them by a factor of 100.

Another thing is improving performance and reliability depending on scale by adding dynamic rate limiting of Cilium endpoint slice updates. We want to eventually rely on Kubernetes API priority and fairness, but for now we still need this rate limiting, and dynamically scaling it up and down based on the size of the cluster and the size of the Kubernetes control plane VMs is how it's going to work. Another item is faster policy enforcement for system-critical pods, achieved through priority propagation of Cilium endpoint slices. And then there is a reduction of policy enforcement latency at large scale that we haven't been working on ourselves: in the kube-apiserver there is an optimization coming soon, in Kubernetes 1.29, for processing events for many watchers.
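To illustrate why a security-relevant labels filter can cut identity counts so sharply, here is a hedged Go sketch: the identity key is derived from the filtered label set, so dropping high-churn labels such as pod-template-hash collapses many distinct label sets into a single identity. The filter contents and the key derivation are illustrative, not Cilium's exact implementation.

```go
// Sketch: derive an identity key only from security-relevant labels.
package main

import (
	"fmt"
	"sort"
	"strings"
)

// relevant reports whether a label key should contribute to the identity.
func relevant(key string) bool {
	switch key {
	case "pod-template-hash", "controller-revision-hash":
		return false // per-ReplicaSet/revision labels: excluded from identity
	}
	return true
}

// identityKey builds a canonical string from the security-relevant labels.
func identityKey(labels map[string]string) string {
	var parts []string
	for k, v := range labels {
		if relevant(k) {
			parts = append(parts, k+"="+v)
		}
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

func main() {
	a := map[string]string{"app": "api", "pod-template-hash": "6d4f9"}
	b := map[string]string{"app": "api", "pod-template-hash": "7c2e1"}
	// Without the filter these would be two identities; with it, one.
	fmt.Println(identityKey(a) == identityKey(b)) // true
}
```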
And now I'll bring it back to Marcel.

Okay, thank you. So we heard about quite a lot of different problems that we have with a single cluster, and now the question becomes: what else can you do if you are facing some of those issues? I would like to talk about Cluster Mesh. If you are interested in higher scale, what you might consider is Cluster Mesh, to connect multiple clusters. What is Cluster Mesh? In a nutshell, it provides pod-to-pod connectivity between clusters, and on top of that you get all the benefits of Cilium: network policy enforcement that works across clusters, transparent service discovery so you can easily share services between clusters, and, if you are interested in that topic, transparent encryption, for example.

Let's go through two different use cases to show you what's possible with Cluster Mesh. One example: say you have two clusters with backends deployed to both of them, and you misconfigured your backends in one of the clusters. What Cilium and Cluster Mesh can do is automatically redirect those connections to the other cluster in case of misconfiguration, so if you are interested in high availability, that might be an option for you. Another case: say you are managing a Vault service. Instead of deploying it to all of the clusters that you manage, you can have a single cluster where you manage this service and simply expose it to the other clusters, which can then use it.

We were thinking about the scalability of Cluster Mesh, and the initial architecture we developed for it was quite simple. The idea was that the agent and the operator just write their data, including Cilium nodes, identities, and endpoints, to the Kubernetes control plane. The Cluster Mesh control plane then takes all of that data and writes it down to etcd, and as you can see here, that etcd is exposed to the other clusters, so agents in the remote clusters watch for all those changes in nodes, identities, and endpoints in order to provide connectivity between clusters.

With that initial architecture we did scale testing. We were interested in a scale of 255 clusters, 50,000 nodes, a churn of around 50 nodes per second, endpoint propagation of up to 100 endpoints per second, and half a million pods in total. With that in mind we prepared our test bed, and we were interested in how much time it takes for the data to be propagated across clusters.
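Before the results, a quick back-of-the-envelope on the watch fan-out in this initial architecture helps explain the load on each cluster's etcd. The numbers are the rough targets from the test above; the per-resource watch count is an assumption for illustration.

```go
// Rough fan-out estimate for the initial Cluster Mesh architecture:
// every agent in every remote cluster watches each cluster's clustermesh etcd.
package main

import "fmt"

func main() {
	clusters := 255
	totalNodes := 50000
	nodesPerCluster := totalNodes / clusters // ~196

	// Watch clients hitting one cluster's etcd: roughly every node outside it.
	remoteClients := totalNodes - nodesPerCluster
	watchedResources := 3 // nodes, identities, endpoints (assumed one watch each)

	fmt.Printf("clients per etcd:  %d\n", remoteClients)                  // ~49,804
	fmt.Printf("watch streams:     %d\n", remoteClients*watchedResources) // ~149,412
}
```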
What we saw with this initial architecture is shown in the upper graph: as the number of nodes increases, you can see more and more spikes, which basically means that the data propagation between clusters was getting a little bit worse. What's even more concerning is that at some point, around 35,000 to 40,000 nodes in the whole cluster mesh, we started to see that the data was not being propagated at all. So we were wondering what was going on. We took a look at the etcd metrics, and what happened was that as we scaled up the number of nodes, the CPU usage kept increasing, and at some point it just skyrocketed, to something like 50 CPU cores. Similarly, memory usage was increasing, and when we looked at the number of watches opened against etcd, there was a sudden drop, which correlates strongly with the data propagation issue between clusters that we saw. What it means is that etcd at this scale was really struggling and was unable to handle the data propagation.

So we were thinking: with that in mind, what can we do to improve the architecture of Cluster Mesh and make it reliable at high scale? One more thing about the bottleneck: as I mentioned, all the remote nodes were watching a single poor etcd that was struggling with 50,000 clients. So we introduced one more component into the Cluster Mesh; we call it KVStoreMesh. The idea is that instead of having all of the nodes in the cluster mesh watch every remote etcd, we have another component in each cluster that replicates the data from the other clusters. That means nodes within a single cluster only watch the etcd in their own cluster. Going from the 50,000 clients we had in the previous architecture, we were able to reduce the number of clients to a couple of hundred, which is much more manageable and much easier for etcd to handle.

That was our assumption, but of course we wanted to run the scale tests and make sure our assumptions were correct. We did exactly the same test, and what we saw was that when there is very high churn of nodes, as you can see here at the beginning, the amount of data is quite high, so we were actually throttling the amount of data propagated across clusters. You can see that at the beginning, during the high node churn, the delay was around one second, but it was pretty stable. Then, once the node churn decreased and we were adding nodes more slowly, the data propagation delay was much lower, maybe tens of milliseconds. And of course, given our experience with the previous architecture, we looked at etcd again: CPU usage of etcd was totally flat the whole time, and similarly memory usage was not a couple of gigabytes but a couple of hundred megabytes.

Our title actually mentioned 100,000 nodes, and you might ask why I was showing only 50,000. The first improvement we are looking at for Cluster Mesh is that sometimes we see customers who want to scale even beyond 255 clusters, so we are adding support for 511 clusters in a cluster mesh. On top of that there are some other improvements I'd like to mention. We were targeting around 100 endpoints per second of propagation between clusters, and we are looking into increasing that so the latency of data propagation gets lower. And last but not least, we are looking into reducing the initialization time of the Cluster Mesh control plane. You can think of it like this: if you are doing an upgrade and need to upgrade the Cluster Mesh control plane, it needs to synchronize quite a lot of data, so we are also looking at minimizing the amount of time required for the initialization of the Cluster Mesh.

So yeah, thank you, that's all. Please provide feedback, we really appreciate it, and now we have time for questions.
Q: I have a question about Cluster Mesh. I believe it's using an annotation to mark a Cilium global service, the Cilium agent syncs the services from the other clusters, and then you check the services one by one to tell whether they have the annotation or not. My question is: why not just use a label instead of an annotation? If you use a label, you can get the Cilium services with a filter, right? You don't need to check them one by one.

A: Sorry, could you speak a bit louder? Maybe I will just jump over here, it seems the microphone here is better. Okay, so I think the question was: currently, the way global services work in Cluster Mesh is that you need to annotate the service, and then the service is propagated across the cluster mesh, and the question was whether, instead of propagating all of the data, the agents in the remote cluster could query the API server in the other cluster. I think this raises reliability issues. Imagine 50,000 nodes talking to a single Kubernetes API server; that would be even worse, as we saw with the single-cluster issues. It could potentially work up to 5,000 nodes, but once you go beyond that scale you cannot really rely on the Kubernetes control plane to propagate that data for you.

Q: Could it be a kind of lazy loading, where once you use the service it queries the Kubernetes API? And could it be all services instead of just the ones that are annotated?

A: We could theoretically transfer all of the services; the way it works right now is that you need to annotate the service you want to expose. But transferring everything is not really efficient, and also from a security perspective you want to limit the number of services that are exposed between clusters.

Q: When you talked about the Cilium operator doing the identity allocation, is that already available in 1.15 or not?

A: Not yet, but it should be available in 1.15, yes.

Q: First of all, thank you for the great presentation. I have a question for Dorda. We are using GKE, which uses Dataplane V2, and that is not exactly the same as Cilium. Have you tested that kind of scalability? I'm just curious how many nodes we can scale to with Dataplane V2 instead of Cilium.

A: Dataplane V2 on GKE supports up to 5,000 nodes as long as you have a regional cluster; there are issues if you have a zonal cluster, because then you have only one Kubernetes control plane VM. Again, scalability is not just the number of nodes: we support 5,000 nodes and 200,000 pods with exactly the limitations I presented here, and those tests were done on GKE. So if you want to run network policies on GKE, they are powered by Cilium, because Dataplane V2 uses Cilium.

Q: Okay, thanks. So from my understanding it's the same limits and the same mechanism as native Cilium?

A: Yes, exactly; the numbers I presented are the limits for GKE right now.
Q: A kind of related question about Dataplane V2: it doesn't support Cilium network policies, so we need to use the standard network policies instead. Does that cause any issues?

A: Yes, that's a good question. I didn't mention that the tests here were done with Kubernetes network policies, so the standard objects and not the custom resources. Some Cilium network policies are not affected and would work, but there are cases where they don't, some of which are already fixed, and Dataplane V2 is looking to support more of the other network policies as well, but that's still in progress. It's mainly about using them at scale; in general it's not supported, because on GKE we are resolving some of the network policy related use cases in other ways. So that's it.

Q: Sorry, I just remembered another question, about the Cilium operator. Is that going to work with the kvstore identity allocation mode? Because as far as I know the Cilium operator is not connected to etcd, it's only...

A: No, it's not using the kvstore, it's using CRDs.

Q: Yeah, I know, but so it's not going to work with kvstore mode?

A: It can, and it will. Right now we don't have a clear path for it; it is going to be CRD-first, and if there is a need for it, we'll have it.

Q: Okay, thanks. My question is around identities. You said something in your slides about 65,000 identities. How does that correlate to CIDRs, if you're talking outside the cluster to a legacy environment, like VMs or something, where you can't rely on identity?

A: You mean in terms of cluster mesh, for example, or just a single cluster? This is basically a per-cluster limitation, so if you are using cluster mesh, the limitation applies to each single cluster separately; you can have more identities than that across a cluster mesh.

Q: And does 65,000 equal a CIDR limit? Like, how many CIDRs could I put in a BPF map, or in total identities?

A: I can answer that. One eBPF policy map can have only 16,000 entries, so you cannot specify more than 16,000 different CIDR ranges. One CIDR range really equals one local identity, but that's not the same as a Cilium identity, which is global, because every node programs CIDR ranges as identities locally and doesn't share them with the others; every node can do it on its own and doesn't need to share that data. So the limit is still 16,000 different CIDR ranges.

Q: And what if I had to go over that? Would that be possible?

A: It's possible to increase the map size. The reason we are not looking in that direction right now is that there is a significant memory increase, because the maps are fixed size: every time you create a new pod, a new policy map is created for it, and with a fixed size of 16,000 entries a certain amount of memory has to be allocated. So it's possible, it's a trade-off, you can do it. We are looking instead in the direction of reducing the number of identities, so that it just works and is as optimized as possible, meaning you don't have to spend additional per-pod memory on the node; because even if you don't populate these maps, that memory is still allocated and not released.

Great talk, thanks. Thank you.