I hope everyone is enjoying their day so far here in Chicago. So again, my name is Ryan Drew, and we are indeed talking about KVStoreMesh and scalability. Before we get started, I do want to give a huge shout-out to these folks here. The project I'm going to talk about in these slides was a large group effort that we all worked on; I'm just the one presenting it, so I want to give recognition where it's due. Just to break the ice a little bit: you can find me on GitHub and on the community Slack as Learn-It-All. I've been a performance and scale engineer at Isovalent for just over a year and a half now, I'm based out in Colorado, and this is a picture of my cat, just for fun.

So let's dig into some background information. I'm going to assume that we all know what Cilium is, so I'm just going to talk about cluster mesh. Cluster mesh essentially allows you to have intelligent cross-cluster communication. If you have multiple clusters set up in your environment that are trying to talk to each other, without cluster mesh those clusters are going to view each other as external, or world, identity. When you enable cluster mesh, these clusters become aware of each other, and so you can do cool things like have a load balancer route traffic to a backend in another cluster, or write a really intelligent policy that references labels applied to pods in a separate cluster.

But one of the biggest things about cluster mesh is scalability, because cluster mesh allows you to scale your environment beyond single-cluster maximum scale recommendations. If we look at the scale recommendations provided by Kubernetes, we see 150,000 total pods and 5,000 nodes per cluster as the maximum. But cluster mesh allows you to connect up to 255 clusters, so theoretically you could have 38 million pods and 1.2 million nodes in your environment, all connected together. And again, this is very theoretical, and we'll talk about why in a second.

KVStoreMesh is a feature that's part of cluster mesh and is beta in 1.14. It's deployed alongside what's called the ClusterMesh API server, and the ClusterMesh API server is the main component deployed with cluster mesh that enables that cross-cluster functionality. There is also a Cilium feature called kvstore, but that has nothing to do with KVStoreMesh. So it's a little confusing, but bear with it.

In this talk, we're going to cover the ClusterMesh API server propagation latency problem, we'll talk about some testing that we did in order to explore this problem, and we'll talk about how KVStoreMesh allows for higher scale. Our big thesis statement for the day is that KVStoreMesh allows for higher-scale cluster mesh deployments by reducing the load on each cluster's ClusterMesh API server etcd instance. Oh, and if you can tell me how many times I said the word "cluster" in this talk, I'll give you a keychain. It's going to be a lot.

So let's get into this latency problem. If we look at how the ClusterMesh API server works under the hood, it looks something like this. In order for Cilium agents to be able to intelligently talk to each other across clusters, there are four resources that need to be synced across those clusters: services, Cilium nodes, Cilium identities, and Cilium endpoints. The agent and the operator inside of each cluster are going to modify these resources in the Kubernetes API server.
The ClusterMesh API server watches for these resources and essentially clones them into its own etcd instance. That state is then made available to remote Cilium agents in other clusters, which pull these resources in and do that syncing. And as they get these resources, they plumb the data path to allow the expected connectivity.

But this process raises a key question, because this state propagation has to occur before your connectivity is enabled. So what happens before that? Before that, you're not going to get that connectivity. For instance, if you have a remote client pod trying to connect to a database in another cluster, that traffic is going to be thrown right in the trash by the data path, because the data path isn't going to recognize that workload. So there are two key questions we wanted to explore: how long does it take for this propagation to occur, and can this propagation ever fail at a high enough scale?

So let's get into how we looked at this. We created a workflow for testing where we focused on the number of nodes in the mesh as our primary variable. We created 255 clusters and meshed them all together, and then we created a continuous workload that consisted of two parts. The first was a no-op workload, essentially 10 pods per cluster; this was to get some identities and some endpoints into the ClusterMesh API server. Then we deployed some benchmark tooling, which I'll talk about on the next slide, in order to add load to the ClusterMesh API server. As these were running, we slowly increased the number of nodes in increments until we got to 50,000.

The ping test was the benchmark we wrote for this, and the goal was to create a propagation latency heuristic that we could watch throughout the test in order to observe how the mesh behaved as the number of nodes increased. It all centers around this ping server here. The ping server in cluster B starts by creating a Cilium network policy that allows ingress traffic to it from a new random label. This label is generated randomly on every iteration of the test, and that's really important. Then the ping server creates what's called the ping client inside of cluster A by contacting that cluster's API server directly. The goal of this ping client is to ping the ping server as fast as it possibly can on startup. During this process, there's a kubelet in cluster A that picks up the request for this ping client and passes a CNI add event on to the Cilium agent, which, again, is Cilium's primary responsibility as a CNI plugin. We have a CNI listener pod running in cluster A that tails the logs of the Cilium agent in order to estimate how long the CNI add took, because we don't currently have a metric that exposes this. That CNI add duration is then sent to the ping server so it can be exposed and recorded.

So during this process, we're recording three things. We're recording the CNI add duration, but the ping server also records the time that it sent the request to create the ping client, and it records the time that it received a ping from the ping client for the first time. By subtracting those two from each other, we get a heuristic of how long the propagation took to get the ping client's information over to cluster B so the data path could be plumbed properly.
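To make that heuristic concrete, here is a minimal sketch in Go of the measurement the ping server is performing. The type and field names are illustrative assumptions of mine, not the actual benchmark code.

```go
// Rough sketch of the propagation heuristic described above.
package main

import (
	"fmt"
	"time"
)

// PingResult captures the three measurements the benchmark records.
type PingResult struct {
	ClientCreateRequested time.Time     // when the ping server asked cluster A's API server to create the ping client
	FirstPingReceived     time.Time     // when the ping server first received a ping from that client
	CNIAddDuration        time.Duration // reported by the CNI listener pod tailing the Cilium agent logs
}

// PropagationHeuristic is the headline number: everything that happened
// between requesting the ping client and the first successful cross-cluster
// ping, including API server processing, scheduling, CNI add, and the
// ClusterMesh sync over to cluster B.
func (r PingResult) PropagationHeuristic() time.Duration {
	return r.FirstPingReceived.Sub(r.ClientCreateRequested)
}

func main() {
	now := time.Now()
	r := PingResult{
		ClientCreateRequested: now,
		FirstPingReceived:     now.Add(1200 * time.Millisecond),
		CNIAddDuration:        300 * time.Millisecond,
	}
	fmt.Printf("propagation heuristic: %v (CNI add: %v)\n",
		r.PropagationHeuristic(), r.CNIAddDuration)
}
```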
And there's a lot that happens during that time period, because this duration covers how long it took to send the request to create the ping client over the network, how long it took the API server in cluster A to process that request, and all of those kinds of things. In bold are the pieces that Cilium is directly responsible for, which is what we were really interested in. But at the same time, if any of these stages increase, that's something important that we want to look at. So although the heuristic isn't super detailed, it did give us a good jumping-off point for really getting into things. We also measured a couple of other things to help us understand what was going on, such as CPU and memory usage. There were a couple of Cilium agent metrics that we really focused on in terms of policy, and we also focused on ClusterMesh API server metrics related to etcd.

So let's talk about the test environment. This is one of my favorite parts of this project. We could have created 255 clusters in the cloud and scaled them all up until we had 50,000 nodes in total, but trying to do that in the cloud would be like trying to buy tickets to a Taylor Swift concert: it's not going to happen. So we had to come up with a creative way to reduce the amount of resources it would take to run this test, and we ended up with this architecture per cluster. Each cluster consisted of two nodes, a control plane node and a worker node. The control plane node ran the ClusterMesh API server, and the worker node was responsible for running what we called hollow nodes, which consisted of two parts: kubemark and cilium-mock. Kubemark, if you're not familiar, allows you to run what's called a hollow kubelet, where you have a kubelet that's running but all the lower-level implementation details are hollowed out. The kubelet talks to the API server as if it's starting containers and mounting volumes, but it doesn't actually perform those actions. That allows you to run 10 hollow kubelets for every one CPU core on your node, which was great scalability for us. But because a hollow kubelet can't run Cilium agents, we also needed a tool to mock the load that Cilium puts on the API server and the ClusterMesh API servers. So we developed a tool called cilium-mock to do this, and that applies the load so that we were actually testing something.

Now, in order to run the ping test benchmark we talked about earlier, we had two special clusters in the mesh that were dedicated to running these benchmarks. They differ in two ways. First, they deploy Prometheus in order to gather metrics, but they also have a larger worker node size, because we didn't want resource restrictions to limit the benchmark in any way, especially since we're adding Prometheus on top. The reason we only ran the ping test benchmark between two clusters rather than all of them is that we're assuming every cluster in the mesh has a similar view of and experience inside the mesh, so we didn't think there would be much value in testing this between every single pair of clusters, just two. And that gave us another optimization as well.
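As a quick aside on the hollow-node sizing just described, here's some back-of-the-envelope arithmetic using only the numbers from the talk (255 clusters, roughly 50,000 nodes total, 10 hollow kubelets per core). The resulting per-worker core count is my own derivation, not a figure from the slides, and it ignores whatever cilium-mock adds on top.

```go
// Back-of-the-envelope sizing for the hollow-node test environment.
package main

import "fmt"

func main() {
	const (
		totalHollowNodes      = 50000 // target mesh size from the test
		clusters              = 255   // clusters in the mesh
		hollowKubeletsPerCore = 10    // kubemark hollow kubelets per CPU core
	)

	// Hollow nodes that each cluster's single worker node has to host.
	hollowNodesPerCluster := totalHollowNodes / clusters // ≈196

	// Minimum CPU cores per worker, rounding up, kubemark only.
	coresPerWorker := (hollowNodesPerCluster + hollowKubeletsPerCore - 1) / hollowKubeletsPerCore // ≈20

	fmt.Printf("hollow nodes per cluster: %d\n", hollowNodesPerCluster)
	fmt.Printf("min CPU cores per worker (kubemark only): %d\n", coresPerWorker)
}
```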
If we draw the connections between these two clusters that are running the tests, we can see the communication paths. In green are the benchmark communication paths; in blue is cilium-mock talking to its own API server to simulate the load the agent would put on it. And in red and yellow are the connections made cross-cluster to each ClusterMesh API server. If we add an additional cluster, there we go, these are the new connections that are made to each ClusterMesh API server. Specifically, three new connections are added, because we have three new nodes that are talking to the ClusterMesh API servers in clusters one and two. So now each ClusterMesh API server is supporting connections from six different clients, and that's just from adding three more nodes. Things scale up pretty quickly here.

So let's get into the results. These are the number of nodes that we had over time during our test. We got to just above 50,000; that was how the math worked out for scaling this up. And I won't make you squint: this took about three hours. Our CNI add duration wasn't too interesting. It had some spikes here and there, well, not here and there, pretty much everywhere, but it had a slow linear upward trend and it felt pretty normal, so we just moved on. The same was true for policy implementation delay and policy regeneration time. If you don't know, policy implementation delay measures the amount of time it takes for a Cilium network policy to be plumbed into the data path after it's first received by a Cilium agent. Policy regeneration time measures the amount of time Cilium takes to recalculate all the policies on the node. Again, these looked pretty normal, so we just moved on.

The interesting part is the heuristic, which had a lot of really interesting spikes. The y-axis here is logarithmic in scale, so we had a range of one millisecond to 30 seconds. We tried to correlate these to churn inside of the cluster mesh. The bottom graph shows the rate at which we're adding nodes to the mesh, and there's a rough correlation between those spikes and the spikes in the heuristic. But one really interesting part was that this plateau didn't have a correlating spike, and we were really curious what was going on there. It turns out we just had a complete failure of our benchmark at 45,000 nodes. For some reason it just stopped working for a couple of minutes, and that led to that weird plateau.

Trying to figure out what was going on, we traced it back to the ClusterMesh API server's etcd. These two graphs show the resource usage of etcd: the bottom graph shows CPU, and the top graph shows the watches on etcd over time. The green line shows the ClusterMesh API server that was running alongside the ping client; it was responsible for propagating the Cilium identity and Cilium endpoint of that ping client to the other cluster in order to enable that connectivity. The yellow line represents the ClusterMesh API server that was running in a baseline cluster that didn't have to handle this additional load. You can see that right around where the benchmark drops out, we get to around 50 cores of CPU usage on etcd in that loaded ClusterMesh API server, and at the same time the number of watches on etcd drops by half. So we're assuming that CPU saturation led to those dropped watches, and one of those watches was critical for our benchmark, which caused it to fail. And correlating the CPU usage to our heuristic, those line up pretty well.

So revisiting our problem questions from earlier: how long does it take for this propagation to occur?
Well, again, this is just our heuristic, so calling it the propagation latency for the ClusterMesh API server isn't entirely accurate, but again: one millisecond to 30 seconds. I think the key thing to take away is that this range is pretty wide. And yes, this propagation can fail if etcd becomes critically impacted.

So how does KVStoreMesh address these? Again, KVStoreMesh allows for higher-scale cluster mesh deployments by reducing that load on etcd. If we look at the consumer model for the ClusterMesh API server, every Cilium agent in the mesh has to connect to the ClusterMesh API server in every other cluster in order to do this sync. KVStoreMesh deploys a binary alongside the ClusterMesh API server that syncs information from the other clusters into its own ClusterMesh API server, and the agents running inside of that cluster connect to their own ClusterMesh API server instead to sync that information. So we're adding this intermediary cache. And if we look at the number of clients each ClusterMesh API server has to support between these two implementations, you can see the difference in scalability pretty clearly. The ClusterMesh API server without KVStoreMesh has to support a number of clients equivalent to the number of nodes in the mesh, whereas the KVStoreMesh implementation supports the number of clusters in the mesh minus one as clients from outside the cluster, plus the number of nodes per cluster as clients coming from inside the cluster. Using numbers from our tests earlier, with the number of nodes per cluster rounded up to a clean 200, you can see that in the plain ClusterMesh API server case each one was supporting about 50K clients, whereas with KVStoreMesh we would only be supporting about 454 (a small worked sketch of this arithmetic appears at the end of this transcript).

So, key takeaways. If there's anything to take away from this, this is it. KVStoreMesh allows you to reduce the overall etcd CPU and memory usage by spreading the load proportionally throughout the mesh. Every cluster is now responsible for the number of nodes inside its own cluster, rather than being responsible for handling load from the entire mesh. This allows you to reach greater scale, but it also reduces the mesh-wide impact of churn caused by a single cluster, because we get some isolation. Again, in the plain ClusterMesh API server case, let's say cluster A is scheduled for a rolling upgrade of Cilium, and we have Cilium agents restarting. As the Cilium agents restart, they're going to do list calls against the remote ClusterMesh API servers to do an initial sync, which adds load and could impact availability for cluster B. In the KVStoreMesh case, those agents instead contact their own ClusterMesh API server, so cluster B remains unaffected because we have this isolation in place. In the worst-case scenario, if the ClusterMesh API server in cluster A restarts, that's only one client doing a reconnect to every other ClusterMesh API server, which is a lot more manageable.

And that's it, so thank you very much. Yeah, any questions? Oh, and please feel free to fill out the survey if you have feedback; it'd be super helpful.

Cool, I have two questions. First one: would you not recommend running KVStoreMesh? Like, if I run cluster mesh, why would I not run KVStoreMesh? Is there a reason why you would not want to do that by default?

So that's a good question. I'd say right now, probably because it's beta.
That'd be the main thing. The other thing is that with KVStoreMesh you're going to see the greatest results at higher scale. If you're running in lower-scale situations, you might actually see increased load from KVStoreMesh.

Yeah, okay. And the second one: do you have any idea if it can work if I use identity allocation in kvstore mode, so I run etcds that form the cluster mesh and I don't have the ClusterMesh API server at all?

So that's a good question. If you're using kvstore mode, you could use a similar implementation with KVStoreMesh. I just don't think we support it right now, but it should theoretically be possible. We are thinking about it.

Oh, cool, okay. Thanks. Do you have any thoughts on how to protect against a lot of churn activity from one cluster? How do you protect the rest of the clusters from being impacted, like a shared-fate kind of thing?

Yeah, so I think KVStoreMesh is the best way to do that in a cluster mesh scenario. If you have a single cluster that's going to put high churn into your mesh, which typically happens, right, because you usually have maybe two clusters out of the whole mesh that have the most churn and are going to put the most load on the other ClusterMesh API servers. When you have KVStoreMesh enabled, that extra load is only applied to the ClusterMesh API server inside its own cluster, rather than throughout the entire mesh. And so that would be what I would recommend.
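As a footnote to the client-count comparison from the talk, here is a small worked sketch using the rounded figures mentioned there (255 clusters, about 200 nodes per cluster). The exact totals are my own arithmetic, not numbers from the slides.

```go
// Worked example: clients per ClusterMesh API server with and without KVStoreMesh.
package main

import "fmt"

func main() {
	const (
		clusters        = 255
		nodesPerCluster = 200
	)

	// Without KVStoreMesh: every agent in every *other* cluster connects to
	// this cluster's ClusterMesh API server, i.e. roughly the number of
	// nodes in the mesh.
	withoutKVStoreMesh := (clusters - 1) * nodesPerCluster

	// With KVStoreMesh: one KVStoreMesh client per remote cluster, plus the
	// local cluster's own nodes.
	withKVStoreMesh := (clusters - 1) + nodesPerCluster

	fmt.Printf("clients without KVStoreMesh: %d\n", withoutKVStoreMesh)
	fmt.Printf("clients with KVStoreMesh:    %d\n", withKVStoreMesh)
}
```

This prints roughly 50,800 versus 454, which lines up with the ~50K and 454 figures quoted in the talk.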