Hello, everyone. Thanks for joining us. Today we'll be talking about CRDs versus dedicated etcd: lessons from taming high-churn clusters. My name's Hemant. I'm an engineer at Datadog, and I'm also a maintainer on Cilium, a CNCF project. And this is Marcel. He works on Cilium at Isovalent, and he's also active in Kubernetes SIG Scalability.

So, CRD versus KV store. If you're building a Kubernetes-native application or a controller, and you have some state that you want to store, you have a decision to make: do you use CRDs to store that data, or do you use a dedicated KV store? In addition to the total volume of data you want to store, your data access patterns should also influence that decision. In this talk, we'll get into some of the decisions Cilium had to make. We'll quickly discuss some aspects of Kubernetes scalability, how Kubernetes scalability impacts different features in Cilium, the current state and the options that exist in Cilium today to help you, what the road ahead looks like, and some of the lessons we learned along the way. With that, I'll let Marcel talk about Kubernetes scalability.

So let's start talking about Kubernetes scalability. Usually when you think about Kubernetes scalability, what you have in mind is the number of nodes, right? Well, wrong. Scalability is not just the number of nodes. When I think about the scalability of Kubernetes, it's important to think about multiple dimensions. The number of nodes is just one dimension you should care about, and the next thing that usually comes to mind is the number of pods. But that's not all of it either. So the question becomes: what are the other dimensions we should care about when operating a Kubernetes cluster at scale? We'll now look at two Kubernetes features that probably all of you are familiar with and think about how they affect the scalability of Kubernetes.

Let's start with Kubernetes Services. The idea is quite simple. You define a Service, the Service gets a virtual IP (a cluster IP) assigned, and then you have a bunch of backend pods. From the client's perspective, when you want to connect to the service, you initiate a connection to this virtual IP. But you need some kind of proxy, usually kube-proxy, that understands the concept of Services. The proxy is responsible for translating that virtual IP to one of your backend pods and initiating the connection for you.

So now let's do a thought experiment: if you were going to implement kube-proxy, how would you do it? First of all, you have the Service object and the Endpoints object. The Endpoints object is essentially a list of the IPs of all the backend pods behind the Service. And of course, you have a bunch of Services, a bunch of Endpoints, and the corresponding backend pods. All of these objects live in the kube-apiserver, so kube-proxy reads them from the kube-apiserver. The proxy runs on a node and needs to know about all the Services and all the Endpoints, so it initiates a watch request against the kube-apiserver to get all the updates to Services and Endpoints. But of course, in a Kubernetes cluster you have a bunch of nodes, and each node needs to run this proxy.
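As a rough sketch of that naive per-node proxy, here is what the watch side might look like using client-go informers. This is a minimal sketch, not how kube-proxy is actually implemented; the "reprogram" handlers are hypothetical placeholders for dataplane updates.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Every node's proxy opens watches like these, so every Service or
	// Endpoints update fans out to every node in the cluster.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)

	factory.Core().V1().Services().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, obj interface{}) {
			svc := obj.(*corev1.Service)
			fmt.Println("reprogram VIP for", svc.Namespace+"/"+svc.Name)
		},
	})
	factory.Core().V1().Endpoints().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, obj interface{}) {
			eps := obj.(*corev1.Endpoints)
			fmt.Println("reprogram backends for", eps.Namespace+"/"+eps.Name)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop // keep watching until shutdown
}
```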
So what happens in this case is that whenever a Service or Endpoints object is updated, the kube-apiserver needs to broadcast that information to all of the proxies running on your nodes. And there is one more component. Even though you think about services in terms of the backend pods behind them, the information that gets exposed is really only the IPs behind the service. So there is an endpoints reconciler that understands which backend pods are behind the service, and it publishes that IP information to an object called Endpoints.

So what's the issue with this design? Whenever you create a new pod, and that pod is behind a service, the endpoints reconciler needs to update the Endpoints object. And you can have one service with thousands of pods behind it. A single pod creation triggers an update to an Endpoints object that holds thousands of IPs, and this huge object needs to be broadcast to all of the nodes. That is definitely inefficient.

That's why Kubernetes introduced EndpointSlices. Instead of a one-to-one mapping between a Service and Endpoints, you have a one-to-many mapping, where each EndpointSlice holds a fixed number of IPs within a single object, up to 100. Then, if you create a pod, you don't need to send out all the IPs behind the service; you only need to send the update for the single EndpointSlice that changed, with its fixed number of IPs.

So now let's take a look at network policy. In a sense, the idea behind network policies is quite simple. You have some frontend pods, some backend pods, and some database pods. You want the frontend pods to be able to connect to the backend pods, and the backend pods to connect to the database pods. But by default in Kubernetes, each pod can talk to any other pod, so it's possible for a frontend pod to connect to another frontend pod, or even talk directly to the database. What you want to do is restrict those kinds of connections, and that's the purpose of network policies.

Let's look at an example network policy and how it describes which connections are allowed. First, you start by saying: this network policy applies to those specific pods. Even though we were thinking in terms of deployments talking to each other, here we specify labels. The policy applies to all pods that have a specific label. Then we specify whether it covers ingress or egress traffic, and which pods our frontend can connect to. In this example you can specify both a namespace selector and a pod selector. That means: my frontend pod is able to connect to all pods that live within a namespace carrying a given label, and those pods also need to carry a specific label themselves. This is quite a bit more powerful than just saying which deployments in the cluster can talk to each other. (A concrete sketch of such a policy follows below.)

So now let's think about it from the perspective of this kind of proxy: how would you go about implementing such a feature and actually restricting the connections you're interested in? Again, say we have some kind of proxy running on the node, and as we said, we need to be aware of the labels of the pods.
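To make that example concrete, here is roughly how such a policy would be constructed with the Kubernetes networking/v1 Go types. The policy shape matches what was just described; all names and label values are made up for illustration.

```go
package sketch

import (
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// examplePolicy selects the frontend pods by label and allows egress only
// to pods with a given label that live in namespaces with a given label.
var examplePolicy = &networkingv1.NetworkPolicy{
	ObjectMeta: metav1.ObjectMeta{Name: "frontend-egress", Namespace: "shop"},
	Spec: networkingv1.NetworkPolicySpec{
		// Applies to every pod carrying this label, regardless of which
		// deployment created it.
		PodSelector: metav1.LabelSelector{
			MatchLabels: map[string]string{"app": "frontend"},
		},
		PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeEgress},
		Egress: []networkingv1.NetworkPolicyEgressRule{{
			To: []networkingv1.NetworkPolicyPeer{{
				// Both selectors in one peer: the destination pod must be
				// in a namespace labeled env=prod AND itself be labeled
				// app=backend.
				NamespaceSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"env": "prod"},
				},
				PodSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": "backend"},
				},
			}},
		}},
	},
}
```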
So the obvious thing would be to just watch for all the pods, right? Get all the pod updates, so we're aware of the labels of those pods. But the issue is that pods are updated quite often, and there is no easy way to filter which updates we receive. Think about the kubelet updating the status of a pod: as a proxy, we would receive all of those updates even though we don't really care about them. We only care about changes to the labels in order to enforce network policies. Then, of course, network policies also reference labels from namespaces, so we need to be aware of namespace labels and watch those as well. And last but not least, we need to watch the network policies themselves if we're going to enforce them. And again, if this proxy, or whatever it is, runs on all of the nodes, then each update gets propagated to every node. This is the naive implementation you might come up with, and we can see that with high pod churn, all of that information gets propagated to all of the nodes.

We went through those two examples to think about what other dimensions we should care about. First of all, there is obviously the number of nodes and the number of pods; those are important. But then we should also think about pod churn. We saw that in both cases pod churn really matters for how many events are broadcast within your cluster. Then the number of services, the churn of the backends behind those services, the number of namespaces and the namespace churn (we still need those labels for enforcing network policies), and the number of network policies and their churn. So to recap: the scalability of Kubernetes is not just the number of nodes. Just based on those two features, we can see that it's quite a multidimensional problem.

With that in mind, let's think about the target scale we would like to support if we were implementing this kind of proxy. The target scale is based on official recommendations from Kubernetes. You probably know that Kubernetes supports up to 5,000 nodes. A less well-known recommendation is 10,000 services. Then 150,000 pods, which means on average you'd have 30 pods per node. Then pod churn of around 100 pod changes per second; these can be creations, deletions, or just changes to pods. For network policies, this is a target we actually came up with ourselves from the Cilium perspective. It's not an official recommendation, but we needed to establish some target scale we're interested in, along with network policy churn of around 20 per second.

Going back: Hemant mentioned before that we both work on Cilium. Cilium is eBPF-based networking, observability, and security, and one of the open source projects we have is the Cilium CNI. As part of the CNI, we need to implement the two features we were just discussing: kube-proxy replacement (instead of using kube-proxy, you can use the CNI to implement those things) and network policies. I mentioned before that the naive implementation would be to just watch for all the pod changes and all the namespace changes. But that's not how we implemented it in Cilium.
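For contrast, here is a minimal sketch of what the naive approach forces you to do: receive every pod update on every node, then throw most of them away. It assumes the client-go informer machinery from the earlier sketch; recomputePolicyFor is a hypothetical hook into a policy engine.

```go
package proxy

import (
	"fmt"
	"reflect"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// recomputePolicyFor is a hypothetical hook into the policy engine.
func recomputePolicyFor(pod *corev1.Pod) {
	fmt.Println("recompute policy for", pod.Namespace+"/"+pod.Name)
}

// labelAwareHandler drops the very frequent status-only pod updates.
// Note that each update was still delivered over the watch to every node
// before we discard it here; that delivery is the real scalability cost.
func labelAwareHandler() cache.ResourceEventHandlerFuncs {
	return cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldPod := oldObj.(*corev1.Pod)
			newPod := newObj.(*corev1.Pod)
			if reflect.DeepEqual(oldPod.Labels, newPod.Labels) {
				return // e.g. a kubelet status write: irrelevant for policy
			}
			recomputePolicyFor(newPod)
		},
	}
}
```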
And the reason is that we are not interested in the updates the kubelet makes to the statuses of the pods. So how we went about this is that we created two CRDs. The first one is CiliumIdentity. Instead of watching for changes to pods and namespaces, we create a CiliumIdentity, which is essentially a summary of all the labels. That means whenever a pod's status changes, for example because of the kubelet, we don't need to know about it, because we only care about the labels of pods and namespaces. And then we also have CiliumEndpoint. Those two things are enough to actually enforce network policies. The idea is that you have a single CiliumEndpoint per pod running in your cluster, and it's the mapping between the pod and the identity. So we know that this pod has this particular identity, and through the identity we know all the labels. With that design, we only need the network policies themselves in order to enforce them, and the Cilium agent watches for changes to network policies, CiliumEndpoints, and CiliumIdentities. This is a much better design compared to the naive solution, where we would be receiving every pod update in the cluster.

Now let's take a step back. I was describing the default mode in which Cilium runs, which is the CRD mode. But that was not always the case. Kubernetes was released in 2015, and Cilium's initial release was also in 2015. But CRDs were only introduced in 2017. I was talking about those two CRDs, so the question is: how did Cilium actually work if CRDs did not exist at that time? The answer is that etcd was used as the control plane for state management between Cilium agents. And please don't quote me on these dates; they were taken from Wikipedia, so they might be wrong. For example, if you look at the releases of Kubernetes, you might see that 1.0 still appears to be supported because its end-of-life date is empty, so I don't know. To sum up, we have two different ways of running Cilium, one with etcd and one with CRDs. And now Hemant will talk a little more about the differences and the pros and cons of each.

Thanks, Marcel. I'm sure someone's going to make a Wikipedia edit now. All right. So we know that Cilium has two identity allocation modes: the CRD identity allocation mode and the KV store identity allocation mode. If you take a closer look at the Cilium documentation, you'll see that the CRD mode is currently recommended only up to 1,000 nodes, and if you want to go beyond that, up to 5,000 nodes, it's recommended that you use the KV store identity allocation mode with etcd.

Let's zoom out a little and look at what the architecture looks like right now. You have your Kubernetes control plane, and Cilium primarily has two components: the Cilium operator, which runs as a Deployment in your cluster, and the Cilium agent, which runs as a DaemonSet on every single node in your cluster. There are other components, like the Cluster Mesh API server, depending on whether you're using Cluster Mesh, but we won't talk about that in this session. Depending on how you manage your Kubernetes clusters, you might be running on cloud providers using managed Kubernetes services. If you are, you generally don't have to worry about your Kubernetes control plane, but that control plane is itself backed by its own etcd cluster.
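As a refresher on what those two CRDs carry, here is a heavily simplified sketch. These are illustrative structs only, not the real definitions, which live in the cilium/cilium repository and carry many more fields.

```go
package sketch

// CiliumIdentity summarizes one distinct set of security-relevant labels
// (pod labels plus selected namespace labels). All pods sharing that label
// set share the identity, so kubelet status updates never touch it.
type CiliumIdentity struct {
	Name           string            // the numeric identity, e.g. "53986"
	SecurityLabels map[string]string // the labels policies match against
}

// CiliumEndpoint exists once per pod and maps the pod to its identity,
// which is all an agent needs in order to enforce policy on traffic
// to and from that pod.
type CiliumEndpoint struct {
	Namespace  string
	Name       string   // same name as the pod
	IdentityID int64    // reference to a CiliumIdentity
	IPs        []string // the pod's addresses
}
```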
So if you are using the CRD mode, Cilium creates CRD objects so that the operator and the agents can talk to each other and maintain state. But if you want to use the KV store mode, you will have to manage your own etcd cluster; there is no way around that. You'll have to manage the entire lifecycle: provisioning, maintenance, and all of that. And when you use the KV store mode with etcd, Cilium endpoints and identities are now written to your etcd cluster.

So what are the challenges of running your own etcd cluster? What does day two look like with etcd? For starters, you'll have to worry about taking periodic backups, and of course you also need to know how to restore from those backups. You'll also have to run something called compaction. etcd keeps the history of every key it stores, so every once in a while you have to run compaction to save disk space. Luckily, there are a few options built into etcd: you can configure auto-compaction and etcd will take care of it. But every time these compaction operations run, they might leave your disk space fragmented, and if that happens, you'll have to run defragmentation. That does not happen automatically, and you need to do it on a per-node basis. Another thing to note about defragmentation is that it's a stop-the-world operation: while it runs, that etcd member stops responding and does not accept any read or write requests. And you'll have to run it on every single node in your etcd cluster, so you'll need something that coordinates those defragmentation operations.

etcd clusters also have a default space quota. If you fail to do these defragmentation operations, or something else happens and you end up filling your quota, etcd goes into an alarm state, and in that alarm state it stops accepting write requests. That's bad as well. And on top of that, you'll also have to worry about periodic updates, security patches, and things like that. The crux of the problem is that there is no official tooling to take care of all of this. You either have to write your own tooling or use unsupported, unofficial things that exist on GitHub.

And let's say you put some scripts together and manage the lifecycle of your etcd clusters. That's not enough either. etcd clusters are also very sensitive to disk and network latency. Depending on how active and how big your Kubernetes clusters are, sometimes your disks or your network might not be fast enough. Luckily, etcd already has some telemetry for this, so if you see logs like these, your etcd instances might need an upgrade.

And it doesn't stop there. You'll have to make sure that all the clients connecting to your etcd cluster are well behaved. Say you do an etcd rollout: every time you do that, there is a chance your clients end up skewed onto one of the etcd nodes. That's not great for your cluster's health either; it makes the node with the most connections more prone to getting OOM-killed. The API server actually has a solution for the analogous problem: there's a --goaway-chance flag you can set on the API server, and if you do, the API server will make sure that clients rebalance automatically.
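To make those compaction and defragmentation chores concrete, here is a minimal sketch using the official etcd v3 Go client (go.etcd.io/etcd/client/v3). Error handling, scheduling, and coordination between members are simplified, and the endpoints are placeholders.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoints := []string{"https://etcd-1:2379", "https://etcd-2:2379", "https://etcd-3:2379"}
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Compact the key history up to the current revision so old versions
	// of keys can be garbage-collected (auto-compaction can do this too).
	status, err := cli.Status(ctx, endpoints[0])
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Compact(ctx, status.Header.Revision); err != nil {
		log.Fatal(err)
	}

	// Defragmentation is per member and stop-the-world for that member,
	// so run it one endpoint at a time to keep quorum serving traffic.
	for _, ep := range endpoints {
		if _, err := cli.Defragment(ctx, ep); err != nil {
			log.Fatalf("defragment %s: %v", ep, err)
		}
	}
}
```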
There's no equivalent on the etcd side, though: if you find yourself in that situation with the Cilium agents, you'll have to do a Cilium agent restart rollout to make sure they balance out again.

And let's say your clusters are getting bigger and you want more read throughput. Generally, a three-node configuration is the most common one folks run, and it allows you to tolerate the loss of one node. If you want to tolerate more than one node loss, you might run five-node clusters. But here's the problem with trying to scale reads that way: etcd is a quorum-based system, so every time you post a write to the cluster, a majority of the nodes have to acknowledge it. In an attempt to scale your read performance, you'll actually negatively impact your write performance. So simply scaling horizontally is not the solution either.

On top of that, etcd does not have a native rate limiting system in place. If your clients misbehave for some reason, or you're running into a bug, you could overwhelm your etcd nodes, which can eventually lead to them being OOM-killed. You can also enforce client-side rate limiting, and Cilium already does that, but that's not a perfect solution either, because you cannot really coordinate all the clients in one place.

What I'm getting at with all this is that running your own etcd clusters involves a non-trivial amount of maintenance. If you want to learn more about the quirks of etcd maintenance, there are already some great KubeCon talks out there; do check them out. You know why those folks are talking about it: they're running managed services. And luckily, we actually have a great team at Datadog that takes care of running etcd for us. Maxime's here; he has built a great tool that automatically takes care of a bunch of this maintenance for us.

So, as a thought exercise, let's say we want to improve the situation. What can we do? Maybe we can introduce a caching layer in front of the etcd cluster, so we can scale read performance without taking a hit on write performance. Or maybe we can introduce a proxy that does some rate limiting before requests are sent to the KV store. Does anyone know of a component that kind of already does this? A component that we're all quite familiar with and use every day? Any guesses? A quick clue: it also has a very well-defined API. If you guessed the Kubernetes API server, you're absolutely right. The Kubernetes API server already implements all of these primitives. It's essentially a cache sitting in front of an etcd cluster: reads can be served from its watch cache, so read performance is obviously better than hitting etcd directly. It has a battle-tested priority and fairness flow control system, so you can protect your Kubernetes control plane. And it has options for batching, as Marcel spoke about before.

So in theory, CRD mode should be better than KV store mode, right? Then why is CRD mode recommended only for up to 1,000 nodes? Are we simply solving this problem by creating dedicated infrastructure and throwing money at it? Let's unwrap it and take a look. The core of the problem comes from the concept of Cilium endpoints. And I'd like to re-emphasize here that Cilium endpoints are different from Kubernetes Endpoints.
There is one Cilium endpoint for every single pod running in your cluster, because Cilium endpoints are the abstraction Cilium uses to manage the network lifecycle of the pod on the node. And in order to enforce network policy, Cilium needs to maintain a mapping from IP address to identity, and it does that by watching Cilium endpoints. So every single Cilium agent in your cluster has a watcher against the Kubernetes API server to get notified of Cilium endpoint updates.

Let's take a look at a scenario. Say we have a 5,000-node cluster, and a new pod gets scheduled onto one of those nodes. Soon after, your container runtime makes a CNI ADD call to the Cilium agent, and the Cilium agent creates an endpoint object for you. That endpoint needs to live in your API server, because you're using the CRD mode, so an update gets posted to the API server. Now remember, we have a 5,000-node cluster, and Kubernetes needs to support around 100 pod changes per second; that's the churn we're targeting. With a churn of 100 endpoint updates per second and a 5,000-node cluster, your control plane needs to process 500,000 watch update events per second. That is a lot of update events. Depending on how you've configured your Kubernetes cluster, this will either completely kill your Kubernetes control plane or, if you have proper flow control and everything configured, it will take forever to propagate those updates through your entire cluster, which is also not great.

So what's the solution here? Luckily, as Marcel mentioned, Kubernetes had the same problem with kube-proxy, and the solution was Kubernetes EndpointSlices, which let you batch your individual endpoint objects into fewer endpoint slice objects. A similar solution was implemented for Cilium, and it's called Cilium endpoint slices. With endpoint slices, you can configure a batch size of up to 100, which brings your 500,000 updates down to around 5,000 updates. Those are a lot more manageable. You're basically trading off some latency for scalability.

The slide looks a little complex, but we'll walk through it. The part where a node has a pod scheduled on it really does not change: the container runtime makes a CNI ADD call, the Cilium agent creates a Cilium endpoint object, and that gets posted to the Kubernetes API server. The new thing is that the Cilium operator is now watching for Cilium endpoints, instead of every Cilium agent watching for them. The Cilium operator is responsible for batching all of the Cilium endpoints into Cilium endpoint slices, and only the batched Cilium endpoint slice objects need to be sent out now. That's roughly 5,000 events, a lot more manageable.

Over the years, as folks started running Cilium endpoint slices in production, a lot more options appeared for fine-tuning them. You can precisely control what your batch size is going to be, how Cilium endpoints are grouped into Cilium endpoint slices, and the QPS at which the operator writes to the Kubernetes control plane. And there's also a way to configure this automatically now, so the rate limits can be decided based on how big your clusters are. So it's getting a lot better now.
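Here is a rough sketch of the batching idea with the numbers from the talk; the real logic in the Cilium operator is more sophisticated, but the trade-off is the same.

```go
package sketch

// Numbers from the talk: 5,000 nodes watching, ~100 pod changes per second.
// Per-endpoint objects: 100 updates/s x 5,000 watchers = 500,000 events/s.
// With slices of up to 100 endpoints, ~100 endpoint changes collapse into
// one slice update, so the fan-out drops to roughly 5,000 events/s.

const maxEndpointsPerSlice = 100

type endpoint struct {
	Pod      string
	Identity int64
}

type endpointSlice struct {
	Endpoints []endpoint
}

// batch groups endpoints into slices of at most maxEndpointsPerSlice.
// One slice update replaces up to 100 endpoint updates, at the cost of a
// little extra propagation latency while updates accumulate in a slice.
func batch(eps []endpoint) []endpointSlice {
	var slices []endpointSlice
	for len(eps) > 0 {
		n := len(eps)
		if n > maxEndpointsPerSlice {
			n = maxEndpointsPerSlice
		}
		slices = append(slices, endpointSlice{Endpoints: eps[:n]})
		eps = eps[n:]
	}
	return slices
}
```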
But if you take a closer look at the documentation, Cilium endpoint slices are still marked as beta. To understand why that's the case, and to talk a little about where we're headed, I'll let Marcel continue.

Okay, so let's first recap the story. Starting out, Cilium supported only the etcd mode. Then CRDs were introduced, and Cilium started working with CRDs. In Kubernetes, EndpointSlices were introduced, and now we have beta Cilium endpoint slices in Cilium. So what's the future? If you're interested in keeping track of all the scalability-related improvements in the context of Cilium, this is the issue you can follow. And long story short, there are three important things to mention.

First of all, we want to make Cilium endpoint slices stable. As Hemant showed before, there is quite a lot of tuning you can do right now, but if we want to support end users, we want to make it really easy to use: you just enable it and it works at scale. But then, once we say it's stable, we still need CI testing. We want to cover the scalability of Cilium and CRDs in our CI, so that once we call it stable and people start using it, we can make sure there are no regressions. And I keep saying 5,000 nodes, but the thing is that it's way more than that: all the pod churn, network policies, all the dimensions I mentioned before. And last but not least, for users who are running in KV store mode right now, we are working together with Datadog on making it possible to migrate from the KV store to CRDs.

I think the biggest lesson learned in all of this is that you need to consider the access patterns in your clusters, because as the scale grows or the churn increases, you need to be aware of the number of events that your control plane, whatever it is, CRDs or etcd, needs to handle. And that's the most important thing to consider when scaling up Kubernetes.

One more honorable mention. Even though we're comparing EndpointSlices and Cilium endpoint slices, there is still a difference, because one is a native Kubernetes resource and the other is a CRD. Native resources are serialized as protobuf, while CRDs are JSON-encoded, and that's a really important gap to keep in mind, even though you don't see it. There is work on CBOR serialization, and the idea is that it will close the gap between native resources and CRDs, so that's super exciting stuff to look out for. But we don't think it's required for scaling the CRD mode in the context of Cilium.

So thank you all. If you're interested in scalability topics, please come chat with us. You can visit the Cilium booth, and also please join the SIG Scalability channels, either Cilium's or Kubernetes', we're on both, and please share your feedback. Thank you. And now I think we have five minutes for questions, if there are any.

No? Marek? There is a microphone somewhere. Okay, yes. I have a question. Why do you create a separate Cilium identity just to listen for label updates, if you could watch for only metadata changes, which Kubernetes supports as a feature?

So, Cilium identities are actually not the main source of churn; the endpoints are. So I don't think it matters much there. And Cilium identities allow you to group different workloads together, so you can configure how you want your identities to be allocated.
If you just focus on a certain set of labels, you can aggregate different workloads into just one identity, so they all share one identity. We need that abstraction to handle that.

Thanks, it was a really great talk, very interesting. I'm curious about the definition of churn rate. I would have guessed it means how often pods are being created and deleted, but is it actually how fast the kubelet is updating the status, and is that another knob you can turn?

This is a very interesting question. It depends on the component and what churn means to it. Some components care only about pod creations. But still, even if you just watch for pods, you are receiving all of the updates. Maybe you are only interested in new pods being created, in kube-scheduler, let's say. So pod churn usually means all the creations, deletions, and updates to pods, but depending on the context, sometimes you only care about some part of it. You are always receiving all of the updates, though, if you are watching the pods.

And is there anything you can change to say, I only want to be updated on changes to certain fields, or to tell the kubelet, please update at a lower frequency?

Probably not. On the throttling side, though, this is the purpose of priority and fairness, and that's one of the advantages of CRDs and the API server over etcd: it can take the amount of churn into account and throttle different write requests.

Thank you.

And one of the reasons CiliumEndpoint was created in the first place is that we don't care about all the pod status updates. Every time somebody posts a pod status update, we don't have to be notified about it. Creating a separate object allows us to decouple those two things and make sure we only keep the relevant information. So yeah, thanks.

Okay, thank you. Thank you. Thank you.