It's very nice to be here with you today, and it's really nice to see so many of you are interested in etcd and control plane performance. Today we're going to talk about etcd performance, but in a much more general way about the performance of control planes and the interaction between API servers and etcd. I'm Laurent Bernay, I work at Datadog. And I'm Marcel Gémba, from Isovalent.

So to get started, we're going to talk about scaling control planes, and one of the reasons I can talk about this is that at Datadog we run a large number of very large Kubernetes clusters. To give you an idea, we have hundreds of clusters of between 1,000 and 6,000 nodes, and as you reach thousands of nodes in a cluster, you start to find interesting challenges.

We started like everybody does: with a very simple control plane where you have a single master. I'm sure you're all familiar with these components. You have a single node running etcd, which is responsible for storing the resources of your cluster; the API server, which serves the Kubernetes APIs; and the two core controllers: the scheduler, which is responsible for scheduling workloads on nodes, and the controller manager, which runs the reconciliation loops and makes sure the state of the cluster is what is expected. And of course you have all the other components interacting with the masters.

So you start with this, but of course it is not very resilient: if you lose the node running all these components, you lose your cluster. The next step is usually to have multiple masters: exactly the same setup, but instead of a single master you have three masters running the same components. This slide is a bit misleading, though, because all of these components are stateless except etcd, which is the data store of the cluster. I drew three etcd boxes, but it's actually an etcd cluster. So you can see that all the components are stateless except etcd, which is stateful, and it's a bit odd to design it this way. A very common optimization is to move etcd outside of the masters: you have an etcd cluster with your etcd nodes, and then your masters, which run the API servers, controllers, and scheduler.

The next interesting thing, if you're familiar with Kubernetes, is that the scheduler and the controllers only run a single active instance at any given time: they use leader election, so even if you have three or five schedulers or controller managers, only a single one is active at any given time. In terms of sizing your resources this is not very efficient, because you're running processes that won't be active but will still consume resources in your cluster. So the next optimization is to move the controllers and the scheduler outside of the masters and only run two of each: one active and one passive, so you can fail over. And then you can add as many API servers as you want.

If you've run large Kubernetes clusters, you've probably been impacted by something else: in a large cluster you can get big spikes of events, and events are a specific resource; sometimes they end up consuming 80 or 90% of the space of your etcd cluster. When this happens it can impact the behavior of the whole cluster just because of events, and events are not really that useful. So a common optimization for large clusters is to split out a dedicated etcd for events. You end up with a setup where one etcd is dedicated to the event resource, so if you have a big spike of events this etcd might get slower, but the rest of the control plane keeps working fine because the main etcd behaves as normal.
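One common way to implement this split is the kube-apiserver `--etcd-servers-overrides` flag, which routes a given resource to a different etcd cluster. A hedged sketch; the endpoint names are hypothetical and other flags are elided:

```
# Hypothetical kube-apiserver flags: store the core "events" resource in its own etcd cluster.
kube-apiserver \
  --etcd-servers=https://etcd-main-1:2379,https://etcd-main-2:2379,https://etcd-main-3:2379 \
  --etcd-servers-overrides="/events#https://etcd-events-1:2379;https://etcd-events-2:2379;https://etcd-events-3:2379" \
  ...
```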
Now that we've seen all these components and the different optimizations you can make, we can talk about sizing the control plane. For etcd clusters you usually have three or five nodes; it makes little sense to have more because it's a quorum-based system. Three is usually enough for resiliency, but we prefer to run five, because if you run three and you lose one node you end up in a situation where you have to be extremely careful: if you lose a second node, you're hosed and you've lost everything. With five, if we lose one we can still lose another one before being in a catastrophic situation.

In terms of key resources for etcd, I'd say the most important one is disk: make sure you have very fast disks and monitor the latency of your etcd disks, because this is really something that can slow down your cluster. Something a little less obvious is that etcd also needs significant network throughput, because when an API server starts and fetches resources from etcd, the volume of data transferred can be significant. So etcd nodes need fast disks and a fast network.

API servers can scale horizontally, which is great: if you have many things connecting to API servers, having more API servers will always help. The catch is that as your cluster gets bigger, each API server still needs to cache everything in the cluster, so you will probably need to increase the amount of memory of each API server. So scale them horizontally, but don't forget to add more RAM as your cluster grows. Controllers and schedulers are easier to size because they mostly consume CPU, so just give them a fairly significant amount of CPU and you'll be good.

Of course this only works if you run your own clusters, which we do. In many cases clusters are managed services from providers, and they do all the work we mentioned here in different ways. But even if you use managed services, you can still play nice with the control plane provided by your provider. For instance, you can make sure the number of nodes in your cluster remains reasonable, so avoid going much above 3,000-4,000 if you can. The number of services can also have a significant impact on the cluster, so be careful with how many services you have, and if you churn pods a lot, that is also pretty intense on the control plane.

Another thing you can do when you run large clusters is to try to decrease the load on the control plane, and I'll give a few examples here, but there are many ways to do it. A quick example: if you use gRPC, you probably don't need a normal ClusterIP service in your cluster, because gRPC clients can discover all the backend IPs and load-balance across them themselves.
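One common way to set this up (a minimal sketch with hypothetical names, not necessarily how any particular deployment does it) is a headless Service: DNS resolves directly to the backend pod IPs and kube-proxy has no virtual IP to program, so gRPC clients can discover and balance across the backends themselves:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-grpc-backend        # hypothetical name
  namespace: my-namespace      # hypothetical namespace
spec:
  clusterIP: None              # headless: DNS returns the backend pod IPs
  selector:
    app: my-grpc-backend
  ports:
  - name: grpc
    port: 8080
    targetPort: 8080
```

Clients would then typically point a gRPC channel at `dns:///my-grpc-backend.my-namespace.svc.cluster.local:8080` with a round-robin load-balancing policy.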
And if you do this, it means you don't have to have the endpoints controller reconciling all those services, and endpoints can have a very significant impact on the cluster: you have to reconcile every time there's a change in readiness, and then you have to distribute this data back to kube-proxy everywhere, which is pretty heavy.

Another thing that has been possible for quite some time, but is still not the default: when you have a ConfigMap or a Secret mounted in a pod, by default, if you change the ConfigMap or Secret, it gets updated in your pod. The kubelet watches the resource, sees that it has changed, modifies the data on disk for your pod, and your pod can see that the data has changed. That can be useful for some applications, but in most cases it's not, so you can make your Secrets and ConfigMaps immutable, in which case the kubelet won't watch them and the load on the control plane will be much lower. Also, some controllers have their own optimizations; we're mentioning Cilium here, which has its own KV store for endpoint information to decrease load on the cluster. And finally, and this is going to be the main focus of the talk today: make sure the applications you run in your cluster, your operators and your daemonsets, behave nicely, and we'll define what "nicely" means during the talk.

OK, so we already mentioned a few of the components that make up the control plane: the API server, etcd, the kube-controller-manager, and the scheduler. But what else is there? First of all we have the kubelet. The kubelet runs on each of the nodes, and it's basically running the pods and updating events, and this can actually have a significant impact on API server and etcd performance. Besides that, there's kube-proxy for service load balancing, then various daemonsets, for example the Cilium agent; they are pretty powerful, but with great power comes great responsibility not to overload the API server. Then we also have DNS, the cluster autoscaler, ingress controllers, other controllers you might be running in your cluster, and of course users with kubectl. And, maybe surprisingly, API servers also really like to talk to themselves.

Now we're going to focus on kubectl, to understand the interaction between kubectl and the API server. Let's start with a simple example where you get the information about one specific pod: kubectl issues one request to the API server, with the API server address, API version, namespace, resource type, and resource name.
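With a verbose kubectl invocation you can see this call; roughly (hypothetical pod and namespace names):

```
# kubectl get pod my-pod -n my-namespace -v=8
GET https://<apiserver>/api/v1/namespaces/my-namespace/pods/my-pod
```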
But what happens when you run kubectl get pods? Here you can see that you are basically listing all the pods within the cluster, and this request has limit=500. So if you are running thousands or tens of thousands of pods, kubectl will actually perform tens or even hundreds of API calls to the control plane, which can have a significant impact on the reliability of your control plane.

Another example: you might be interested in seeing the changes that happen to those pods, so you add watch=true. Again, kubectl first lists all the pods and then starts the watch, and the most important part here is that the watch carries a resourceVersion, which means: we want to observe all the changes that happen from the specific version we got back from the list call.

Besides that, you've probably done this before: you describe a pod to see what's going on. Again we have the get call that fetches the pod, but then there is also a list of events. As you can see, we're looking for events with specific field selectors; please remember this example, because it will become important pretty soon, when we understand how this works underneath.

Another example: say you're deleting a pod. You've probably noticed that when you delete something in Kubernetes via kubectl, it first issues the delete, but then you have to wait, and this waiting is again a list call followed by a watch call.

To sum up: there are many components talking to the API server, and when you interact with Kubernetes using kubectl, one simple kubectl command can result in tens or even hundreds of API calls.
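Put together, the calls behind those commands look roughly like this (a sketch with hypothetical names and tokens, in the spirit of verbose kubectl output):

```
# kubectl get pods             -> paged list calls
GET /api/v1/namespaces/my-namespace/pods?limit=500
GET /api/v1/namespaces/my-namespace/pods?limit=500&continue=<token>
...

# kubectl get pods --watch     -> a list, then a watch from the returned resourceVersion
GET /api/v1/namespaces/my-namespace/pods?limit=500
GET /api/v1/namespaces/my-namespace/pods?watch=true&resourceVersion=<rv-from-the-list>

# kubectl describe pod my-pod  -> a get for the pod, then a list of events filtered
#                                 with a field selector (note: no resourceVersion)
GET /api/v1/namespaces/my-namespace/pods/my-pod
GET /api/v1/namespaces/my-namespace/events?fieldSelector=involvedObject.name=my-pod,involvedObject.namespace=my-namespace
```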
Now that we have the basics of the control plane components and we know how the basic API calls work, we're going to dive into a specific issue that illustrates what we meant by wanting our daemonsets and controllers to behave nicely. The issue here is a real one that happened in a production cluster, so it's an actual incident.

In terms of how it started: users began reporting connectivity issues to a cluster; they were not able to connect. We looked at the metrics, and you can see that the API servers were not very happy: all of them were unhealthy, and of course because they were unhealthy they couldn't serve traffic. So we started looking at logs and metrics, and we could see that the API servers were not able to reach etcd and were crashing. Not a very good place to be. So what was happening with etcd? As you can see on the top graph, the etcd nodes were using a lot of memory and were getting OOM-killed, which is, once again, not a very good place to be.

What did we know? We knew the cluster size had been the same for weeks, even months, so it was unrelated to size. We hadn't updated the control plane, so it was not related to a new version of Kubernetes. So we figured it was very likely related to some client doing something against our API, and we started looking at API server metrics, in particular the number of requests. You can see here that during the two incidents the number of in-flight requests is very high: it's typically around a hundred, and during the incident it spikes above a thousand. And, especially in the graph at the bottom, you can see a very big spike in list calls. We're going to talk a lot about list calls, because they can be very expensive for your clusters, especially large ones.

To understand why these calls are expensive, we have to understand how API server caching works. API servers are basically big caches in front of etcd; they do a lot more, of course, but this is the key part for what we're talking about today. When an API server starts, it lists all the resources in etcd and starts etcd watches to get all the updates and maintain its cache. When a client does a get or a list for a specific resource against an API server, it can specify a resource version, which Marcel showed in his example just before. The semantics are: if you ask for a resource at resource version X, it means "give me at least X". If the version in the API server cache is at least as recent as X, you just get it. If the API server doesn't have it yet, it waits a little, because sometimes an API server can be a bit behind: API servers have to process all the changes coming from etcd, so sometimes you ask for X and that API server doesn't have it yet; it waits a few seconds for its cache to catch up, and if it gets the new version it sends it to the client. So what really matters are these two points: asking for X means "give me at least X", and there is the very specific case of resource version 0, which means "give me whatever you have in cache".

So what about get and list without a resource version? If you remember the example Marcel gave before, when we were looking at the verbose output of kubectl we saw the HTTP GET calls, and there was no resource version in there, except in the case of the watch. If you don't send a resource version, the API server understands this as "give me a quorum (consistent) read from etcd". So if you don't specify a resource version, you're going to get the data from etcd. This matters because it is the default behavior of kubectl get: if you do a lot of kubectl get, for example shell scripts looping over kubectl get, it can be very intense on etcd. Also, and this is even more important and counterintuitive: if you use client-go, the default list call for resources also does a consistent read from etcd. If you want to read from the cache, you actually have to pass an option to the list call saying "give me resource version 0"; if you just list, you get a consistent list from etcd.

Here is an illustration of what this means. If we get all the pods in a cluster with 30,000 pods and we don't specify any resource version, we get the data from etcd, and as you can see it takes more than 4 seconds in this example. If we set the resource version to 0, it takes only 1.8 seconds. We actually had to cheat a little to make this work: given the size of the cluster, the full JSON output would be about a gigabyte of data, and most of the time would have been spent processing the data and sending it over the network. So we used a small kubectl trick, asking for the data as a Table, which gives you a summarized version of the objects and is much smaller; the time shown here is mostly the processing time and the time to get the data from etcd.
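To make the client-go point concrete, here is a minimal sketch (hypothetical kubeconfig path, error handling trimmed): the default List is a consistent read served from etcd, while ResourceVersion "0" asks for whatever is in the API server watch cache:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // hypothetical path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Default: a consistent (quorum) read, the API server fetches the full list from etcd.
	expensive, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// ResourceVersion "0": served from the API server watch cache, much cheaper for etcd.
	cheap, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{ResourceVersion: "0"})
	if err != nil {
		panic(err)
	}

	fmt.Println(len(expensive.Items), len(cheap.Items))
}
```

Listing from the cache means the result can be slightly stale, which is fine for most controllers; informers, discussed below, build on the same mechanism.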
What about label filters? When you get resources, you can say "give me all the pods of application A", and this is what the query looks like. It's slightly faster than before when you don't specify a resource version, because there's less data to send to the client; but when you set the resource version to 0, it's almost instant, because the filtering is done on the API server and then there's very little data to send. What's very important here is that when you set no resource version, you're going to get all the data from etcd, and filtering will always happen on the API server. So even if the output is only a few pods, because application A has only five pods in our example, you still need to retrieve 30,000 pods from etcd, which is pretty intense.

And if you remember the describe example from before: if you look at the get calls, you can see that the list of events does not set a resource version, which means the events are retrieved from etcd. The field selector is very precise, because it targets the pod you want the events for, but remember how filtering works: this means you're going to pull all the events of that namespace from etcd into the API server and then apply the filtering. For large namespaces this is pretty intense on etcd and the API servers.

I've mentioned several times that filtering happens on the API servers, and the reason for this is the way etcd is structured in Kubernetes. This is how keys are organized: the resource type, the namespace, and the resource name. So you can ask etcd for a specific resource, for all the resources of a type in a namespace, or for all resources of a given type, but there is no other kind of filtering in etcd: all the other filters happen in the API servers.

Here are three examples. The first is "give me all the pods of app A": this means getting all the pods from etcd into the API server and then filtering on the API server, which is why it's slow. The second is "give me all the pods of app A in the namespace datadog", and this is much faster because in that namespace we only have a thousand pods, not 30,000. And the last one is the same query with resource version 0, which means "serve this from your cache", and it's much faster again because the filtering is done in the memory of the API server, which is far more efficient. So whenever you write an operator or a controller, think about using an informer, because it's going to be much more efficient.

In summary, what we want you to remember is that list calls go to etcd by default and can have a huge impact: even with filters that discard most of the data and return a small result, you may still fetch everything from etcd, and that's very expensive. Use informers as much as you can.
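As a rough sketch of what "use an informer" looks like with client-go (hypothetical kubeconfig path and node name, error handling trimmed): the informer does one list plus a watch, keeps a local cache up to date, and lookups are then served from memory instead of hitting etcd:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, _ := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // hypothetical path
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	// Index pods by node name so "which pods run on node X?" becomes a local map lookup.
	_ = podInformer.AddIndexers(cache.Indexers{
		"byNode": func(obj interface{}) ([]string, error) {
			pod := obj.(*corev1.Pod)
			return []string{pod.Spec.NodeName}, nil
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// No API call here: answered entirely from the informer's cache.
	pods, _ := podInformer.GetIndexer().ByIndex("byNode", "node-1") // hypothetical node name
	fmt.Printf("%d pods on node-1\n", len(pods))
}
```

This is also essentially the shape of the fix for the incident described next: instead of listing pods per node with a field selector, the controller answers "which pods run on this node?" from an index on its local cache.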
So let's get back to our incident. We knew the problem was coming from list calls, because we saw the spike in requests for those calls, and the next step was to understand which application was making them. On our clusters we use audit logs extensively to understand what's happening: audit logs are very helpful because they show you all the queries, but also the query time. This view here is the aggregated query time per user for list calls, and we can see a single user accounting for two or more days of processing over 20 minutes. That's a lot of processing, and the reason it can be more than 20 minutes is that there are many goroutines in the API server, so if you aggregate the time across all the queries it adds up to much more than 20 minutes of wall-clock time. And this is another aggregation: here we can see that over a week, a single service account is responsible for almost three days of processing, and that's a lot. The service account is called node-group-controller.

So what is it? We have an in-house controller to manage pools of nodes at Datadog: teams create a CRD describing a pool of nodes of a given shape, and the controller creates an auto-scaling group (or a managed instance group if it's on Google) and manages it. We had been using this extensively for two years at the time, and it was working perfectly. What had happened, though, is that we'd had an incident a few weeks before where someone had deleted a node group by mistake and completely deleted the workload, because when a node group is deleted the controller removes all its nodes. So we thought: we can very easily protect against this by not allowing deletes if pods are still running on the nodes. We wanted to implement deletion protection; that seems simple enough.

So let's look at how it worked. It was a very naive and very simple approach. When a node group is deleted, the first thing we do is list all the nodes of the node group, based on labels. These node groups have dozens or even hundreds of nodes, and this is not a big problem; the big problem happens in step 2. Next we wanted to know whether there were pods running on these nodes that were not daemonsets, and to find out we did a very simple "list all pods on the node": a get pods with a filter saying "node X". Except, if you remember how this works in etcd, that means getting all the pods from etcd and then filtering for node X. So even if there are only 5 pods on a given node, you still need to fetch 30,000 pods from etcd into the API servers and then get the 5 you need for this node. And because we're very efficient, we did all of this in parallel: with hundreds of nodes in the node group, we were making hundreds of these list calls in parallel, each of them retrieving 30,000 pods from etcd. As you've seen in the first graphs, etcd doesn't like that very much, and it degrades quickly.

What we learned is that this kind of call can be very dangerous. The way we fixed it was actually pretty simple: we replaced the list with an informer, and that was it. So, again: use informers whenever you can. And audit logs are extremely helpful, because they tell you what is happening and when, and give you an idea of what is consuming processing time.
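If you don't already collect audit logs, a minimal policy along these lines can be enough to attribute this kind of load (a hedged sketch with hypothetical choices; the file is passed to the kube-apiserver with --audit-policy-file):

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
# Record request metadata (user, verb, resource, timing) for reads,
# without logging request or response bodies.
omitStages:
- RequestReceived
rules:
- level: Metadata
  verbs: ["get", "list", "watch"]
- level: None   # ignore everything else
```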
We've seen this incident, but it's very common: many people have seen it, and we've seen it multiple times ourselves. This example is an interesting one, but we've seen other issues too, and of course the community is aware of them. Marcel will now tell us what the community has been working on to address this.

OK, so the Kubernetes community has been working pretty hard to address these kinds of issues, especially ones like the example Laurent just showed us. One of the really cool features, currently in alpha in Kubernetes 1.27, is streaming lists. What happens when you do a list served from the API server cache is that the API server prepares the whole list response in memory and holds it for the entire duration of the request. So you can imagine that if you have, say, one gigabyte of ConfigMaps, preparing the response for the list call consumes quite a lot of memory. In 1.27 we have streaming lists, which you can think of as a kind of watch: the data is streamed back to the client without the API server ever holding the entire response in memory.

Let's see how it worked before. Say we have one gigabyte of ConfigMaps and one API server instance. With eight informers, the memory usage of the API server reached around nine gigabytes; just eight informers trying to list one gigabyte of ConfigMaps, which is quite a lot. With 16 informers, the API server actually got OOM-killed. With streaming lists, you can run even a thousand informers and consume only about 7 gigabytes of API server memory. So it's basically a 100x improvement in API server memory usage, thanks to streaming lists.

So what else is there? Priority and fairness. We saw that there are multiple reasons why the control plane can be overwhelmed, and API priority and fairness has been worked on quite extensively for about two years; it has been in beta since 1.20. The main goal is to protect the API server from being overloaded. The idea is that you limit the number of requests that can execute concurrently, but you split that concurrency into different priority levels, and within a priority level you can still have multiple users executing calls. As the name suggests, there is also fairness: the concurrency shares are distributed across the different users in a fair way.

Let's take a look at an example of how it works. Say a request comes in to the API server, and we have a set of flow schemas. A flow schema basically describes which kind of request should go to which priority level, so it classifies requests. The request is checked against flow schema number one, but it doesn't match; then against flow schema number two, and let's say it matches. A flow schema points to one of the priority levels you have in your cluster; in this case, say, the priority level workload-high. So we know the new request goes to the priority level workload-high, but what happens next? A priority level actually contains a set of queues. Say in this example we have three queues, and only two users, A and B, are using this priority level. Each user is assigned a fixed set of queues: A is assigned queues 1 and 2, and user B is assigned queues 1 and 3. When we get this new request from user B, we check how much work is already waiting in the queues assigned to this user, so Q1 and Q3: there are two requests waiting in Q1 and only one in Q3, so priority and fairness decides to put this request in Q3.
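The flow schemas and priority levels are regular API objects, so you can look at how requests are being classified and queued on your own cluster; a hedged sketch (the debug endpoints are served by the kube-apiserver):

```
kubectl get flowschemas
kubectl get prioritylevelconfigurations

# Live view of priority levels and their queues:
kubectl get --raw /debug/api_priority_and_fairness/dump_priority_levels
kubectl get --raw /debug/api_priority_and_fairness/dump_queues
```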
As we mentioned before, requests can have very different impacts on the API server and on etcd, and priority and fairness takes that into account: one request can have a different concurrency weight. A simple get gets a weight of one, but if you list something and there are thousands of pods or ConfigMaps, the concurrency weight assigned to that request can grow up to 10. Similarly, we haven't said much about mutating requests, but the way they work is that you change something, which is a simple operation from etcd's point of view, but afterwards every watcher that is interested in changes to that resource needs to receive the event. So the more watchers you have on a particular resource, the more expensive a mutation becomes, and priority and fairness takes that into account too: when you mutate a resource, it checks how many watchers there are and can assign a different weight to the request based on that.

So what's the default configuration of priority and fairness? There are six priority levels by default. There is exempt, which basically bypasses the whole mechanism. Then we have system, which is meant only for the kubelets; leader-election, used only for leader election of the core components like the controller manager and scheduler, and the kube-system service accounts that use leader election; workload-high, which is only for the kube-controller-manager and the scheduler; workload-low, for all other components, including the node-group controller we mentioned before, which would be using workload-low in this case; and global-default for everything else. So if you, for example, run kubectl, it goes to global-default, which means you're basically competing with your coworkers for the concurrency shares in global-default; if your request doesn't go through, maybe ask the person next to you whether they're running something.

Let's look at an example priority level configuration, workload-high: it has 40 concurrency shares, 128 queues, and a hand size of 6; the hand size is the number of queues assigned to one particular user. The queues are also bounded, because we don't want an infinite number of requests waiting, so each queue can hold up to 50 requests. One more pretty cool and important feature, implemented in 1.26, is borrowing: when a priority level doesn't really need its concurrency shares, it can lend them to other priority levels. In this example, workload-high can lend 50% of its concurrency shares to other priority levels if they need them.

So let's get back to some useful ways you can use priority and fairness in your clusters. You can have, for example, a misbehaving controller or daemonset, like in the case Laurent described. With priority and fairness the cluster should keep working and should be protected from being overloaded, but the misbehaving workload can still impact other controllers running within the same priority level, basically throttling them. What you can do is create a new priority level and a new flow schema that redirects those requests to the new priority level, to keep your controller from stealing concurrency shares from the other controllers. Another use case: we mentioned that it's a good idea to split etcd into a main etcd and an events etcd, but sometimes you can't do that. With a high churn of events, what you can do is create a new priority level, send all the event-related requests (creations, lists) to it, and basically throttle how many events are being processed. Worst case you'll miss some events, but that's still better than losing your control plane.
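As a rough sketch of the first use case above, isolating a misbehaving controller into its own priority level (hypothetical names and numbers; the flowcontrol API group version depends on your Kubernetes version):

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: PriorityLevelConfiguration
metadata:
  name: node-group-controller
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 10     # small share so it cannot starve other workloads
    limitResponse:
      type: Queue
      queuing:
        queues: 16
        handSize: 4
        queueLengthLimit: 50
---
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: node-group-controller
spec:
  priorityLevelConfiguration:
    name: node-group-controller
  matchingPrecedence: 1000           # evaluated before the catch-all schemas
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: node-group-controller   # hypothetical service account
        namespace: kube-system
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      clusterScope: true
      namespaces: ["*"]
```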
So, to conclude: running large clusters is still challenging; there have been many improvements from the community; the defaults are not always enough; and most importantly, please avoid list calls, and if you really need to list, list from the API server cache, and also use informers. Thank you all for coming, and now we have some time for questions.

Hi, we are running on premise on bare metal, and we actually have a similar but slightly different challenge: running dense clusters with lots of pods on bare metal. Do you do anything for those use cases?

Could you repeat? With large numbers of...?

We are running dense clusters with large numbers of pods, mainly small APIs, so we have a good amount of memory and we try to achieve 500 pods per node, but we see all kinds of strange things happening.

That's pretty interesting. I think there are two different parts to this problem: one is the control plane, which we covered here, but I could also imagine problems just in the kubelet managing that many pods. I'm not sure which kind of strange behaviors you've seen, but it can be either the control plane or just the kubelet. We have many issues with our clusters, but not these ones, because we don't have dense nodes: at Datadog we have a lot of nodes, but most of them run a relatively small number of pods, so we don't have the issues you're describing. But I wouldn't be surprised if it's tough on the CNI and on the kubelet too.

I would also add that for large clusters Kubernetes actually recommends around 30 pods per node, and the maximum supported is 110, so with 500, anything can happen, basically.

Hello, question here. Assuming that we keep the limit of 110 pods per node, we still have a large cluster, and we try to follow the best practices you presented, what is the bottleneck? How much can we scale the etcd database to actually keep large clusters of hundreds of nodes working?

It actually depends on your use case. For example, if you use services heavily, services are super expensive for the control plane, because you have the list of endpoints within the service and it needs to be broadcast to all the nodes. So first of all, take a look at the workloads you're running and understand which parts of them put heavy load on the API server. To give you an example, some batch workloads can run on even 15,000 nodes if they are very simple, compared to serving workloads with services.

Second question: have you reached the limit of scaling the etcd database at some point?

I can only speak for Datadog, and I don't think we have. I think the biggest issue we had at one point is that we hit the 2-gigabyte limit, because that was the default etcd quota; now that the limit is 8 gigabytes, we've never reached it. But we're also careful: we try to make sure we never get a cluster with more than 6,000 nodes and roughly 50,000 pods, and within this envelope we've been good.
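For reference, the etcd backend size limit mentioned here is controlled by an etcd flag; a hedged sketch (other flags elided):

```
# etcd defaults to a ~2 GiB backend quota; raising it to 8 GiB:
etcd --quota-backend-bytes=8589934592 ...
```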
Something we should also have mentioned, but it was hard to fit everything into the talk: sometimes your cluster is very stable at a very large scale and everything is OK, but then an event happens and you end up in a situation that isn't stable anymore, and the problem is that you don't have enough buffer for the cluster to keep behaving. So even if things are running fine, make sure you have enough buffer in terms of memory and CPU, so that if a bad event happens your cluster can still recover. We've been in the situation where everything was completely fine at a given size, then something big happened, and we couldn't easily recover from it.

Yes, and I would also add that priority and fairness helps here: you have a stable cluster, then some event happens, and priority and fairness should take that into account. Say you suddenly create 100 pods: priority and fairness should throttle the creation and scheduling of those pods, taking into account all the broadcasting of events, and basically spread the load over time.

Thank you.

Thank you for the talk. Did you consider, for the node controller case, using owner references with blockOwnerDeletion, or was it not a fit for your use case?

I don't know; it might be what we use now, with owner references. You mean to make sure we can't delete node groups while pods are still running? It might be that we've implemented the protection this way now. At the time, the first approach was very naive because we just wanted to address the issue, but I know we've improved it a lot since, and it's probably one of the approaches we use now.

OK, cool. I was thinking that, on paper, it should work for that use case, but maybe there was some limiting factor in practice. Thank you very much.

Hello, thank you for the session. I have one question regarding the max-requests-inflight parameter. Does it affect your testing at all? Because I believe that by default it sums up the mutating requests and the normal ones.

Yes. Before priority and fairness, this was basically the only overload-protection mechanism, and now, with priority and fairness, as you mentioned, they are summed up, and those concurrency shares are basically the total number of requests that can be processed by API priority and fairness. So it's really important to configure it properly and to understand how many requests your control plane can execute, because if you set it to, say, a million, then priority and fairness won't kick in when it's needed.

OK, so what values did you use for your tests? Have you used the default ones, or something else?

I can speak from the scalability point of view: we run performance tests to ensure there are no regressions in Kubernetes, and for that particular case we specify 10 in-flight requests per core of the API server.

Cool, thank you.
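For reference, the parameter discussed in that last exchange corresponds to a pair of kube-apiserver flags; a hedged sketch with the upstream default values:

```
kube-apiserver \
  --max-requests-inflight=400 \
  --max-mutating-requests-inflight=200 \
  ...
# With priority and fairness enabled, the sum of the two is the total concurrency
# limit that gets distributed across the priority levels.
```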