Hi, my name is Maciek Pytel and I'm a software engineer at Google, as well as a long-term contributor to Cluster Autoscaler. In this talk, I'm going to focus on Cluster Autoscaler and specifically its scale-down logic. But before I start, I wanted to mention that for the Q&A session at the end, I will be joined by other engineers from SIG Autoscaling, and we will be happy to take questions about both Cluster Autoscaler and the other components maintained by SIG Autoscaling, such as HPA or VPA. In the past, we have done multiple deep dives at previous KubeCons covering Cluster Autoscaler; I think we've done both some general overviews and talks focused on the scale-up logic, and I've linked some of them on my slide. This time I'm going to focus on scale-down, and in order to make that possible, I'm going to assume a certain level of knowledge of Cluster Autoscaler and how it works in general.

Just as a reminder, let me do a quick high-level recap of Cluster Autoscaler. First of all, Cluster Autoscaler's primary job is to provide nodes for all the pods running in the cluster so that they can schedule. It uses the scheduling requirements of the pods as a signal, so it looks at things like a pod's declared resource requests (CPU requests, memory requests, and so on), as well as node affinity, node selectors, pod affinity, and any other scheduling constraints. What it does not look at are actual metrics, like the real CPU or memory usage of a node. I also want to quickly cover one more Cluster Autoscaler concept, the node group. A node group is a resizable set of identical nodes, and this is how Cluster Autoscaler interacts with the underlying cloud: by resizing node groups. The node group is mapped to a different concept on each cloud provider. For example, on AWS it can be an ASG, on GCE it's a MIG (a managed instance group), and in something like Cluster API it can be a MachineDeployment.

With that in mind, let's move on to scale-down. I think the most important thing about Cluster Autoscaler's scale-down is that it looks to remove specific nodes. It always operates at the level of a node; it does not try to calculate a desired size of a node group. Why is that? Let me give you an example. Let's say we have the four nodes I have on this slide, with different pods already running on them, and the different dimensions of the pods and nodes represent different resource requests. For example, we can imagine that on the horizontal axis we have CPU and on the vertical axis we have memory, so a wider pod would request more CPU and a taller pod more memory, and a node's width and height would represent how many cores and how many gigabytes of memory it has. So initially we have four nodes in this example, and the question is whether three nodes would be enough to run all the pods, or in other words, can we remove one of the nodes? To answer that question, I'm going to try to remove one of the nodes and then simulate what the scheduler would do, in order to see if all the pods would still be able to schedule in the smaller cluster. Let's start with node A and see what happens if I try to remove it. I've done some simulated scheduling, and I can see that in fact every single pod that was previously running on node A could run on some other node in the cluster. So the answer to my original question is yes, three nodes are enough if we remove node A.
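As a quick aside on the node group concept from a moment ago, here is a simplified, purely illustrative sketch of the idea in Go. The real interface lives in Cluster Autoscaler's cloudprovider package and has more methods and somewhat different signatures; this only captures the shape of "a resizable set of identical nodes".

```go
package cloud

// NodeGroup is a simplified, illustrative view of what a "node group" gives
// the autoscaler: a resizable set of identical nodes backed by an ASG, a MIG,
// a MachineDeployment, and so on. The real interface in Cluster Autoscaler's
// cloudprovider package is larger and differs in its details.
type NodeGroup interface {
	MinSize() int                     // lower bound the autoscaler must respect
	MaxSize() int                     // upper bound the autoscaler must respect
	TargetSize() (int, error)         // current desired size on the cloud side
	IncreaseSize(delta int) error     // scale up by delta nodes
	DeleteNodes(names []string) error // remove specific nodes, i.e. scale down
}
```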
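And here is roughly the check I just walked through for node A, written out as a minimal sketch. The real Cluster Autoscaler reuses the scheduler's own code for this simulation and evaluates all scheduling constraints; the first-fit loop below, with made-up pod and node types that only track CPU and memory requests, merely stands in for it.

```go
package main

import "fmt"

// pod and node are made-up types for this sketch; only CPU and memory
// requests are considered, whereas the real simulation runs the full set
// of scheduler predicates (affinity, selectors, taints, and so on).
type pod struct{ cpuMilli, memBytes int64 }
type node struct{ freeCPUMilli, freeMemBytes int64 }

// podsFitElsewhere answers: if we removed a node, could every pod that was
// running on it be placed on one of the remaining nodes?
func podsFitElsewhere(evicted []pod, remaining []node) bool {
	for _, p := range evicted {
		placed := false
		for i := range remaining {
			if remaining[i].freeCPUMilli >= p.cpuMilli && remaining[i].freeMemBytes >= p.memBytes {
				remaining[i].freeCPUMilli -= p.cpuMilli
				remaining[i].freeMemBytes -= p.memBytes
				placed = true
				break
			}
		}
		if !placed {
			return false // this pod would stay pending, like pod D in the node B example below
		}
	}
	return true
}

func main() {
	// Pods from the node we would like to remove, and free capacity on the others.
	evicted := []pod{{cpuMilli: 500, memBytes: 1 << 30}}
	remaining := []node{{freeCPUMilli: 250, freeMemBytes: 4 << 30}, {freeCPUMilli: 1000, freeMemBytes: 2 << 30}}
	fmt.Println(podsFitElsewhere(evicted, remaining)) // true: the second node has room
}
```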
However, let's now try to remove node B and do the same scheduling exercise. This time the result is different: pod E can still schedule on node A, but pod D would remain pending in this scenario. The reason is that while there are enough overall resources in the cluster to satisfy its requirements, those resources are fragmented across multiple nodes, and there is not a single node with enough resources for pod D to schedule. So as we can see, whether we can perform a scale-down safely actually depends on which node we choose to remove. For that very reason, Cluster Autoscaler doesn't look at the entire cluster or the entire node group. It looks at each individual node, and for each node it tries to answer the question of whether that node can be deleted. To do that, it looks at three basic criteria: first, it checks whether the node can be deleted at all; second, it looks at something called utilization thresholds; and third, it checks whether all the pods currently running on the node could be scheduled on some other nodes in the cluster. I'm going to drill down into all of those criteria in a second, but for now let me mention that in order to delete a node, Cluster Autoscaler needs all those criteria to be continuously met for 10 minutes.

OK, now let's look at those individual checks. First of all, Cluster Autoscaler checks whether the node can be deleted at all. There are multiple reasons why even an empty node could not be deleted. One of those reasons could be the minimum size constraint that can be defined for each node group; Cluster Autoscaler will of course not violate such a constraint, so it will not delete a node if that would take the node group below its minimum size. We also have an annotation which can be put on a node to opt it out of scale-down. And I think the most common reason why a node would not be eligible for scale-down is that it runs a pod that cannot be restarted, and there are a few categories of such pods. By default, Cluster Autoscaler will not restart kube-system pods, because that may cause some sort of cluster-wide disruption that we want to avoid; for example, restarting metrics-server can potentially bring down HPAs in the cluster for a while, and restarting other system pods may have other cluster-wide consequences. Depending on how it's configured, Cluster Autoscaler may also not evict pods with local storage, for fear of losing data. And Cluster Autoscaler will not delete a pod that is not owned by a controller such as a Deployment. The reason for that is that Cluster Autoscaler is actually deleting pods, not restarting them; it relies on the Deployment or a similar controller to recreate the pod, so if it were to delete a pod that is not managed by a controller, there is no guarantee that the pod would be recreated by anything else. All of those checks can be overridden using the safe-to-evict annotation. That annotation can be placed on a pod and gives you a way to explicitly say whether the pod can be restarted or not.

So that's the first criterion. The second one is utilization thresholds. Let me start by defining what I mean by utilization in this context. As I said in the beginning, Cluster Autoscaler doesn't look at the actual utilization of the node. What it looks at is the sum of the requests of all the pods running on the node, divided by the node's allocatable.
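Before we get to the numbers, here is a minimal sketch of the two annotations I mentioned: the node-level opt-out from scale-down and the pod-level safe-to-evict override. The annotation keys are the ones documented in the Cluster Autoscaler FAQ, but the objects themselves are made up for illustration.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Node opted out of scale-down entirely.
	node := corev1.Node{
		ObjectMeta: metav1.ObjectMeta{
			Name: "node-a",
			Annotations: map[string]string{
				"cluster-autoscaler.kubernetes.io/scale-down-disabled": "true",
			},
		},
	}

	// Pod explicitly marked as safe to evict; setting the value to "false"
	// instead would block scale-down of whatever node the pod lands on.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: "example",
			Annotations: map[string]string{
				"cluster-autoscaler.kubernetes.io/safe-to-evict": "true",
			},
		},
	}

	fmt.Println(node.Annotations, pod.Annotations)
}
```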
For example, if I have two pods that each request half a CPU, and both of those pods are running on a node with two CPUs, that node has a 50% utilization, because two times half a CPU gives one full CPU, which is 50% of the two CPUs available on that node. You can configure a utilization threshold, and Cluster Autoscaler will not delete a node whose utilization is above that threshold. So for example, if the utilization threshold for CPU was 50%, Cluster Autoscaler would not delete any node that has above 50% CPU utilization. This is checked separately for CPU, memory, and GPU, and different thresholds can be configured for each.

I'd like to quickly touch on the reason for this logic. The idea is that in order to improve utilization, we have to cause a certain level of disruption by restarting pods. As the cluster's average utilization increases, there are diminishing returns on further scale-downs, because utilization is already quite high, and at some point it may not be worth the extra disruption cost. That point will differ between clusters with different use cases. On one extreme, we may have a very static cluster which almost never changes and runs the exact same set of pods for a very long time; just restarting them all a few times to get optimal utilization may very well be worth it. At the other end of the spectrum, we have a batch cluster which is constantly being resized to run different batches, where restarting those pods may not be worth it if they are expected to finish by themselves very soon anyway. So with utilization thresholds you can control how aggressive the scale-down will be.

And the final check is what I touched on in my example at the beginning: a check of whether all the pods running on the node can run elsewhere in the cluster. This is done exactly as I did it in my earlier example, by pretending to remove a node and asking embedded scheduler code whether it would be able to schedule all the pods previously running on that node onto some other nodes in the cluster. Let's look at one more example. Let's say I want to delete node C, and it meets all the criteria I discussed before. In this case we can see that I could in fact delete this node, because the only pod running there, pod E, could be rescheduled onto node B. However, if node B was deleted, either by scale-down or by some other logic, maybe manually by the user, node C could no longer be scaled down, and Cluster Autoscaler keeps track of this information in order to avoid doing a scale-down which is no longer safe. Now let's consider a similar example with node B. Once again, the pod running there could be rescheduled onto node A, and once again, if node A was deleted, Cluster Autoscaler remembers that it was going to remove node B by moving its pods to node A, and therefore it may no longer be safe to delete node B. In this particular case it is actually still safe to do so, because the pods could also be moved to node C, and Cluster Autoscaler will figure that out. However, the 10-minute deletion timer which I mentioned earlier would be reset in this case, because Cluster Autoscaler only keeps track of a single way of deleting a node, and it no longer has the same level of confidence once that particular destination is gone.
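To make the utilization check concrete, here is the half-a-CPU arithmetic from above as a tiny sketch, together with the threshold comparison. The function and variable names are made up for illustration, and the real implementation works on full resource lists rather than a single resource.

```go
package main

import "fmt"

// utilization is the sum of pod requests divided by the node's allocatable,
// computed here for a single resource (CPU, in millicores).
func utilization(podRequestsMilli []int64, allocatableMilli int64) float64 {
	var sum int64
	for _, r := range podRequestsMilli {
		sum += r
	}
	return float64(sum) / float64(allocatableMilli)
}

func main() {
	// Two pods requesting half a CPU (500m) each, on a node with 2 CPUs (2000m).
	u := utilization([]int64{500, 500}, 2000)
	fmt.Printf("utilization: %.0f%%\n", u*100) // 50%

	// A node is only a scale-down candidate when its utilization is below
	// the configured threshold (0.6 here, purely as an example value).
	threshold := 0.6
	fmt.Println("scale-down candidate:", u < threshold) // true
}
```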
Right, so now that we've quickly covered the algorithm, I want to touch on some limitations of the scale-down logic, and I think there are two major ones currently. The first is that Cluster Autoscaler can only drain one node at a time, and by drain I mean the process of tainting the node and deleting all the pods running on it, which can take some time. Cluster Autoscaler can delete multiple empty nodes in parallel, though, and a node that only runs DaemonSet pods is considered empty in this context, because DaemonSet pods don't need to be rescheduled elsewhere in the cluster. What that means is that deleting empty nodes is many times faster, more than ten times faster in most configurations, than deleting nodes that run some pods, and in very large clusters it can be very important to optimize scale-down by making sure that pods terminate in a way that leaves a lot of empty nodes which can be scaled down quickly. This is less relevant in a cluster of maybe dozens or a hundred nodes, but it becomes very important when we look at thousands of nodes. The other limitation is that Cluster Autoscaler cannot scale up and scale down at the same time. It can scale up multiple different node groups, it can scale up the same node group multiple times before the previous scale-up completes, and it can also scale down multiple empty nodes in parallel, as I described. However, it cannot combine scale-up and scale-down: if any node is being scaled up, that will unfortunately block the scale-down of any other nodes in the cluster.

So how can we optimize scale-down, and maybe tailor it to our specific cluster? First of all, there are multiple flags that Cluster Autoscaler exposes. Some of the most commonly configured flags, I would say, are those related to how long we wait before deleting a particular node; this is the 10-minute time that I mentioned before, and it can be configured with a flag to any duration. There are also additional flags that allow us to temporarily pause scale-down after certain events: maybe if we try to delete a node and fail, we want to back off for a while, or maybe we don't want to scale down immediately after a scale-up. Another thing that can be, and often is, worth configuring is the utilization thresholds I described earlier. There are also many other, much more subtle flags that can control Cluster Autoscaler's scale-down behavior, but I want to give a word of warning about configuring them. I would recommend a lot of caution and careful testing, because as SIG Autoscaling we don't necessarily test all those flags, and there are some combinations which may lead to various problems. One example I've seen quite a few times is that increasing the number of empty nodes that can be deleted in parallel can, in some scenarios, exhaust the cloud provider's API call quota. This blocks scale-down, because Cluster Autoscaler can no longer make API calls, and it can cause all sorts of other problems too, so the result may in fact be slowing down scale-up and scale-down quite significantly, which is the opposite of what changing the flag was meant to do.

Another common optimization that I would recommend considering is grouping different pods onto separate node groups, so onto separate sets of nodes. This can be achieved in many ways; the way I would recommend is using a combination of a label and a taint on the node, and a matching node selector and toleration on the pod, as in the sketch below.
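Here is a rough, illustrative version of that setup using the Kubernetes API types. The label key, taint key, and values ("dedicated=batch") are made up for the example; any consistent pair will do.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Node in the dedicated node group: a label to select on, and a taint
	// to keep everything else away.
	node := corev1.Node{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "batch-node-1",
			Labels: map[string]string{"dedicated": "batch"},
		},
		Spec: corev1.NodeSpec{
			Taints: []corev1.Taint{{
				Key:    "dedicated",
				Value:  "batch",
				Effect: corev1.TaintEffectNoSchedule,
			}},
		},
	}

	// Pod that is steered onto, and tolerates, that node group.
	podSpec := corev1.PodSpec{
		NodeSelector: map[string]string{"dedicated": "batch"},
		Tolerations: []corev1.Toleration{{
			Key:      "dedicated",
			Operator: corev1.TolerationOpEqual,
			Value:    "batch",
			Effect:   corev1.TaintEffectNoSchedule,
		}},
	}

	fmt.Println(node.Name, podSpec.NodeSelector)
}
```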
The node selector and the label guarantee that the workload goes onto the node group it's meant to go onto, and the taint prevents any other pods from landing on that same node group. A common use case for grouping pods like this is running a batch workload, especially a large one: if the pods in the batch are likely to finish at roughly the same time, this is likely to leave the nodes completely empty, which, as I described earlier, results in much faster scale-down of all those nodes. Another common setup is running pods that cannot be restarted, such as kube-system pods, but maybe also some stateful pods, maybe some database pods that generally cannot be restarted, on a separate node pool. The idea here is that those pods would block scale-down, and we don't necessarily want them to block the scale-down of nodes running other pods, so it can be very effective to put them on a separate node pool that will never scale down, leaving the other node pools free to scale down more easily. However, there is a cost to all of this: the more we partition our workloads onto separate nodes, the more we limit the bin-packing opportunities available to the Kubernetes scheduler. So this type of optimization tends to be more worth it in a cluster that scales up and down rapidly, where scale-down speed is important, and maybe less so in a cluster which is very static, where a more efficient final configuration, even if reached more slowly, may be preferred.

So that is, I think, what I wanted to say about the scale-down logic of Cluster Autoscaler. I also quickly want to touch on our SIG, SIG Autoscaling. We maintain Cluster Autoscaler, but we also work on other components, such as the pod autoscalers, the Horizontal and Vertical Pod Autoscaler. We meet every Monday at 4 p.m. Central European Time, and we have our own Slack channel, which I would like to welcome you to. With that, I'm happy to take some questions now. As I said, questions about all the autoscalers are welcome. Thank you very much.