Hello, everyone. My name is Marcin Wielgus, and I'm representing SIG Autoscaling here. First of all, let me explain who we are. SIG Autoscaling is a group of people within the Kubernetes community focusing on reducing the cost of running your clusters. Of course, we cannot negotiate your contract with your cloud provider. We cannot give you discount codes to buy cheaper hardware. We don't optimize your applications either. But we can make sure that you always have just enough instances of your applications, that they are properly sized, and that the computing power provided to run those applications tightly fits their needs. In other words, that you are not wasting resources, and at the same time, your applications are not overloaded. But having just enough resources doesn't only improve your efficiency. It also helps with the ease of deployment, development, and reliability, as your environment will adjust to the changes made to your applications. SIG Autoscaling owns a bunch of components that can help you achieve the aforementioned goals. During this presentation, I will very briefly explain what these components do, so that the people who came here for an introduction get one, and then I will describe the newest developments.

The first component I want to talk about is the Horizontal Pod Autoscaler. The Horizontal Pod Autoscaler is based on metrics that express the load that the application gets. It can be either some real CPU usage or something more custom, like the number of queries per second, or another gauge-like value provided by the user. HPA takes the value of this metric, compares it with the user-defined target, and depending on this comparison, adds or removes replicas, hoping to move the metric towards the desired value. Let's take a look at how it works in practice. Here, the metric will be utilization, understood as the ratio between the actual CPU usage of a pod and the amount of CPU that the pod requested. As you can see, we have here four pods that are getting quite a bit of traffic. They are using around 90% of their requests, and if the traffic increases, they may not have enough resources to handle the requests. However, if we add yet another application replica, then the load will spread across more instances, and the average per-pod utilization will get lower and will be closer to the desired target. If the situation is opposite and we have many instances that are not that much utilized, removing one of them will move the average per-pod utilization closer to the desired target. And that's what the Horizontal Pod Autoscaler does.

HPA has been around for quite some time. We reached the stable version, v2, back in late 2021. It means that, in accordance with the API deprecation policy, the beta versions are about to be removed. v2beta1 was removed in 1.25, and v2beta2 will be removed in 1.26. If you are still using the old definitions of HPA, please update them as soon as possible, because with your next cluster update, they may stop working. In 1.26, we finally made the controller multi-threaded. It's quite embarrassing that it took us that long, but anyway, here it is. Increasing concurrency will help you with dealing with a large number of HPA objects, especially if you are using custom metrics, as obtaining those is really time-consuming. Some users are heavily oversubscribing their clusters and use targets above 100% of utilization. Up to 1.26, we were prone to some corner-case issues there, but with 1.26, this should be fixed.
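To make the HPA mechanics described above concrete, here is a minimal sketch of the core replica calculation, following the formula documented for the HPA (desired = ceil(current * currentMetricValue / targetMetricValue)), applied to the four-pod example; the 75% target is an assumed value, and the real controller adds tolerances, stabilization windows and min/max bounds that are omitted here.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas implements the core HPA rule:
// desired = ceil(current * currentMetricValue / targetMetricValue).
// The real HPA also applies a tolerance, stabilization windows and
// min/max replica bounds, which are omitted in this sketch.
func desiredReplicas(current int, currentValue, targetValue float64) int {
	return int(math.Ceil(float64(current) * currentValue / targetValue))
}

func main() {
	// Four pods at ~90% CPU utilization, with an assumed 75% target:
	fmt.Println(desiredReplicas(4, 90, 75)) // prints 5 -> one replica is added
}
```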
We still hope to finally land scale-to-zero support for custom metrics in the API with 1.26. And post-1.26, we are planning to have a dry-run mode, which will allow you to test your HPA with different metrics without actually actuating the changes.

OK, HPA is about autoscaling a single deployment, or, to be precise, an object that exposes a scale subresource. The API only allows you to provide a single target. What if you have a more complex use case? Well, then you have a problem, a problem that you share with other Kubernetes users. Like: how to ensure more or less equal spreading of pods across three zones of a region? How to guarantee that in case of a zonal failure, pods will automatically go to the other zones and move back when the zone is back online? How to split pods, let's say, in a 70/30 ratio between spot VMs and regular on-demand VMs? How to make sure that pods consume the nodes with a negotiated rate first, and then, if those are not enough, go to the others? And how to make all of these deployments horizontally and vertically autoscaled, and make sure that they work great with Cluster Autoscaler? Well, so far, there was no good answer to these questions. So we decided to provide some solution to the problem.

As a SIG, we are bringing a new tool, a new controller, and a new CRD-based API. The central element of this API will be called Balancer. The Balancer object will have pointers to multiple deployments, or anything that exposes the mentioned scale subresource. Each of these deployments may have a different node selector, different tolerations, possibly even a different configuration. But what they have in common is that the pods from these deployments build kind of one service, one application. The Balancer's main task will be to properly size these deployments. Each Balancer will have a placement policy according to which it will distribute replicas across its targets. For example, with a proportional policy of 70/30, it will distribute 10 replicas like this: seven will go to the first one, which runs on spot VMs, and three to the other one, which runs on regular on-demand VMs.

Okay, now what if the cloud provider starts to preempt the spot virtual machines on which these seven pods are running, killing the pods as a result? Well, after a configurable timeout, the Balancer notices that the instances on preemptible nodes are not coming back, and it increases the size of the second deployment in order to account for the replicas that are failing in the first deployment. And it will still keep the first deployment at size seven. Why? Chances are that you are running Cluster Autoscaler. In that case, Cluster Autoscaler will keep on trying to bring up nodes for those seven pods targeted at spot instances. Eventually it will succeed, the nodes will be provided and the pods started there. Then the Balancer will decrease the number of replicas in the second deployment, and the whole thing will be in the desired configuration again. If you want to autoscale the deployments, you point your HPA at the Balancer. The Balancer exposes the very same scale subresource as an individual deployment, so it will work out of the box with the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler.

So what is the status of the Balancer project? We reached an agreement about the API within the SIG. The code is almost done in the Google internal repository.
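Since the Balancer code is not public yet, here is a purely illustrative sketch of the proportional placement arithmetic described above (a 70/30 policy spreading 10 replicas as 7 and 3). The function name and the largest-remainder approach are my own illustration; the real Balancer controller and API may compute placement differently.

```go
package main

import "fmt"

// splitProportionally distributes total replicas across targets according to
// integer weights, giving leftover replicas to the targets with the largest
// remainders. It only illustrates the 70/30 example from the talk.
func splitProportionally(total int, weights []int) []int {
	sum := 0
	for _, w := range weights {
		sum += w
	}
	out := make([]int, len(weights))
	fracs := make([]int, len(weights))
	assigned := 0
	for i, w := range weights {
		out[i] = total * w / sum
		fracs[i] = total * w % sum
		assigned += out[i]
	}
	// Hand out any remaining replicas to the largest fractional parts.
	for assigned < total {
		best := 0
		for i := range fracs {
			if fracs[i] > fracs[best] {
				best = i
			}
		}
		out[best]++
		fracs[best] = -1
		assigned++
	}
	return out
}

func main() {
	fmt.Println(splitProportionally(10, []int{70, 30})) // [7 3]
}
```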
Google initially wanted to start it on GKE first, but changed their mind in flight, and it will be open-sourced in November, assuming that all of the open source code reviews go smoothly. So we expect the initial release sometime this year.

The next thing that SIG Autoscaling owns is the Vertical Pod Autoscaler. The Vertical Pod Autoscaler helps you to get the pod size right. It is based on the actual historical resource usage of the pods. It looks at CPU and memory usage and pays attention to out-of-memory events. It recommends the pod, or actually its container, sizes to keep the real usage within the requested capacity. So if a pod is using something like 95% of its current request, and maybe even occasionally going above 100%, VPA will increase the pod (or container) size. If the situation is opposite, it will decrease the pod size.

The Vertical Pod Autoscaler has also been around for a while, but despite that, the SIG managed to make a couple of important improvements. The biggest one is probably the ability to have multiple recommenders running at the same time. Each recommender, which recommends what the pod size should be, may have a different configuration or even a completely different algorithm, and you can decide which one to use by providing its name in the VPA object specification. To support this feature, you can now configure the percentiles used by the standard VPA recommender. While the default, the 90th percentile plus a little bit of buffer, works for quite a lot of users, you may want to increase it should your workload be more spiky and you care about latency, or maybe decrease it if you want to oversubscribe your nodes more. Soon we hope to have the ability to keep a fixed ratio between CPU and memory. We want to limit the direction of the updates, so that, for example, containers and pods only scale up. And we really, really hope to have the Kubernetes in-place pod update feature land, and then we will use it in VPA so that it doesn't restart your pods while performing the updates.

The last component I would like to talk about is Cluster Autoscaler. Cluster Autoscaler ensures that your pods always have a place to run. It provides new nodes for the pods that could not be scheduled and removes nodes that are not needed anymore. It doesn't use any metric. It uses the pods' declared requests and a lot of scheduling simulation to tell what would happen if some actions were made. Let's take a look at this in more detail. Here, we have four nodes with pods. If a new pod arrives, it can be placed on the third node. But if the situation is different and all nodes are kind of busy, then the green pod has no place to go. The scheduler marks it as unschedulable. Cluster Autoscaler watches for that signal. It runs some simulations and notices that if one extra node was added, then the green pod could go there. So it talks to your cloud provider and resizes the cluster accordingly. The new node shows up and the scheduler places the pod right there. Now let's take a look at a different setup. Here, we have two nodes that are not used to their full capacity. If we move the green pod to the third node, then the fifth node could be deleted without any problem, lowering your cloud bill.

So what's new in Cluster Autoscaler? The biggest change is making the scale-down process faster. It includes better handling of pending pods, which don't block scale-down anymore, and, most importantly, the ability for Cluster Autoscaler to handle multiple node drainings, or migrations, at the same time.
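Going back to the VPA recommender for a moment: the percentile knob described above boils down to picking a quantile of observed usage plus a margin. The following is only a rough illustration of that intuition with made-up sample numbers; the real recommender works on decaying histograms, per-container policies and OOM events, so do not read this as the actual VPA algorithm.

```go
package main

import (
	"fmt"
	"sort"
)

// recommendRequest picks the given percentile of observed usage samples and
// adds a safety margin, loosely mimicking "the 90th percentile plus a little
// bit of buffer" mentioned in the talk. Purely illustrative.
func recommendRequest(samplesMillicores []float64, percentile, marginFraction float64) float64 {
	sorted := append([]float64(nil), samplesMillicores...)
	sort.Float64s(sorted)
	idx := int(percentile * float64(len(sorted)-1))
	return sorted[idx] * (1 + marginFraction)
}

func main() {
	usage := []float64{120, 150, 180, 210, 240, 260, 300, 320, 380, 900} // one spike
	fmt.Printf("p90 + 15%%: %.0fm\n", recommendRequest(usage, 0.90, 0.15)) // spiky, latency-sensitive: higher percentile
	fmt.Printf("p50 + 15%%: %.0fm\n", recommendRequest(usage, 0.50, 0.15)) // oversubscribe more: lower percentile
}
```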
Coming back to scale-down: previously, Cluster Autoscaler could delete empty nodes in bulk, but if nodes were not that empty, like in the examples that I showed you before, it had to migrate pods from them, handling only one node at a time. With the changes that we are making to the scale-down algorithm, we hope to get a significant scale-down speed-up, and more savings for you, as the unneeded nodes will go away faster. We are expanding the pluggability of Cluster Autoscaler by allowing you to have both a gRPC cloud provider and a gRPC expander. We'll talk about the gRPC cloud provider in a moment. And we are working on better batch use case support, by integrating Cluster Autoscaler better with the Job API and Kueue, which was mentioned before in the SIG Scheduling updates.

So, as I said, we wanted to tell you more about the gRPC cloud provider, but unfortunately Diego, the author of the change, could not be here in person. So I have a video of him instead. Let me play it for you.

Hello, I'm Diego Bonfili, from Sysdig, and in this presentation I will talk about the new gRPC cloud provider. This is a plug-in system to implement cloud providers as a separate process from the Cluster Autoscaler. Let me first do a quick refresher on what a cloud provider is in the context of the Cluster Autoscaler, or CA from now on. The Cluster Autoscaler adds nodes when pods cannot fit on the current Kubernetes nodes, or removes nodes when resources are underutilized. The logic behind these scaling decisions is common for any environment, but at the end, the CA needs to interact with the specific underlying infrastructure: create new hosts, remove a VM, retrieve the list of the current instances, and so on. And this is done by calling the APIs of the specific cloud provider where the cluster is running. The Cluster Autoscaler supports many cloud providers, I think at the moment almost 30, and a specific implementation for each cloud provider is coded in the Cluster Autoscaler itself. This is abstracted by a couple of Golang interfaces and runs in the same process as the Cluster Autoscaler. So here, in the context of the Cluster Autoscaler, a cloud provider is the specific implementation that lets the Cluster Autoscaler talk to the cloud provider APIs.

What do you need to do at the moment if you are a cloud provider and you want the Cluster Autoscaler to work with your services? You need to fork the Cluster Autoscaler code, implement a couple of Golang interfaces with your custom logic, which most probably makes use of specific cloud provider APIs, and you have to change some pieces, of course, to integrate your cloud provider. If you then want to contribute back to the community and you want your fork to be merged back into the official Cluster Autoscaler, you must respect some rules, like you cannot add new dependencies at the vendor level, I mean in the go.mod file of the project. And this is because of the problem with version conflicts in transitive dependencies: they are hard to understand and create problems with version upgrades. And then, of course, you have to wait for a member of the Kubernetes organization to review your code if you are not an official maintainer. You can understand that it could be useful to have a pluggable cloud provider, something external to the Cluster Autoscaler core that implements only the things specific to your cloud provider and leaves the core logic, I mean the scaling decisions, in the Cluster Autoscaler.
With the cloud provider logic moved to a different service, there is no need anymore for a fork of the CA, if, of course, you cannot implement your cloud provider as an official cloud provider in the project itself. And this is good because it simplifies maintenance. I mean, the maintenance of forked projects is not always straightforward. For example, you need to keep track of upstream updates, you need to integrate them, and then you need to carefully vet them to understand if new dependencies break your fork. Also, now you can release new versions of your cloud provider whenever you want, without waiting for a new official CA release. You can use libraries that you could not use before, for example because they are not under the Apache license. And you can also use your own language of choice, if you want to avoid Golang for some reason. This is no different from many other Kubernetes components that are now structured as plugins: think of the CNI, the container storage interface, the Cluster API for provider implementers, and so on.

Cluster Autoscaler now has a new plugin system for cloud providers, so now you can build out-of-tree cloud providers. If you use it, the Cluster Autoscaler retains all the core logic for scaling, while the specific piece of code for a cloud provider is served by a separate service over the network. The communication between the CA and the external cloud provider service is performed over gRPC. Technically, the plugin system is yet another in-tree cloud provider. This new cloud provider, called the external gRPC cloud provider, implements the Golang interfaces like all the other in-tree cloud providers, but then it actually wraps these function calls and sends them over gRPC to an external service. So the core CA takes the part of the gRPC client here, and the external cloud provider service acts as the server.

Okay, we have a plugin system and now we want to use it. Let's talk about what you need to do to create a new cloud provider using this plugin system. Let's digress for a moment on some general requirements that all cloud providers need to meet, both in-tree cloud providers and the new out-of-tree ones created with the plugin system. First, in the Cluster Autoscaler there is a concept of node groups. To scale, the CA works with groups, not single nodes. This means that when it needs to add nodes, the CA actually chooses a group to scale up. So it's important that all nodes within a group have the same machine type, the same labels, and so on, and are in the same availability zone, for the CA to properly decide which group to pick. This does not mean that if a cloud provider does not provide group APIs, then it cannot be integrated in the CA; your implementation can fake those groups, for example. But it helps. When scaling down, instead, the CA deletes specific nodes in the node group. So a cloud provider must provide a way to delete a single node and also resize the group at the same time. And for this to work, there must be a way to correlate a Kubernetes node to the actual host at the cloud provider. Usually there is a node field for this, called providerID, where cloud providers add information to correlate Kubernetes nodes to cloud provider hosts. All these requirements I mentioned are important, because if you want to create a cloud provider, you will need to implement APIs that assume these concepts. You can see a summary of the RPC APIs on the left here. The proto file has docs describing what the single RPCs and messages are used for.
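As a concrete starting point for such an external service, here is a minimal, hypothetical Go skeleton: a gRPC server secured with mTLS, onto which the RPCs from that proto file would be registered. The generated package import path and the `RegisterCloudProviderServer` name are placeholders for whatever your protoc run produces from the CA's proto file, so treat them as assumptions rather than exact upstream names; the certificate file names and port are also just examples.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	// pb "<package generated from the Cluster Autoscaler gRPC cloud provider proto file>"
	// ^ placeholder import: use whatever package your protoc run produced.
)

// A server type would embed the generated Unimplemented...Server and implement
// the RPCs (listing node groups, increasing size, deleting nodes, ...) by
// calling your own cloud's APIs. Omitted here, since the names depend on the
// generated code.

func main() {
	// mTLS: present a server certificate and require verified client certs,
	// because these RPCs effectively allow creating and deleting nodes.
	cert, err := tls.LoadX509KeyPair("server.crt", "server.key")
	if err != nil {
		log.Fatal(err)
	}
	caPEM, err := os.ReadFile("ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	creds := credentials.NewTLS(&tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientAuth:   tls.RequireAndVerifyClientCert,
		ClientCAs:    pool,
	})

	lis, err := net.Listen("tcp", ":8086") // arbitrary example port
	if err != nil {
		log.Fatal(err)
	}
	s := grpc.NewServer(grpc.Creds(creds))
	// pb.RegisterCloudProviderServer(s, &server{}) // hook up your implementation here
	log.Fatal(s.Serve(lis))
}
```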
So, to write a cloud provider as a plugin: pick your language, create a gRPC server that implements that proto file, write the logic for your cloud provider, which most probably will in turn perform calls to the cloud provider itself, and expose the server. It is very important to use mTLS for this, even if you can switch it off for development purposes, because if you look at the RPCs, you are essentially giving the permission to create and delete nodes in your cluster to whoever is able to connect to this server. So please use mTLS in production environments.

So we now have a nice way to decouple cloud providers from the core CA. Here I report some things to know before using the plugin system. One thing to take into consideration is performance. With an in-tree cloud provider, calls are of course local, and so they are as fast as they can get. With the external gRPC cloud provider, calls now go over the network. Caching for RPCs has been implemented everywhere possible, of course, but still, at the moment, the plugin system has not been tested yet for very large clusters; think clusters with thousands of nodes. So take this into account. Another thing to know is that the external gRPC cloud provider has slight differences in some functions with respect to in-tree cloud providers. Keep in mind that if you want to scale from zero in a group, meaning groups with zero nodes, in some specific circumstances when you have mirror pods, the calculation performed by the CA to understand if a pending pod would fit a new node may not be correct. It could be slightly off, because the information about the full set of pods for a node is not available in the gRPC cloud provider. Another mild thing to know is that the function GetResourceLimiter is not available, but its use is really limited and almost no cloud provider implements it anyway. And don't be confused by some missing functions in the proto file with respect to the Golang interfaces: some functions are indeed deprecated in the Golang interfaces and so have not been implemented as RPCs here. Okay, we are at the end of the presentation. If you want to dig deeper, there are some links here, and you will be able to download these slides online, so you will be able to look at the links. That's it. I hope this presentation was useful and that you will use this new plugin system. Thank you for your time.

Okay, so our presentation is slowly coming to an end, so I would like to give you some more details about SIG Autoscaling, in case you have some ideas for improvements, questions, comments, or you want to contribute. We have meetings every Monday at 10 a.m. Eastern time, Detroit time, on Zoom. We have a Slack channel on the standard Kubernetes Slack, and most of our code sits in the kubernetes/autoscaler repository. And I would like to thank you for showing up this late for the presentation. That was really amazing. And now is the time for questions.

A question regarding the Vertical Pod Autoscaler. Does the Vertical Pod Autoscaler support custom metrics, for example from Datadog? Does it support...?

Sorry?

Datadog metrics on the VPA, the external metrics for the...

Whether VPA supports external metrics? No, it doesn't, but if you modify the recommender to pull these metrics from your Datadog site, the rest of the environment will be okay with it. So if you want to have external metrics, you'll need to contribute some code.

Okay, so we have to use just the regular metrics?

It gets metrics from the metrics server and stores them as a histogram, locally and checkpointed in etcd.
So we've got a snapshot. However, if you wanted to use some other source, then you would need to do it yourself. There's some code around for getting the metrics from Prometheus; I can probably point you to it.

You have another question, regarding Cluster Autoscaler, right?

For example, in the Cluster Autoscaler, for the minimum we put, like, four nodes, and the maximum equal to 10 nodes, and four nodes are using around 80%, but we put the threshold at 70%: if it is less than 70%, remove the node, right? So four nodes are using around 80% and the fifth node is using around 30%. How does the Cluster Autoscaler behave? Will it delete the fifth node, or will it try to move those...?

So as long as all your pods are scheduled, it will do nothing. It will pack your nodes completely, as long as the pods fit there. If a pod cannot be fitted onto your nodes and the scheduler marks the pod as unschedulable, then Cluster Autoscaler kicks in, analyzes your cluster, analyzes what your configuration is, and checks whether adding a new node will help. Adding a new node usually helps, but if you mistype the pod size, for example, and you put 400 instead of 400 millicores, then obviously your cloud provider will probably be unable to provide you a node with 400 CPUs, and Cluster Autoscaler doesn't even try to do anything.

Did you consider a CRD-based solution instead of the gRPC solution, for the Cluster Autoscaler, for providing support for different clouds?

A CRD... So, Cluster Autoscaler does a lot of simulation, and it involves a lot of calls to see what if. What if we added that node? What if we added another node? It treats the scheduler a little bit like a black box. It doesn't look closely at what is in the pod specification; it just shows the scheduler new nodes and checks whether the scheduler would place the pod there or not. So doing it with a CRD is either impossible or would take ages, doing the communication via the API server.

Yeah, so regarding the instance types available on the cloud providers: right now everything is hard-coded into the Cluster Autoscaler code and it has to be maintained. So for example, I use AWS, and AWS releases a few new instance types that I want to use; Cluster Autoscaler is not going to have support for them until somebody actually patches that into the code. Have you considered automated ways to actually call the providers to get the information about the instance types that are available, the number of cores, memory, et cetera, so that you don't have to depend on this manually updated list?

So it is up to the cloud provider implementers, as the people who coded it, how to handle it. On GKE, we have this process automated, but other cloud providers are for some reason not adding this type of support. If you're interested, feel free to contribute to the cloud provider that you are running on; we will definitely accept those types of patches. So we are not, like, against it; patches for getting the current configuration, current settings and so on are more than welcome. But, well, sometimes people implement the cloud providers in a simple way, and we cannot force them to make it work better.

I was aware that GKE was automatic, so...

Yes, GKE creates node pools automatically if you configure node auto-provisioning.

For the Cluster Autoscaler, we've seen no scale-down when there's one pending pod: it scaled up to a couple of hundred nodes, and then there was no scaling down because of one pending pod. You said that has changed in 1.24?

Yes.
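As a small aside on the mistyped-request example above: Kubernetes resource quantities make the difference between 400 and 400m easy to see. A bare "400" means 400 whole CPUs, while "400m" means 400 millicores, and a quick check with the apimachinery resource parser, which Kubernetes itself uses for these fields, illustrates it.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// "400" is 400 whole cores: no cloud offers a node like that,
	// so the pod stays pending and Cluster Autoscaler cannot help.
	big := resource.MustParse("400")
	// "400m" is 400 millicores, i.e. 0.4 of a core, which was intended.
	small := resource.MustParse("400m")

	fmt.Println(big.MilliValue(), "millicores")   // 400000 millicores
	fmt.Println(small.MilliValue(), "millicores") // 400 millicores
}
```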
Does that mean that you just completely ignore pending pods, or is it only pending pods that are marked as unschedulable, or...?

So we analyze the cluster more carefully, and we check whether removing nodes would not make the life of these pending pods harder. If a pod is completely independent and has no chance to run on the node that we are removing anyway, then we are okay with removing it. However, we don't scale down if the scheduler has not yet managed to process the pod and the pod is likely to benefit from those nodes being around.

But for instance, the case where I've got one pending pod left and I've got a hundred unused nodes: in the current version, you do not scale down any node, you leave the hundred nodes there. Has that changed?

Yeah, that is fixed. That is fixed.

Okay, thank you.

Okay, if there are no other questions...

Node groups. Karpenter is a different kind of autoscaler; they've completely dropped the whole idea of node groups. Has that been given thought for the Cluster Autoscaler?

So, yeah, we were thinking about adding it. Why do we have node groups? It is because Kubernetes in its early versions was running on node groups, and we have had Cluster Autoscaler since 2016 or something like that, and most of the cloud providers operate with the concept of a node group, node pool, autoscaling group, or whatever. The second thing is, in order to scale up a cluster, we need to know what the new node would look like. You can either have this knowledge inferred from the existing nodes, or you need to hard-code it. You can get rid of this whole concept and bring new nodes if you hand-code a lot of logic that figures out what would be a good node for that particular pod: what the cloud offering is, what nodes are supported, what the prices are, so that you don't get, like, the most expensive one, and so on. So a lot of very, very cloud-specific code needs to go in. And that may work if you have a group of developers dedicated to that particular project of bringing up this hand-coded autoscaler, and they are maintaining it, keeping it up to date and so on. However, when we have 30 cloud providers, starting from AWS to some that you have barely heard about, it's not a reasonable request to ask everyone to code that complex logic into their cloud provider. We want to have a more or less uniform experience across the clouds, and Cluster Autoscaler mostly works the same on Azure, GKE, AWS and some other clouds you probably haven't heard about. And yes, on GKE we made this effort of coding a feature that we call node auto-provisioning. It creates a new node pool for you, should the node pools that you have in the cluster not be enough, like too small, or not have the requested GPU. But that's a lot, a lot, really a lot of cloud-specific code, and even if we open-sourced it, it would be hard for the other cloud provider authors to benefit from, because they would need to do the very same coding, but specific to their cloud providers. So for that reason, Cluster Autoscaler is like it is. If you want the Karpenter experience, then please come to GKE, and if someone wants to devote their time to providing this node auto-provisioning experience on other cloud providers, we are very happy to give you support and tell you how to do it in Cluster Autoscaler, but the warning is that, well, it requires quite a bit of work to get it done properly.

Yeah, so AWS has chosen to just do it in Karpenter.

Sorry?
AWS has chosen to just do it in Karpenter; that's why they've created Karpenter.

Yes, they've chosen to, they put engineers on it, and, well, in theory Karpenter is open source, but it will take a lot of effort to make it work on another cloud provider, because they would need to implement this whole logic that is very, very specific to the cloud provider: where you create the cluster, how you expand it, how it is built, what the offering is, how to create new nodes and what they will look like. Like, a lot of code. I will keep my fingers crossed for Karpenter, but it will be hard for them to replicate it on other cloud providers.

So the Cluster Autoscaler is aware of node groups and has some awareness of cloud providers. For the HPA and the VPA, is there any, like, talk in the community of SIG Autoscaling about adding some support? And the reason I ask is because you may have a heterogeneous set of nodes that your application is running on, and because VPA provides a single recommendation for CPU and memory, it may not fit all of the nodes that your application is running on with the same request. So I'm just wondering if there is any discussion there about fixing that, or...

So the question is about integration of workload controllers with Cluster Autoscaler, or about how to run your workload in a heterogeneous environment?

Yeah, the latter. How to run your workload. So I guess either you could run your workload in a different way, so that you're only running on a certain node group, or VPA could have per-node-group recommendations. I guess that's kind of my question.

Okay, so running your workload on significantly different nodes will cause you problems, because the application on one node may run orders of magnitude faster than on the other, and that will confuse the Horizontal Pod Autoscaler about what is actually going on. So you can run your application on nodes that don't differ that much. We can probably tolerate, like, 10% of performance difference without any problems, but if one node is, like, three times better than the other, then the historical data gathered by the Vertical Pod Autoscaler will not be valid. HPA will be confused about whether you are above or below the target and whether it should give you more replicas or fewer replicas. It may actually shrink your replicas, leaving one that is on this poor node, super overloaded and causing some type of errors for customers. So at this moment, running workloads on very heterogeneous nodes is rather not recommended with autoscaling, and probably even without autoscaling too, because you need to size the pods somehow, and the sizes will be different for different types of nodes.

Okay, got it.

One last question here. Sorry, this might be kind of a newbie-ish question. In terms of HPA, we're currently utilizing it pretty well, but VPA is fairly new to us from a usability perspective. When it comes to HPA, a lot of GitOps pipelines, like Flux, run into a problem of three-way merges in terms of replication, where Flux will apply a static replica count and HPA will set a dynamic, different replica count, and you'll get, like, a "change has already been applied, please try again later" error on some syncs. Does that same issue ever occur from a VPA perspective, in terms of requests and limits, that you know of, or has it ever been reported at all?

Sorry, I'm afraid I didn't understand your question.

Oh, my fault.
So for things like Flux, when applying a specific workload or configuration, typically what ends up happening with the GitOps approach is that a sync from the GitOps repository will apply a static replica count, say one or two. HPA will scale that number dynamically, via the Deployments API, to, let's say, three or four, depending on capacity or usage. But what ends up happening sometimes is a three-way merge problem, where Flux will keep attempting to reapply a lower number than what HPA has set. I'm wondering if the same thing happens with VPA at all, with the Vertical Pod Autoscaler, with requests and limits specifically. Because I know what Flux specifically ended up doing to mitigate that was to ignore the replica count. So I'm wondering if that's something that other providers have talked about?

So, regarding updates in the Vertical Pod Autoscaler: the Vertical Pod Autoscaler doesn't update your deployment. It updates pods, in the admission phase.

Oh, I see.

We didn't do it the other way because updating the deployment would cause pod recreation, and that's not necessarily the thing that you would like to have at that moment. VPA has a couple of operating modes. You can have only recommendations, without any actuation. That's for your information: you can see what VPA would give to your pods and apply it manually, or completely ignore it.

I see.

The other option is to have it only at creation time. VPA gives you the recommended size when the pod is created, but it doesn't touch existing pods. So if something is running, it's running fine; we don't touch it even if we wanted to give it a little bit more CPU. And the third mode is automatic, where we update pods if they are outside of the reasonable values for pod sizes. So not only are they a little bit different from the recommended value, but different enough that we think it's worth it to update them. And even then, we respect things like pod disruption budgets, and we try to do it slowly. On GKE, we also have integration with Cluster Autoscaler. Updating sizes in the deployment would ignore pod disruption budgets and cause an immediate rollout of the new version, which would be quite disruptive, and definitely more disruptive than our current process. So, yeah, if you are running VPA, from time to time you should update your deployment, so that if VPA is idle, for some reason not running, or you want to have some idea of what the size of the deployment would be if created, you can take the values from the VPA object and put those in the deployment.

Gotcha. So to reiterate, it only messes with the pods from the deployment; it messes with the pods, leaving the deployment itself alone. I see, that provides a lot of context. Thank you.
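To close, a conceptual sketch of the point made in that last answer: VPA mutates pods at admission time rather than editing the Deployment. This is not the actual VPA admission webhook code, just the core idea, with hypothetical recommendation values; the real webhook also handles limits, recommendation bounds, and per-container policies.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// applyRecommendation overwrites a container's CPU and memory requests with
// the recommended values. Conceptually, this is what happens to a pod at
// admission time in the creation-time and automatic modes described above;
// the Deployment object itself is left untouched.
func applyRecommendation(pod *corev1.Pod, container string, cpu, memory resource.Quantity) {
	for i := range pod.Spec.Containers {
		if pod.Spec.Containers[i].Name != container {
			continue
		}
		if pod.Spec.Containers[i].Resources.Requests == nil {
			pod.Spec.Containers[i].Resources.Requests = corev1.ResourceList{}
		}
		pod.Spec.Containers[i].Resources.Requests[corev1.ResourceCPU] = cpu
		pod.Spec.Containers[i].Resources.Requests[corev1.ResourceMemory] = memory
	}
}

func main() {
	pod := &corev1.Pod{Spec: corev1.PodSpec{Containers: []corev1.Container{{Name: "app"}}}}
	// Hypothetical recommendation: 250 millicores of CPU and 512Mi of memory.
	applyRecommendation(pod, "app", resource.MustParse("250m"), resource.MustParse("512Mi"))
	cpu := pod.Spec.Containers[0].Resources.Requests[corev1.ResourceCPU]
	mem := pod.Spec.Containers[0].Resources.Requests[corev1.ResourceMemory]
	fmt.Println(cpu.String(), mem.String()) // 250m 512Mi
}
```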