Welcome to the KubeCon 2021 North America SIG Autoscaling presentation. My name is Joseph Burnett. I'm a contributor working on horizontal pod autoscaling, and I'm going to walk you through a few basics of Kubernetes autoscaling today. I want to tell you a little bit about the SIG as well. I'll give you an example of what Kubernetes autoscaling does, so you can understand some of the terminology and how to use it. Then I'm going to talk for a few minutes about some upcoming features, and I'll save some time at the end for questions.

So who are we? SIG Autoscaling has the scope of horizontal and vertical pod autoscaling: changing the size of pods vertically, or making more or fewer of them horizontally. Cluster-proportional system component autoscaling is a very simple autoscaler that adjusts the number of pods in a deployment based on the size of the cluster; kube-dns uses this, for example. SIG Autoscaling is also responsible for the cluster autoscaler, which sizes the cluster itself up and down. I'll talk a little bit about how those interact with each other in a few minutes. The chairs of SIG Autoscaling are Guy Templeton, who works at Skyscanner, and Marcin Wielgus, who works at Google. There's a link here to the community page, which should be linkable from the PDF uploaded with this presentation. It will tell you about the SIG meetings and everything else about this special interest group.

Okay, so let me start with a quick overview of the autoscaling space in Kubernetes. There are two dimensions here to consider, horizontal and vertical, and they are split across two layers, pods and nodes, or workload and cluster. HPA scales in and out: as I mentioned, it makes more or fewer pods based on metrics that you pre-define. You can scale on more than one metric if you want, for example CPU utilization plus maybe a Pub/Sub queue length or something. Vertical pod autoscaling is for workloads which don't necessarily scale horizontally as well, or where you would prefer to give more or fewer resources to a pod to adjust for usage, or where you just don't know what to request. VPA can give you recommendations in a read-only mode, or you can tell it to go fully automatic and resize your pods for you. Cluster autoscaler does the same kinds of scaling in and out, making more or fewer nodes in any of the node pools that are configured. Vertical autoscaling in cluster autoscaler is more about selecting the right node pool: say you have a pending pod, where should it go? It knows not to scale out a node pool whose VMs are too small; it needs to scale out one that's big enough for the pod. So there's an aspect of vertical autoscaling in cluster autoscaler as well.

Now I'm going to walk you through a quick example cluster, what it might look like when you have these autoscaling systems engaged. There are two workloads here. One of them is the blue workload, which is configured to scale horizontally. And there's a purple workload, which is configured to scale both horizontally and vertically, maybe vertically on memory and horizontally on CPU, or something like that. Side note: it's not a good idea to scale horizontally and vertically on the same metric, because you'll have two systems trying to optimize the same value, and that can produce undefined behavior. I'm also going to show you how cluster autoscaler adds capacity and brings it back in, which is actually a really important part.
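To make that concrete, here is roughly what the HPA for the blue workload could look like. This is just a minimal sketch; the deployment name, replica bounds, and the 60% CPU target are made up for illustration.

```yaml
# Hypothetical HPA for the "blue" workload: keep average CPU utilization
# around 60% by adding or removing pods in the Deployment.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: blue
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: blue
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 60
```

You apply it like any other object, and the HPA controller takes it from there.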
So let's suppose that you get some extra traffic for this blue service. The HPA sees that you are consuming slightly above the target average; maybe you were shooting for 60% and you're up at 75 or 80% CPU. So it figures out how many more pods you would need in order to bring that back down to your target, assuming that all the pods pretty much do the same thing, and it turns around and increases the desired replicas. The way that it actuates this is by updating the scale subresource of the target. So you can point HPA at anything that implements the scale subresource, which includes, of course, deployments and replica sets. But you can even point it at custom resource definitions, CRDs. As long as you implement the scale subresource, HPA can scale it; I'll show a quick sketch of that in a minute. There are some conventions you should follow, like exposing the correct label selector for the pods that you create, but it's pretty flexible in that respect. So HPA doesn't create the pods itself, but it does tell the underlying system to create more. The blue service scales out, the deployment scales up, the replica set scales up, pods are created. They get scheduled on the nodes that have capacity. Pretty normal.

Now suppose that the purple service also gets some traffic and it scales up too. As you can see, there's not enough room for that purple pod on that third node there. So it gets created but remains unschedulable, and this is the first signal that cluster autoscaler responds to. Cluster autoscaler is going to see this unschedulable pod, and it's going to create a new node to put it on. You can see it's just got this flat node pool here; it made another node like the others by scaling up the underlying resource, the VM management system. And this pod immediately gets scheduled. That's great, everything's happy.

Now, let's say that VPA kicks in. VPA is a little slower-running than HPA. It spends its time looking at the usage over time and building a histogram, and it picks a conservative estimate of what it thinks you probably need based on historical usage. VPA has decided that the purple service is way over-provisioned on memory, so it is recommending a much smaller size. VPA is actually divided into three different systems. One is the recommender, which is responsible for this observation and for recording its recommendation. Then there's an updater. The updater is an active process that looks for pods that don't match their recommended resources, and it's responsible for going through and deleting those pods. One of the most important aspects of the VPA updater is that it doesn't delete them all at once; it's very careful to delete them gradually and to respect some reasonable limits. So it deletes a pod, not all of them, just one, with the expectation that it will get recreated, which is how Kubernetes works. And when it gets recreated, there's a third system, an admission controller, which rewrites the resource requests for that purple pod to make it smaller. So it gets scheduled at the correct size, and that is basically how VPA actuation works. The same thing is going to happen to the other pod: the updater keeps moving along, the pod gets deleted, boom, rescheduled. It actually fits really nicely inside this second node here.
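Here's that quick sketch of the scale subresource I promised. This is roughly what it looks like to expose it on a CRD so that HPA can target your custom resource just like a deployment; the group, kind, and field paths are made-up examples.

```yaml
# Hypothetical CRD exposing the scale subresource so HPA can drive it.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com
spec:
  group: example.com
  names:
    kind: Widget
    plural: widgets
    singular: widget
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              replicas:
                type: integer
          status:
            type: object
            properties:
              replicas:
                type: integer
              selector:
                type: string
    subresources:
      scale:
        specReplicasPath: .spec.replicas      # HPA writes the desired replica count here
        statusReplicasPath: .status.replicas  # your controller reports actual replicas here
        labelSelectorPath: .status.selector   # the label selector convention mentioned above
```

Your controller is then responsible for creating pods that match that selector; HPA only ever touches the scale subresource.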
Now, back in our example cluster, you notice we ended up with an unused node, which is great, actually, because cluster autoscaler periodically scans the cluster. It'll see that this node is empty and just delete it, and our cluster scales in.

All right, suppose that now the day is over, traffic is going down, and the blue service gets less traffic. HPA notices that the observed utilization is lower than the target; we're not using as much CPU as we would like to on average. So it will again figure out how many pods it needs to hit the target and adjust accordingly. It scales down by three pods, and then the underlying deployment or replica set deletes three pods. There they go, and you're back on target. That leaves one pod on this node, and as a human you can see there's room for it on that third node there. So cluster autoscaler is going to kick in after a while and do some consolidation, some defragging. The first thing it does is taint the node it wants to get rid of, to make sure nothing new shows up there. Then it evicts the pod; the pod goes away and comes back as usual, and it gets rescheduled where it can, which is on that third node. So now again we have a node that is empty, and cluster autoscaler can delete it. And then we get this final state.

So you can see how both of these systems scale in and out. I didn't show you multiple node pools, but cluster autoscaler is capable of selecting from different sizes of nodes. And you can see how they interact with each other. Cluster autoscaler only responds to unschedulable pods; it doesn't concern itself with utilization, all the custom metrics you could use for HPA, or the histogram of usage like VPA. It's just thinking about giving every pod a home, defragging, and removing unused nodes. So that's essentially Kubernetes autoscaling in a nutshell. Now, if there were a cluster-proportional autoscaler here, it would be creating more pods when there are more nodes and fewer pods when there are fewer nodes; it's a very, very simple algorithm, used for system components precisely because it is so simple. It's sort of a baseline.

Okay, so let me take a few minutes and talk a little bit about some upcoming features. HPA has actually been around for a long time. The first version, HPA v1, autoscales only on CPU utilization, which is what most people use. It's the easiest thing to use: it changes quickly, it's a compressible resource, and you already know how much you have. So it's easy to configure; you just give it a utilization target as a percentage, and you already know what the capacity is because you know what your resource request is. It's a really great choice, but it's not sufficient to cover every use case. So HPA v2 started with v2beta1, and there is a v2beta2 with a slightly different structure; both of them are roughly equivalent. It was created in order to support multiple metrics and custom metrics. The structure of the HPA changed a little bit: inside spec in the HPA object, there's a metrics field, and metrics is a list, a list of all the metrics you want to autoscale on. The HPA algorithm will compute a recommendation for each one of those metrics completely independently.
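For example, a v2 HPA scaling on CPU plus an external queue-length metric might look roughly like this. It's a sketch: the external metric name and the targets are invented, and it assumes you have a metrics adapter serving that metric.

```yaml
# Hypothetical multi-metric HPA (autoscaling/v2beta2). The controller computes
# a desired replica count for each entry in spec.metrics independently.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: External
    external:
      metric:
        name: queue_messages_ready   # assumed metric exposed by a metrics adapter
      target:
        type: AverageValue
        averageValue: "30"           # aim for roughly 30 messages per pod
```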
So CPU says we probably need seven, your custom metric says we need three, and your other custom metric, or memory, or whatever, says you probably need ten. Then it will select the maximum from those. It's sort of optimized for safety, to give you enough capacity, so it selects the maximum, which would be that ten-replica recommendation. And then it goes into a phase where it applies some limitations. It limits you to a min and a max; you can configure those in the HPA. And you can now also configure how long you want to wait to scale up or scale down, and what rate limit you want to apply.

So let me give you an example. HPA v1 has an algorithm called scale-down stabilization. You want HPA to be responsive, right? Traffic comes in, CPU goes up, and you really want to give it capacity as quickly as possible. It's sort of optimized for services, not necessarily for batch; I'll get to that in just a minute. So you want to be able to scale up right away, you want to react quickly: you see the recommendation go up, you go up immediately. But you also don't want to flap, because the metrics go up and down, and you don't want to keep creating and deleting pods, because that creates a lot of churn. So in order to prevent that churn, there's a stabilization period during which the maximum recommendation over that period is selected. You have to wait and be sure that you really want to scale down. The default is five minutes. So for example, if the load goes up, wiggles around, and then comes back down again, the HPA is not going to scale down until five minutes afterwards; it just looks over the last five minutes and selects the maximum recommendation for that period of time. This is actually a really great algorithm for most use cases, but again, it's not sufficient for all of them.

So in HPA v2 there's a new feature called behavior. Also in spec, there is a behavior field, and in behavior you can configure scale-up and scale-down behavior symmetrically, including the stabilization period. The default, of course, is five minutes of scale-down stabilization. The default limit for scaling up is actually a rate, which is the maximum of either four pods or doubling the replica count, so if there's a really quick spike in the metrics, it doesn't overreact. Those defaults were previously hard-coded in the HPA; now you can configure them. Likewise, you can configure a scale-up delay if you want. An example use case might be processing batch work off of a Pub/Sub queue. Maybe you have two or three pods, and you get a whole bunch of messages all at once. In order to make good use of your resources, you might want to wait and see how many of those messages the pods you already have can get through before you scale up; this was sort of the canonical use case in the KEP. Maybe those two pods can handle all hundred of those messages, even though if you were running at that rate continuously you would want fewer messages per pod. So scale-up stabilization is a good thing to be able to configure, and scale-down rate limits as well. Scale-up rate limits are kind of obvious, because you don't want to overreact. But scale-down rate limits are really useful too, just to make sure that when traffic drops, you come down off of that peak gradually. Maybe you want to smooth out the spikes, or maybe you want to make sure that capacity really isn't going to be needed again before you give it up. Or maybe you simply have limitations on how fast you can get rid of pods; maybe there's a database connection you don't want to overwhelm, whatever it might be.
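Here's roughly what that behavior block looks like inside the HPA spec in the v2 API. It's a sketch: the scale-up policies mirror the defaults I just described, and the scale-down rate limit is only an example of coming down gradually.

```yaml
# Illustrative spec.behavior fragment for an HPA. Values are examples, not prescriptions.
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react immediately on the way up
      policies:
      - type: Pods
        value: 4                        # add at most 4 pods per period...
        periodSeconds: 15
      - type: Percent
        value: 100                      # ...or double the replica count per period...
        periodSeconds: 15
      selectPolicy: Max                 # ...whichever allows the larger step
    scaleDown:
      stabilizationWindowSeconds: 300   # wait five minutes of lower recommendations first
      policies:
      - type: Percent
        value: 10                       # then shed at most 10% of the pods per minute
        periodSeconds: 60
```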
So anyway, HPA v2 has been around for a long time in beta, and we need to graduate it to stable. We've been working on this for a little while, mostly through community contributions: a bunch of people have chipped in to write the KEP, to create end-to-end tests for the new features, to fix a few bugs, and to create the new API, which is really exciting. It's getting along, and I think it's almost ready to commit; there are just a few more comments left on it. And I think that will be the first v2 API in Kubernetes, which is cool, as far as I know; correct me if I'm wrong. HPA v1 will stick around for a while, I mean, it's GA. V2 is going to be there with the next release of Kubernetes, and then we're going to start deprecating the v2beta versions. The structure is pretty much the same; there are only a few small tweaks that were really necessary. For example, you can configure multiple policies for scaling limits, say an absolute number of pods and a relative number of pods, a count and a percent. And in order to choose which one should be controlling, there's a selectPolicy enumeration, whose values are min and max. If you're scaling up, max selects the larger step, the max of four pods or doubling. But if you're scaling down, max actually gives you the smallest number of pods, because it means the maximum change, not the maximum count. Early users of this feature were confused by that, so we've changed the values of that enumeration. That, I think, is the most significant change; it's mostly the same as the beta API, and it should be coming soon.

There's another feature coming up in the VPA space: support for multiple recommenders. Just as HPA needed to be made more flexible to support multiple use cases, and HPA was mostly extended through metrics, VPA is going to be extended largely through the recommender engine itself. I mentioned already that VPA builds a histogram and chooses a conservative estimate, but a lot of that depends on the workload; there are a lot of edge cases that might call for different defaults. So VPA is going to get a discriminator: a new field, which is a struct where you provide the name of the recommender that you want to handle this object. This is a common pattern for when you have multiple controllers for a single object: you pick a field as the discriminator, and the default implementation handles empty or zero values. Same thing here; the default VPA recommender will handle objects where this field is not present. But if you want to add a custom VPA recommender, you create a controller that handles objects carrying its name and ignores anything that doesn't. Then you create a VPA object with that discriminator targeting your recommender, and the rest of the pipeline stays the same. The updater will see that the pod doesn't match the recommendation, and since the API is the same, it doesn't matter which recommender produced it. Then the admission controller will apply those updates when the pods are recreated. So it's a really nice extension. I have a link here to the KEP and a couple of pull requests that are implementing it.
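Based on that proposal, a VPA object pointed at a custom recommender might look roughly like this. The recommender name here is hypothetical, and the exact field shape could still change before the feature lands.

```yaml
# Hypothetical VPA using the proposed recommender discriminator. Omitting the
# field means the default recommender handles the object, just like today.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: purple
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: purple
  recommenders:
  - name: my-custom-recommender   # only the controller answering to this name acts on this VPA
  updatePolicy:
    updateMode: "Auto"            # updater may evict pods; admission controller resizes them
```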
That feature is also nearing completion, so it should be committed soon and available in an upcoming release of Kubernetes. So those are two interesting upcoming features that you can plan to take a look at. And that concludes most of my content.

I wanted to give you a few pointers on how to get involved in the autoscaling community. We have a Slack channel, #sig-autoscaling, where you can just drop in and ask general questions: how do I use this, why is this thing doing that, I have an idea, what do you think? We're really quite friendly. People are asking questions all the time on there, and there's always somebody that jumps in and answers. So it's a great place to come and get plugged in. There's another Slack channel, #sig-autoscaling-api. It was created partly to coordinate the HPA v2 changes, because there are a lot of different contributors and I wanted to have a long-running thread for it. So for API-related autoscaling stuff, the topic there is currently the v2 GA, or stable, HPA. Come check that out too. And of course, it's always nice to just talk face-to-face, or virtually, and bounce ideas around and see people. So there's a regular weekly meeting at 4 o'clock, 16:00, in the CET time zone, and there's a link there to the Zoom meeting and to the agenda. If you want to talk about something at the meeting, just go to the Google Doc linked under the agenda, put your name there, and say, hey, here's something I want to talk about. In the SIG meeting we usually just go in order from top to bottom, and there's usually plenty of time to ask additional questions too. So come see us. We're friendly, we're interested in new ideas, and we're interested to see how you're using these new features. It's a great community. So that is it for my presentation. I hope you enjoyed this quick overview of autoscaling and some upcoming features, and I hope to see you online. Bye.