Welcome to SIG Autoscaling. This talk is going to be both a quick introduction to Kubernetes autoscaling and a bit of a deep dive: we're going to cover some best practices for running Kubernetes autoscaling in production. My name is Joseph Burnett. I'm here with Guy Templeton, who is the co-lead of SIG Autoscaling. He'll take over in the second half and run you through some practical tips. So this is the agenda. I'm also going to touch on a couple of upcoming features, so you can get a preview of what's coming in the future. We'll save about 10 minutes for questions at the end.

Okay. So Kubernetes autoscaling, like the rest of Kubernetes, is divided into two levels: the workloads, which are the pods, and the infrastructure, which is the nodes. Workloads can scale out or in, meaning you make more or fewer of the same kind of pod, and they can also scale up or down, meaning you make the resource requests larger or smaller. The corresponding autoscalers are the HPA and the VPA, the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler. The same concepts of horizontal and vertical apply to cluster autoscaling as well. Cluster autoscaler scales a node pool out and in by making more or fewer nodes, and it's also capable of scaling vertically by selecting different node pools. So if you have node pools of different sizes in your cluster, it can make intelligent decisions about which one to use.

To make this concrete, I'm going to walk through a practical example. Here we have a happy little cluster with two workloads deployed: the blue pods and the purple pod, running on three nodes. Both of these workloads are autoscaled. Let's say the blue and purple workloads are autoscaled horizontally on CPU utilization and vertically on memory, and the cluster is autoscaled as well, so new nodes will be created when they're needed.

Suppose the blue service gets some more traffic. The HPA will notice that the average utilization is slightly higher than the target, figure out how many pods it needs to create in order to bring utilization back down to the target value, and cause those pods to be created. Now, the HPA doesn't actually make the pods here. What it does is look at the scaleTargetRef, which points at a scale subresource. It looks at the pods selected by the label, gets the current scale, and when it has a recommendation, when it thinks we need more, it just updates that scale. If the scaleTargetRef points at a deployment, for example, that deployment will turn around and adjust the size of its ReplicaSet, and the ReplicaSet will create the pods. The scheduler will then see these pods and schedule them onto a node, and the kubelet on that node will see the pods and start them. So the autoscaler is really just a controller around this one scale-subresource knob, but ultimately the cluster creates two more pods because the autoscaler thinks we need more. Here they are, and they get scheduled onto this node because there's room for them.

Then maybe a little more traffic comes in for the purple service. Same thing with the HPA: it notices we're a little above target, decides we need two replicas instead of one, and adjusts the deployment, and a pod gets created. But as you can see, there's no node for this pod. The scheduler is unable to place it anywhere.
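To make that flow concrete, here's roughly what an HPA like the blue service's could look like. This is a minimal sketch; the names and numbers are illustrative, not taken from the demo itself:

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: blue-service            # illustrative name
spec:
  scaleTargetRef:               # the object whose scale subresource the HPA updates
    apiVersion: apps/v1
    kind: Deployment
    name: blue-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # target average CPU utilization across pods, in percent
```

Note that the HPA never touches pods directly; it only writes the desired replica count to the Deployment's scale subresource, and the Deployment and ReplicaSet do the rest.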
And this is when cluster autoscaler kicks in. Cluster autoscaler responds to unschedulable pods, not to the underlying resource metrics. So cluster autoscaler will say: I clearly need to scale up the cluster. What it does is run a simulation: if I create another node, will this pod be able to be scheduled? If the answer is yes, then that's the thing to do. It can do this across multiple node pools, and with more than one node at a time. It actually has a little simulation of the scheduler inside it, because it's ultimately up to the scheduler where these pods go. So the cluster autoscaler says: yes, we need another node. It creates the new node, the scheduler puts the pod on that node, and the purple service has scaled out.

Over time, the VPA, which is paying attention to actual resource usage as compared to the request, may notice that the purple service is asking for way more memory than it needs. It builds a histogram of usage, figures out a conservative place to draw the line, and recommends a smaller pod size. The recommender makes this change, and then there's an updater process that comes through and carefully deletes the pods one by one so that they get recreated. So the updater deletes this pod, and a new pod is recreated, because the ReplicaSet will always try to replace its pods, and it comes back with the new, smaller size. This is how VPA actually causes pods to change their size. You may notice that it's a bit disruptive, because you actually have to delete pods. A really important part of using autoscaling effectively is being able to control and manage that disruption, and Guy is going to point you at some tools and best practices for managing disruption in a little bit.

Then the VPA updater deletes the next pod, and it gets recreated as well. And you'll notice there's actually an empty node now. Cluster autoscaler will notice this too: it's able to delete that node and scale the cluster in. So the cluster autoscaler goes out and in. These nodes are usually what you pay money for, so it's good that it can scale down, because that's what's actually going to save you money.

Now suppose that at the end of the day the blue service loses some of its traffic, and the HPA decides to scale down because utilization is really low, so some more pods get deleted. They won't be rescheduled, because they just aren't needed anymore. You may then be in a situation where you could actually fit more onto the nodes you have. Cluster autoscaler will notice this too, and it will drain an underutilized node, deleting its pods so that they get rescheduled onto the existing nodes. In effect, it does some defragmentation for you. So again, it can free up nodes and scale your cluster in. That's a quick view of the way vertical and horizontal autoscaling go in and out while the cluster goes out and back in again, and that's the summary of the problem space. So this is the end state of your cluster.
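Before moving on, here's roughly what the VPA driving the purple service's resizing could look like. A minimal sketch, assuming the VPA components and CRDs are installed in the cluster; the names are illustrative:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: purple-service          # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: purple-service
  updatePolicy:
    updateMode: "Auto"          # recommender computes sizes; updater evicts pods so they
                                # come back with the new requests
```

With updateMode "Off" you get recommendations only, which is a low-risk way to start.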
Before I hand things over, I wanted to mention a couple of upcoming features. I talked a little bit about CPU and memory utilization: when you use the HPA or VPA, you point at one of these resource metrics and say, I'd like you to shoot for this target. Now, there can be more than one container inside a pod, and an upcoming feature for the HPA allows you to set individual targets for each container. For example, if you have a sidecar that likes to use a lot of CPU, you don't necessarily want to scale on that sidecar, and it may overwhelm your calculations. Or if you have a sidecar with a very generous resource request that's mostly idle, it may water down your metrics. So take a look at this KEP: it's already gone through API review and it's going to land very soon. It's a nice way to focus the HPA on exactly which containers are most important to your autoscaling.

The second thing we're working on in SIG Autoscaling is graduating the v2beta2 API to stable. It's been around for quite a few years and it definitely needs to graduate. The ability to scale on multiple metrics and on custom metrics is all part of the v2 API, so it's been around for a while; we're working on ticking all the boxes to get it to a stable state. So that's a little bit about autoscaling. I'm going to turn it over to Guy now, who will walk you through some practical tips about running Kubernetes autoscaling in production. Thanks.

Just for a bit of context: I'm a senior software engineer at Skyscanner, so a lot of what we run is workloads that fetch prices from online travel agents and airlines, plus related processing and that sort of thing. This will come into play in a couple of the examples I give. As a lot of people watching will know, many of the hardest problems in computing aren't technical problems, they're people problems. This is one of the spaces where the complexity of all the options and tunable parameters and knobs that Kubernetes gives developers and cluster admins can overwhelm developers, especially when they're first getting started. A lot of the time developers don't know exactly which options they need to consider to scale workloads safely and performantly. So I'm going to go over a few options that I think developers should at least be aware of; and if you're a cluster admin, consider providing your developers sane defaults to work from, which they can override if they need or want to.

The first one is lifecycle hooks. In Skyscanner's case, we occasionally have some very long-running requests, either to fetch prices or to reach out to partners we make bookings with, and we don't want to drop an in-flight connection, especially if it's making a booking. So instead we want to set up our pods such that when they're terminated by a scaling action, wherever possible they gracefully terminate and finish the work they're currently doing before shutting down. This is where preStop lifecycle hooks come in. If you define a preStop lifecycle hook, which can be either an HTTP request or an exec, so a command run in the pod to call a binary, the kubelet, when terminating that pod, will execute that lifecycle hook before sending SIGTERM. So you can either set up your application to handle the SIGTERM gracefully, or you can make use of, say, an API endpoint that gets hit by the lifecycle hook, use that to gracefully finish any in-flight work and refuse new work, and then allow the kubelet to shut the pod down.
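Here's a sketch of what that can look like in a pod spec. The /drain endpoint, port, and names are assumptions for illustration, standing in for whatever your application actually exposes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: booking-worker                   # illustrative name
spec:
  terminationGracePeriodSeconds: 120     # raise from the default 30s if draining takes longer
  containers:
  - name: app
    image: example.com/booking-worker:1  # illustrative image
    lifecycle:
      preStop:
        httpGet:                         # could equally be an exec into the container
          path: /drain                   # assumed endpoint: finish in-flight work, refuse new work
          port: 8080
```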
These preStop hooks are blocking, and they block for up to the grace period. So if you need more than the default 30 seconds to gracefully terminate, you'll need to raise that in your pod spec; by default you get 30 seconds for the hook and shutdown to complete. You can also make use of what are called postStart hooks, and again these can be HTTP requests or execs into the pod. They're not guaranteed to run before the container's entrypoint, but if you need to do some sort of environment-dependent setup work, you can make use of them.

The next thing is liveness and readiness probes. These are fairly key if your workloads need to do some sort of runtime compilation or similar on startup, and you want to ensure you're giving your users the best possible experience. You can use liveness probes to mark your pod as alive and healthy so the kubelet won't restart it, while using readiness probes to state that the pod is not yet ready to receive traffic. That's particularly useful in combination with the HPA: when a pod is not marked ready, say because it's doing CPU-intensive startup work, the HPA won't consider that pod's metrics until it is marked ready. That ensures the HPA doesn't see new pods come up, use a lot of CPU, and go, oh, I need to scale up again.

The next thing is pod annotations. Say you have a machine learning workload where, if the pod is interrupted by a scaling action or by the cluster autoscaler trying to bin-pack, you'd effectively lose work or data. In that case you can mark pods with the cluster autoscaler's safe-to-evict annotation, cluster-autoscaler.kubernetes.io/safe-to-evict. If you set it to "false", then when the cluster autoscaler is considering bin-packing, it will notice that the node has a pod on it which is not safe to evict and won't consider that node as a candidate for scale-down. So you can spin up workloads marked safe-to-evict=false, and then update or remove that annotation at the point where they're safe to evict again, ensuring your workloads aren't interrupted by the cluster autoscaler. It's important to note that this doesn't prevent all interruptions: hardware failure or manual deletion of nodes will still impact that pod, but the cluster autoscaler won't interrupt it. And vice versa, you can set the annotation to "true" to override some of the behaviors where the cluster autoscaler would normally treat pods as not safe to evict. For example, if a pod is using local storage, the cluster autoscaler by default considers it unsafe to evict, because it doesn't know what data the pod is writing to the local disk and doesn't want to cause data loss, so it won't try to move that pod around. If you can tolerate the loss of that data, you can mark the pod yourself as safe to evict, and the cluster autoscaler will take that into consideration.

Pod disruption budgets are probably one of the most important things here. Going back to Joseph's example of the VPA recreating pods to change their resources: the cluster autoscaler and the VPA both respect pod disruption budgets. This is a really granular way for service owners to define how much disruption their workload can tolerate.
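A minimal sketch of a pod disruption budget; the name, selector, and percentage here are illustrative:

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb        # illustrative name
spec:
  maxUnavailable: "20%"       # at most 20% of matching pods voluntarily disrupted at once
  selector:
    matchLabels:
      app: my-service         # illustrative label
```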
So if a given service can tolerate, say, 20 percent of its pods being moved around and unavailable at any one time without harming the user experience, a pod disruption budget like the one above is the way of defining that, and the cluster autoscaler and vertical pod autoscaler will take that budget into account and never perform a voluntary disruptive action that would cause it to be exceeded. It's important to note that this is the way to define this, so that other components respect how your service wants to be moved around. Deployment rollout strategies, a coarser setting people may be more familiar with, are not respected by these tools; that's not what they're set up to look at.

And finally for this side, pod priorities. These allow you to decide which workloads should be prioritized when your cluster doesn't currently have enough capacity for all the pods. Think back to Joseph's example again, where one of the services scaled up and there wasn't enough room in the cluster for all of those pods. If the workload that had scaled up was of higher priority than some of the existing pods, those existing pods could have been evicted and the higher-priority pods scheduled in their place. So you can effectively say: if there's not currently enough room on the cluster, I want to prioritize this user-impacting workload, and I'm okay with this background task workload, which I've marked as lower priority, being evicted for a period of time. The cluster autoscaler would still kick in once those lower-priority pods were unschedulable and scale up the cluster so they could be scheduled again, but this lets you get the higher-priority workload scheduled faster.

Now, horizontal pod autoscaling. There are a few things here, especially around making sure you optimize your scaling with respect to cost. A service owner could misconfigure their horizontal pod autoscaling rules, maybe with a metric that never quite reaches its target utilization, which can lead to runaway scaling. One way of handling this is namespace-level resource quotas. These allow you to say, across a number of dimensions, pods, CPU, and memory, what the maximum resources are that a namespace can ever make use of: never let pods in this namespace take up more than a hundred CPU cores, say. The horizontal pod autoscaler will still work, it will still try to increase the desired capacity of whatever its target is, but the resource quota will prevent new pods that would cause it to be exceeded from being created. So it's really important for preventing any runaway scaling that could cost you hundreds or even thousands of dollars; there's a sketch of one just below.

Then there are metric sources for non-resource-metric-based autoscaling. This has been covered in great detail in previous SIG Autoscaling KubeCon talks, but as well as resource metrics, you can scale on custom or external metrics. In a lot of cases there will be a better metric than, say, CPU or memory utilization to scale on, such as the number of HTTP requests in flight or something similar, and that's a way of optimizing the scaling far more than CPU can, for instance.
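And here's that resource-quota sketch. The namespace and limits are illustrative:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota            # illustrative name
  namespace: team-a           # illustrative namespace
spec:
  hard:
    pods: "200"               # cap on the number of pods in the namespace
    requests.cpu: "100"       # never let pods here request more than 100 CPU cores in total
    requests.memory: 200Gi
```

A misconfigured HPA can still ask for more replicas, but pod creation beyond these limits is simply rejected.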
We've also got pod affinities. If you want to spread your pods out, or schedule workloads together for best performance, you can set up pod affinities and anti-affinities, either node- or pod-based, to try to spread pods out and improve fault tolerance. Pod affinities are quite coarse, though, and there's a newer concept coming that's somewhat better for this, which I'll come back to in a moment.

As Joseph already mentioned, beware of the current behavior of resource metrics: they're calculated across the entire pod. That's particularly an issue if, for instance, you're injecting sidecars, say a service mesh or logging containers or something like that. If you've got small containers and you're running, say, Istio, injecting an Envoy sidecar, and that sidecar is doing a lot of work, you can end up causing unintended scale-up, particularly if your developers aren't watching for it.

And finally, the more fine-grained version of pod affinities: topology spread constraints. These are only in beta as of 1.18, so I'm guessing a lot of people won't have had a chance to play around with them yet. They're implemented as a scheduling plugin, and they allow you to define either hard or soft constraints for scheduling pods over different failure zones, whether that's a host, a rack, or an availability zone in the cloud. They provide more flexibility than pod affinities and anti-affinities and allow you to balance your pods as evenly as you can over as many fault zones as you can. For an example of why you'd want this, you can see this graph, where every color represents a different instance type, for one service at Skyscanner. Despite the fact that Skyscanner currently supports over 20 instance types in our clusters, over 50 percent of this service's pods are scheduled on just two instance types, increasing the potential impact of losing one or both of those instance types to a spot outage or something similar. And this is without using spread constraints; we'd really like to make use of them to limit behavior like this going forward.
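As a sketch of what spread constraints can look like, here's a pod spec spreading replicas over zones and instance types. The names and labels are illustrative; topology.kubernetes.io/zone and node.kubernetes.io/instance-type are the standard well-known node labels:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-service-pod                             # illustrative name
  labels:
    app: my-service
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone       # spread across availability zones
    whenUnsatisfiable: DoNotSchedule               # hard constraint
    labelSelector:
      matchLabels:
        app: my-service
  - maxSkew: 2
    topologyKey: node.kubernetes.io/instance-type  # spread across instance types
    whenUnsatisfiable: ScheduleAnyway              # soft constraint: prefer, don't require
    labelSelector:
      matchLabels:
        app: my-service
  containers:
  - name: app
    image: example.com/my-service:1                # illustrative image
```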
Vertical autoscaling next. Memory-based vertical autoscaling isn't suitable for every language. I'm being a bit facetious here, in that JVM languages can now be tuned through JVM parameters and so on to give Kubernetes a far better view of the actual memory usage of a pod, but it takes a lot of expertise, so it's not simple. I'd suggest it's probably not worth heading down that rabbit hole unless you really know what you're doing in terms of tuning JVM heap parameters and the like. It can be done, but most of the time I'd suggest it's probably not worth it.

Also, beware of combining horizontal and vertical pod autoscaling on the same metric. If you run, say, horizontal and vertical pod autoscaling on memory on the same workload, this will generally lead to unintended behavior: the two will end up fighting with each other, and generally one or both will end up confused by the resources changing underneath it. You can combine horizontal and vertical pod autoscaling on different metrics, but don't try doing both of them on the same metric at the same time.

And finally, ensure resource policies are set. These play the same role for the vertical pod autoscaler that resource quotas play for the horizontal pod autoscaler: they prevent runaway scaling by saying, this is the maximum size I ever want a given container within this target to scale to. So you can make sure the vertical pod autoscaler never gets into a situation where it's scaling a given container or pod up and up, to the point where you're getting ever bigger nodes and costing yourself money again.

Finally, cluster autoscaling. The number one tip I have is to prioritize node startup time. This not only has the benefit of your cluster scaling up faster, so pods go from unschedulable to scheduled faster, it also allows you to tune a number of the cluster autoscaler's parameters down from their defaults, so that the cluster autoscaler is quicker to realize when nodes have developed a fault, whether that's due to the kubelet dying, hardware failure, or something similar.

Also be aware that the simulation Joseph mentioned inside the cluster autoscaler, where it asks, if I bring up a new node in this node group, what will it look like, what will its resources be, doesn't always match reality. You can occasionally see instances in the cloud where, for example, the available memory varies slightly from node to node, and the cluster autoscaler thinks it can schedule a pending pod on a new node in a given node group, but that doesn't actually hold when the new node is brought up.

And always ensure the system and the kubelet have enough resources reserved on your nodes. You can configure how many resources are reserved for the kubelet and the system to make sure they keep operating. If, for instance, you're bringing up really big nodes and scheduling a load of pods on them that all try to do a lot of CPU work at startup, and you've not set these reservations high enough, you can end up starving the system or the kubelet, effectively killing off the kubelet through resource starvation. At that point all the pods that were just scheduled onto that node and did CPU work become unready, potentially get killed, and try to be scheduled onto new nodes, and you end up in a cycle of doom, giving you a very bad afternoon as you slowly kill more and more of the cluster through resource starvation.

That was pretty much it in terms of best-practice tips. We've got a couple of links to previous SIG Autoscaling talks, including KubeCon EU earlier in the year, where there's some good stuff on custom metrics and other bits, and a couple of links as well to some nice blog posts about best practices for autoscaling. Thank you very much, and any questions?

Hello, Joseph Burnett here. I'm here for some live Q&A, and we've gotten quite a few good questions already in the chat. Guy and I have both been going through there answering as many as we can, but I'll take a few now. Let's see here; sorry, just give me a moment. So there was a question: we would like to scale based on custom metrics; our team needs to be able to scale based on the Kafka lag; as mentioned several times, right now we have to overprovision our Kafka consumers using the resource metrics. So you need to install an adapter in your cluster which is able to fetch those metrics in. For example, in GKE you can install the Stackdriver adapter; wherever that Kafka lag lives, you can install an adapter to get at it and set it up as an external metric. You can target an average value and just say: for whatever the lag is, we'd like this many pods for this number of undelivered messages.
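A minimal sketch of what that HPA could look like, assuming you've installed an adapter that exposes the lag as an external metric; the metric name and labels here are illustrative and depend entirely on your adapter:

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: kafka-consumer                  # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kafka-consumer
  minReplicas: 1
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: kafka_consumergroup_lag   # assumed name exposed by your metrics adapter
        selector:
          matchLabels:
            topic: bookings             # illustrative label
      target:
        type: AverageValue              # lag per pod; a plain Value target keeps scaling in a feedback loop
        averageValue: "5"               # aim for ~5 undelivered messages per pod
```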
Take a look at the external metrics docs, and that should be something that can help; it's definitely possible to do. And if you have questions about it, stop by the SIG Autoscaling Slack channel on the Kubernetes workspace.

Okay, let me go on to another question; hopefully that answered that one. Carl asked: if a target raw value is set, the raw metric values are used directly; the controller then takes the mean of utilization or the raw value; could you please elaborate? It sounds like Carl's talking about external metrics, so maybe I'll take a moment to clarify something I think is kind of confusing about the API. When you specify a target for external metrics, you can say value or averageValue, and these two behave differently. When you say, I would like to target this value for this external metric, the HPA, by default every 15 seconds, will look at where the metric is now and where you want it to be, and it will scale out or in to try to achieve that, and it will keep doing that. So if the metric doesn't change, it'll keep scaling up, up, up or down, down, down, whichever way you need to go, in a feedback loop. However, if you set averageValue, not value but averageValue, then it actually reconciles directly to the metric, without regard to how many pods there currently are: the desired replica count is just the metric total divided by the per-pod target. So if, for example, you target an average value of five, say five undelivered messages per pod on an undelivered-message queue, you won't get more pods if the value doesn't change; you'll just keep reconciling to the same number. I hope that answers the question. It's a little subtle, but it can be clarified by looking at the examples; if you're reconciling to undelivered messages, averageValue is the one you want.

Okay, Guy, were there any you wanted to pick out and answer? I was just having a quick look. There was one about whether there are recommendations for adding cluster autoscaler annotations to pods. The way I've achieved this in the past is giving the pods in a deployment a service account of their own and allowing that service account to add and remove annotations on those pods. The effect is that they're only allowed to modify themselves, and that way they can implement that. But I think that's pretty much all the time we have for answering questions on the call; however, we're going to continue trying to answer any questions people may have had, or that we haven't managed to get to, on the KubeCon maintainer Slack channel. So if you've got any questions that we didn't answer, feel free to post them over there and we will try to get to them. Thank you very much.