So welcome to my talk, Autoscaling Kubernetes Deployments: A Mostly Practical Guide. Hopefully this is the one you actually meant to go to. Just a little bit about me before we start: I'm a software engineer at New Relic, and I work on the Pixie open source project. Pixie is a CNCF sandbox project that provides observability for your Kubernetes cluster, and it's super cool, so please check it out. I love observability and performance problems. Before this, and partially because of that, I worked in the data space, and a lot of what we see in observability is the same thing: observability problems are data problems. So that's the uniting thread behind some of the stuff I've worked on.

Today's talk is all about autoscaling Kubernetes deployments. First, we'll briefly touch on what Kubernetes autoscaling is and why you would do it. Then we'll cover more detailed stuff, like the different knobs that Kubernetes autoscaling provides us with. You have to know the right bottleneck in your application to make your autoscaling as performant as possible, so then we'll cover selecting the right autoscaling metric for your application. Finally, I love to push the boundaries of the technology I work in, and it turns out you can make a Turing-complete autoscaler for Kubernetes, so we'll be showing that at the end.

So let's get started. When you're sizing an application or Kubernetes cluster, how do you decide how many nodes there should be? How do you decide how many pods should be in your deployment? How do you know how many resources to give a pod? These are maybe obvious questions, but they don't necessarily have obvious answers, and the strategies people take for them vary. For example, there's the methodology of making a completely random guess and hoping it works. Or you copy-paste from some other thing that's already deployed on your cluster; this is one of the most common ones in my experience. You might be one of those people who is always thinking ahead and always proactively iterating: you're monitoring your deployments and seeing, oh shoot, CPU is a little bit close to the limit. But mostly, in practice, we reactively iterate. We see: oh shoot, my application is down. Oh my gosh, the pod is using all the CPU it's been given. Incident.

Autoscaling is intended to help solve some of these problems, and Kubernetes provides really, really good support for autoscaling. The thing is that the ideal resource allocation completely depends on your workload, and these workloads are spiky and often unpredictable. Here we have a screenshot of a workload that we're running. You can see that the traffic at one point can be over 10x what it is at a different point. And a lot of times these spikes will bite you when you least expect it; it's not something you can always plan for. If you have too few resources, you give a bad user experience: you have latency problems and even outages. This is a big no-no. But on the other hand, you can't just go allocate everything under the sun either; it's really expensive.

Kubernetes provides three types of autoscaling. We'll give a quick overview of two of them and then focus most of our time on the third. First, there's the cluster autoscaler. This adds and removes nodes in your cluster based on resource utilization like CPU. Next, there's the vertical pod autoscaler.
This adds resources like CPU and memory to your existing replicas, or pods, also based on resource utilization. Finally, we have the horizontal pod autoscaler. This is going to be the one we focus on most in this talk because it's extremely powerful. It adds additional replicas to your deployment. It can also look at resource utilization, but the really cool thing is that you can define your own metrics that you want to scale on. This makes it one of the most powerful features of Kubernetes, in my opinion. Also, I know the font might be small; I've tried to blow it up where possible, but for the people in the back, the slides are available on the conference schedule page, so if you want to follow along, you can. I'd feel kind of bad if you can't see it.

So let's quickly touch on the cluster autoscaler. I just wanted to give some quick tips for each of these scalers. One thing you have to know with the cluster autoscaler is that you have to set pod resource requests and limits. Otherwise, the cluster autoscaler can't really know how many nodes to add or take away. You also need to make sure that your resource requests reflect your actual usage. If I have a pod that's only using 10 percent CPU but I've requested 100 percent CPU, the autoscaler is going to treat it like a 100 and not like a 10, potentially leading to unnecessary waste and poor utilization.

You also want to make sure you specify pod disruption budgets. This one's really important for the cluster autoscaler, because when the cluster autoscaler adds or removes nodes, it's going to reschedule the pods in your cluster. And if you have a high-availability system, or you just have a pod where you need to make sure a replacement is running before you move the old one over, the pod disruption budget is the best way to make sure these workloads can be transitioned to the new nodes safely. You also need to make sure you don't do what happened to me this morning, where you try to scale beyond the limit your cloud provider actually gives you. You don't want to say "you can go up to 100 nodes" when your cloud provider caps you at 10. This is really important and can definitely cause an incident if you're not careful. One thing to note is that the Kubernetes contributors say in their documentation that they've tested the cluster autoscaler up to 1,000 nodes with 30 pods per node. So if you're running a mega workload that's bigger than this, you should keep that in mind when using the autoscaler.

Next, let's look at the vertical pod autoscaler. By the way, there's this really hilarious TikTok I saw that explains the difference between the vertical pod autoscaler and the horizontal pod autoscaler in terms of glasses of water, so definitely check that out if you're a TikTok user. With the vertical pod autoscaler, you can still set a resource cap. So when it's adding something like CPU, you set a cap so it doesn't just grow the CPU indefinitely. You may see a container restart or your pod get rescheduled because it's adding more CPU; maybe the pod needs to be moved to a different node to get that amount of CPU. You want to use the vertical pod autoscaler in conjunction with the cluster autoscaler, because otherwise you can have a situation where you're scaling up your deployments all together, needing many more resources than before, and you're also going to need more nodes to run those resources. (There's a sketch of these tips in manifest form below.)
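Before the last couple of caveats, here's a hedged sketch of two of those tips in manifest form: a pod disruption budget so pods move safely while nodes churn, and a vertical pod autoscaler with a hard cap so it can't grow resources forever. All names and numbers here are hypothetical, not from the talk's actual demos:

```yaml
# Hypothetical names throughout; adjust to your own deployment.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: echo-service-pdb
spec:
  minAvailable: 1              # keep at least one replica up while nodes are drained
  selector:
    matchLabels:
      app: echo-service
---
# A VPA with a hard resource cap, so it can't grow CPU/memory indefinitely.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: echo-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: echo-service
  updatePolicy:
    updateMode: "Auto"         # the VPA may evict pods to apply new requests
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        maxAllowed:            # the cap mentioned above
          cpu: "2"
          memory: 2Gi
```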
And finally, you can't autoscale resources like CPU and memory for a given replica and add replicas with the horizontal pod autoscaler at the same time for the same metric on the same application. So whether you're going to make your existing pods bigger or add more pods to a given deployment, you're going to want to pick a lane and choose either the horizontal pod autoscaler or the vertical pod autoscaler. And then finally, I don't know the exact specifics, but they note in their documentation that the vertical pod autoscaler has not been tested on large clusters. So it's something you'll always want to test on your own application.

OK, finally, we'll talk about my favorite: the horizontal pod autoscaler. This provides so much flexibility for metric selection, it's not even funny. And every application is its own special snowflake, so you can give it its own corresponding special snowflake metric. But you're going to want to check your service session affinity policies to ensure even load distribution. You don't want to scale up your application from two replicas to 100, only to discover that the first two are still getting all the traffic. That would be a complete waste of resources. Also, you need to make sure you set resource requests when you're scaling on CPU and memory. The horizontal pod autoscaler, when you're scaling on something like CPU, looks at the percentage of your request that you're utilizing. So if you don't have a request, it's never going to autoscale. And finally, as with the vertical pod autoscaler, we can't use these two in conjunction for the same workload on the same metric yet, but that is something the contributors are working on.

So let's get to it. I have a demo of horizontal pod autoscaling on CPU. I heard there's been a little network connectivity trouble, so I recorded it, which also allows us to watch it on fast forward. This demo exists in the Pixie demos repo, so you can actually do it yourself and try autoscaling on various different metrics. And I'll try to narrate for the people in the back, because I know the font's a little bit small.

OK, cool. What we have here is a view in Pixie, completely defined by a script that I've written, showing my application before autoscaling is added. Just for the people in the back: the upper left quadrant is requests per second for my service. The upper right quadrant is the HTTP latency, and there are three lines because we have P50, P90, and P99 latencies. On the bottom left, we have the CPU usage by pod. And on the bottom right, we have the number of pods for the service. Since this is before autoscaling, it's stuck at 1. So we're going to use Pixie to observe what's happening in the cluster as we autoscale up.

We're going to watch this, and what we're going to do is use my favorite load-generating application, called hey. It basically allows me to spam a ton of traffic to this endpoint. And this endpoint is an expensive endpoint; it's very CPU-intensive, so that's why we're going to scale on CPU. I'm spinning up 20 concurrent clients, each sending one query per second, which is a lot for this service. In the bottom left, we can see that one replica is running right now for my echo service. And the beauty of this is we can jump ahead and let the autoscaler do its thing. Oh, look: we have four new replicas. It's responded to the load. And now we know that this deployment needs more resources associated with it. So we can look in Pixie to see the spikes.
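While the recording plays, here's roughly what the setup behind a demo like this looks like: a deployment with CPU requests set (remember, without requests a CPU-based HPA never fires) plus an autoscaling/v2 HPA targeting CPU utilization. The names, image, and numbers are illustrative, not the exact manifests from the Pixie demos repo:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-service
  template:
    metadata:
      labels:
        app: echo-service
    spec:
      containers:
        - name: echo
          image: example/echo:latest   # placeholder image
          resources:
            requests:
              cpu: 100m                # utilization below is a percentage of this
            limits:
              cpu: 500m
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: echo-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: echo-service
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out when average CPU passes 60% of requests
```

The load itself would come from something like `hey -z 60s -c 20 -q 1 <endpoint>`: 20 concurrent workers, each capped at one query per second.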
Now the load has ended on the application. But we can see the requests per second went up, and the HTTP latency went up too, because the autoscaling hadn't kicked in yet. CPU usage went up. These latencies are very problematic, so we're counting on the autoscaler to quickly adjust so we can get to a much more user-friendly latency. We'll jump forward again, and the load has ended. What we can see is that it added three pods in response to the latency that we had. It wasn't running for very long, but you can see that near the tail end, the latency went down a lot. So it did its job. If this had been a persistent spike, we would have seen much better latencies as a result of this autoscaling. And we'll come back to this view later for other use cases. Oh yeah, this is just a screenshot of what we saw.

So you might be wondering: how does the horizontal pod autoscaler figure out how many replicas it needs to add? It's a very simple equation. It takes the current number of replicas and multiplies it by the ratio of the current value of your metric, like CPU or throughput, to the desired value for that metric: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). That's what it uses to determine the right number of replicas. But there's a lot more to it than just the metric.

So what knobs do we have? We have lots of questions. How do we set the minimum and maximum number of replicas? How often do we look for changes in the metric? I might have a super transient spike that I don't need to autoscale on, but transient to me might be different than transient to you; that might be something you want to autoscale on. How quickly do we want to add or remove pods? And each time we add or remove pods, do we want to cap the amount of change we allow in any given period?

On the right, what we have is a chart of my metric, and slightly lagging it, the number of pods. So when you cap the number of pods, you're basically squishing this curve. And I like to look at input and output waves because I come from the hardware space. What we see here is that for the same metric, you can squash down the maximum number of pods and put a floor on the minimum as well. This is good when you have resource constraints, or when you want to guarantee that at least a certain number of replicas are running. You also might add a stabilization period, which basically says: I don't want to make changes too quickly. I want to wait before making a change, to make sure it's actually something I need to scale on.

Now, looking at this chart, you might say: hey, this looks strictly worse than the one on the left. There's a period of time here where we actually don't have enough pods based on the metric we've defined. And you might be wondering why we would even do this. The reason is that it reduces pod churn. For example, consider a waveform like this, where the metric is going up and down and up and down and up and down. If we were to do the naive thing here and just add pods as soon as we can, as many as we can, we'd be churning all of these pods again and again and again. It's not efficient. It would be much better to take almost a low-pass-filter approach to this. You don't want your car jolting you every single time there's a bump; you want your car to provide a smooth ride. In the same way, you want your autoscaler to be smooth as well.
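Concretely, these knobs live in the behavior stanza of an autoscaling/v2 HorizontalPodAutoscaler. A hedged sketch, with illustrative numbers, showing only that stanza:

```yaml
# Only the behavior stanza of an autoscaling/v2 HPA spec is shown here.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0     # react to upward spikes immediately
    policies:
      - type: Pods
        value: 4                      # add at most 4 pods per 60-second period
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300   # wait 5 minutes before trusting that a spike is gone
    policies:
      - type: Percent
        value: 10                     # remove at most 10% of replicas per period
        periodSeconds: 60
```

Asymmetric settings like these encode the "scale up fast, scale down slowly" idea discussed next.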
So if you were to set a stabilization period for this, you could actually keep the same number of pods instead of having it go back and forth. This would be better, because you'd be ready for the spike to happen again in the near future. It's a similar thing with capping the max pods to add or remove in a period. Once again, if you look right here, it looks as if this is a strictly worse outcome than the one on the left, because you have a period of time where you don't actually have enough pods for the metric you're looking at. This is something you can just configure in the autoscaler like anything else; it's all specified in the YAML. But consider a workload like this, where you have a very brief, very transient spike in the metric. You do not need to spin up all of these pods just for this temporary spike. You probably want to react in a more limited way until you have more information that this is actually a persistent change in your workload. So that would be a use for capping the max step size, both up and down. And you might set these differently. You might say: hey, I want to scale up really quickly, but I want to scale down slowly, because I have trust issues about whether the spike has actually gone away.

All right, now let's get into some more of the meat: selecting an autoscaling metric for your application. Because the horizontal pod autoscaler is super, super powerful, it gives you lots of options for these metrics. So let's go over what they are.

The first one is the one that's shared with the cluster autoscaler and the vertical pod autoscaler. It's built in; you don't have to do anything special to use it other than write a YAML file. Basically, you can easily scale on resources like CPU and memory that Kubernetes already has definitions for. Number two, and this is one that we have lots of great demos on in that demos repo I told you about, is the custom metrics API. This is a user-defined metric, which means you get to say what you want the metric to be. These are metrics about Kubernetes resources, so you would define a metric for a pod or a service or something like that; they have to be associated with a particular resource. I've made all kinds of these. You can do latency, you can do throughput, you can do the depth of your queue. Really, you can do anything you've defined as the critical thing to scale on in your application. Finally, we have external metrics. This is another user-defined one, but the difference is that these are not associated with a particular Kubernetes resource. This might be a business metric, like the number of people using your application right now.

There are so many possible bottlenecks in your application. CPU, obviously. Memory, duh. Network: people don't always think about this one as much, but I've seen it happen in practice. The number of worker threads: maybe you only have two worker threads at once, and this pod simply cannot take more than that; it's just the way it is. Maybe it's using the GPU or something like that. The number of outbound connections: it might not be the case that your deployment is the bottleneck; there might actually be a downstream dependency, and it would be completely pointless to scale up on something like CPU when the thing isn't even CPU-bound. Queue depth, as we mentioned, is another big one that we see in practice a lot. And there are so many more. The best metric depends on your workload.
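As a sketch of what a custom metric looks like in an HPA spec: the metric name and target below are hypothetical, and you'd need a custom metrics API server (for example, one backed by Pixie or Prometheus) actually serving the metric for this to work.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 1
  maxReplicas: 15
  metrics:
    - type: Pods
      pods:
        metric:
          name: queue_depth           # hypothetical metric served by a custom metrics API
        target:
          type: AverageValue
          averageValue: "100"         # aim for roughly 100 queued items per pod
    # External metrics look similar but aren't tied to a Kubernetes object:
    # - type: External
    #   external:
    #     metric:
    #       name: active_users        # hypothetical business metric
    #     target:
    #       type: Value
    #       value: "5000"
```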
It's really important to characterize your workload before choosing a metric to autoscale on, and that's where using open source tools like Pixie can help. So let's look at an example application. In the upper left, it's the same view as the one we showed in the video: requests per second, also showing errors per second in green. What we're seeing for this application, before autoscaling, is that requests per second goes up to about 25 when it's under load. And when that happens, we're seeing a five-second latency. This is probably unacceptable. It might be fine for your workload, but let's just imagine the common case, where this is an unacceptable spike in latency. But the weird thing is that we're seeing low CPU; the CPU is hitting 8%. How is the latency so bad? That's because this is one of those workloads that actually has queueing latency: it puts items in a queue and then processes them at a certain speed. The bottleneck is not actually the CPU or the resources you would expect. But after we autoscale on latency, we can see that we've added pods; it's reached five, and this has had a huge impact, both on requests per second, which is now 4x higher, and on latency, which starts off in that problematic five-second zone but then drops to two, a huge improvement. And the CPU usage is stable as well. So this is a good example of a case where you really have to think through the metric you're autoscaling on.

Let's take another example. This is a really problematic workload. It's maybe a little bit hard to see, but over 90% of these requests under load are errors. And the latency drops when that starts to happen. What's going on? For this workload, CPU is really high. What's happening is that the workload has a certain capacity, and it starts turning away requests as soon as it reaches that capacity. Let's say I can only handle 10 things at once: I'm just going to send an error back and tell you, too bad, I'm full, go away. When that happens, the latency drops, because it's actually really fast to say "no, go away." It's a lot faster to do that than to actually do the work of the request. So if we were to scale on latency here, it would do the inverse of what we want. We don't want to scale on latency when latency is actually lower in our problematic state. So what can we do instead? We can scale using a custom metric for error rate. This is something we built out in the Pixie demos as well: you can look at the error rate in your application and provide a custom metric to Kubernetes based on it. What we see here is that in the beginning, most of the requests are returning errors; the service is saying, I'm full, I can't take your request. But after we autoscale, it adds nine more replicas, and we see a huge drop in the number of errors. We see the latency rise as well, but that's expected: more of these requests are actually being handled.

Okay, so now for something totally deranged. For those of us who are a little bit far away from our CS classes, let's quickly review what a Turing machine is and what Turing completeness is before we get into it. A Turing machine is a theoretical construct that is capable of any computation, given enough time and tape. What do I mean by tape? A Turing machine takes an input program, does the computation, and writes an output value to a theoretical tape.
And something is Turing complete when you can do any possible computation on that thing. So, for example, if all I had was the ability to add one, that wouldn't be Turing complete, because that instruction is not expressive enough to do any arbitrary computation I might want to do. But there is a single instruction that is Turing complete on its own. It's called subleq, for "subtract and branch if less than or equal to zero." Actually, I'm not sure exactly how it's pronounced, because I've only read about it, so if that's completely wrong, I apologize. Anyway, subleq is a one-instruction-set computer, and it's sufficient for Turing completeness on its own. It may be very convoluted, but you can write any program you want using exclusively subleq. What does it do? It takes two values and subtracts one from the other. If the result is less than or equal to zero, it jumps to the address provided in the instruction; otherwise, it just continues on with the program like normal. I'll leave the proof of why this is Turing complete to the mathematicians.

So how are we going to make a horizontal pod autoscaler Turing complete? What we're going to do is have the autoscaler evaluate a different subleq instruction every interval that it's queried for its metrics. We'll take a subleq program, execute one instruction at a time, and then set the number of output pods, and that output is the result of the computation. So you can think of the tape as the time series of the number of pods over time; that's the output value. How do we get the input program? Well, I had to get a little bit creative. This is my input program: it's encoded in the deployment name. We split on "x", and now we have a series of numbers, with three input arguments per instruction. So you can specify a deployment name like this and say, I want to autoscale on this deployment, and that's how the autoscaler loads in the input program. Like I said, this is totally deranged, and just a way of demonstrating the power of the horizontal pod autoscaler. How do we set a certain number of output pods? We take the replica equation from before and turn it on its head: the autoscaler looks at the current number of pods and backwards-calculates what the metric value has to be so that the equation yields exactly the number of output pods we want.

So let's take a look and see how it actually works. I'll try to narrate it for the people in the back, and we're going to play this at 2x, because that is the power of a video. This is completely open source, so you can look at it and try it out yourself. I've created my metrics API server. I have a deployment, which has a funky name, like we saw. And what we're going to do is watch the output number of pods over time and make sure it hits the values we want. You might be wondering what this program does: it prints out "Hi" in ASCII, but it could do anything. So 72 corresponds to the ASCII value for capital H. You can see next that the autoscaler has computed a value of 105, which is ASCII for lowercase i. It happened really fast because I've sped it up. So you can see now it's hitting this value up here. Well, it's actually 107, because I add two to every result so we never have a negative number of pods. But the point is that we're seeing the number of pods hit the result of the computation. It started at 72, or 74 once you add two, then 105, or 107 once you add two. And then the program terminates. So there you go: this autoscaler outputs "Hi."
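If you want the core trick on paper, here is a minimal subleq interpreter sketch in Go: just the single instruction itself, not the actual metrics server from the demo. The sample program and the output hook are made up for illustration.

```go
package main

import "fmt"

// subleq runs a one-instruction-set program. Each step reads three operands
// a, b, c; computes mem[b] -= mem[a]; and jumps to c when the result is <= 0.
// A negative jump target halts the machine.
func subleq(mem []int, emit func(int)) {
	pc := 0
	for pc >= 0 && pc+2 < len(mem) {
		a, b, c := mem[pc], mem[pc+1], mem[pc+2]
		mem[b] -= mem[a]
		emit(mem[b]) // the "tape": in the autoscaler, each result drives the replica count
		if mem[b] <= 0 {
			pc = c // a negative c falls out of the loop and halts
		} else {
			pc += 3
		}
	}
}

func main() {
	// A tiny hand-written program (illustrative, not the one from the demo).
	// Cells 0-5 hold two instructions; cells 9-11 hold data.
	mem := []int{
		9, 10, 3, // mem[10] -= mem[9]  -> -2, which is <= 0, so jump to 3
		10, 11, -1, // mem[11] -= mem[10] -> -3, which is <= 0, so jump to -1 (halt)
		0, 0, 0, // padding
		5, 3, -5, // data
	}
	subleq(mem, func(v int) { fmt.Println("step result:", v) })
}
```

In the talk's autoscaler, each emitted value (plus two, to keep it positive) is converted into the metric value that forces the HPA to exactly that replica count.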
Okay, great. So this is kind of the end of the prepared remarks, but I just wanted to give some shout-outs. To Kubernetes SIG Instrumentation: they've made it so easy to get started with defining your own custom metric for horizontal pod autoscaling, so definitely check them out if you want to do that. Also check out Pixie, which I used for a lot of these demos. And finally, thanks to Jaana Dogan, who made the awesome load generator hey. If you have questions, raise your hand. Okay, one over there. Oh, I forgot to say: we also have swag at the back for the Pixie open source project if you want to pick it up.

Hey there, thank you for your really great talk. I have a question: did you use the vertical autoscaler on OpenShift as well? Do you have experience with that?

Using autoscaling with OpenShift? I don't have direct experience using it with OpenShift, but I assume it would work just fine, because OpenShift should support most Kubernetes features and this is a built-in Kubernetes feature. But if you want to give it a try, you can use that Pixie demos repo I mentioned: just deploy it on your application and see if it works. You have a question?

You didn't say anything about limits. You talked about requests and not limits. I know limits are considered bad usage, supposed to be deprecated, I think, but we should say a few words so people won't misuse them.

Yeah, that's a good point. Oh, that was more of a statement than a question. Yeah. At the back, where you're leaving, could you keep the noise down? Some people are still trying to follow along. Thank you. Who's got a question? No, no more. Yeah, you need to hold your hand up. Hi.

Okay, so what's going to happen when the autoscaler hits a limit of some resource on the cluster?

So are you saying that it's targeting a value above the limit of the cluster resources, or that it hits a limit you have set? The limit of the cluster resources. Yeah, so what you'll probably see in that scenario is that the pods the horizontal pod autoscaler adds will be stuck pending. That's why it's really important to always check your autoscaling policies against the limits that you have, especially with your cloud provider, because you don't want a bunch of pods stuck in a pending state. One more over here.

Hi Natalie, thanks first for your great and structured talk. One question about lessons learned. Using the HPA with a sidecar proxy like Istio, we have one pod ending up with two containers, and the HPA is always using the average of the predefined metrics across them. So I could have Istio using 200% of its requested CPU and my application only 50%, and I end up with a very messed-up state. Do you have lessons learned there, or best practices?

Yeah, that's a really great point. I think it really depends on what you want to scale on. If you're trying to look at the joint utilization of the entire pod across containers, then the built-in behavior can work just fine. But in your case, if you care a lot more about one of those containers than the other, you might want to create a custom metric for the one you care about, because that's your way of specifying the exact thing you want it to act on.

Okay, I'm going to read out... oh, I think there was one behind you, actually. I'm still going to read out a question from online; a lot of people are following along online, and some of them have asked questions. The first one was: could you reshare the demo URL? Maybe you could just put that slide back up. Oh, yeah.
And the second one was: is it better to use a small number of large pods or a large number of small pods? That is such a good question, and there's not really a one-size-fits-all answer; it really depends on your application. What I would generally say is that a lot of small pods can be really good for a more stateless application, while for a more stateful application you might prefer to size the pods bigger. But that's just something I've seen in practice, not a one-size-fits-all rule. I think that the more you can build your application to scale out proportionally across pods, rather than having one mega pod that's more of a pet, the more canonical it is to run many pods rather than just one mega pod. But there are some cases where you have a stateful application and you really do need a certain amount of heavy resources for that pod. Thank you, next question.

Yeah, so you said the resource requests for the HPA are important. So is it recommended to mix the cluster autoscaler together with the HPA? Yes, those two can play together. The only ones you have to be careful about are the vertical pod autoscaler with the horizontal pod autoscaler, if you're scaling on the same metric.

Hi, so do you have experience or strategies for predictive autoscaling as well, rather than reacting to metrics? So what you're asking is: as opposed to responding to a change in a metric that you've observed, you want to actually predict that the metric will rise and autoscale based on that. Yeah, it's a really interesting question. I think the best practice for that today is to create a custom metric that actually does the prediction and autoscale based on that. The specifics of how to do that would probably depend on what you're predicting, but at the end of the day, a prediction is just another metric.

Hi there, sorry if I missed this at the beginning, but how does this differ from using something like, say, KEDA to manage the autoscaling of pods? Because KEDA supports custom metrics through Prometheus. So I was just wondering, what advantages does this have over KEDA, and vice versa? Yeah, for sure. So what I've covered today is the stuff that's built into Kubernetes, the stuff that anyone using Kubernetes can use. There are cloud providers with their own node scalers, and KEDA also provides great autoscaling. So what I would say is: if you're just looking to use the Kubernetes API, you would use these things. But if you're already a user of something that provides autoscaling, you'd have to compare it against that and see which features meet your use case the best. And I realize that's a non-answer, but everyone's situation is very different. Okay, I think this is going to be the last one.

Hi there. Is it possible to scale using multiple metrics, or do you just have to choose one? It is possible to scale using multiple metrics. The autoscaler computes a desired replica count for each metric independently, and the highest one wins. So when the metrics disagree, it defers to whichever one asks for the most replicas. Okay, thank you very much. Okay, I think we're at time. Thanks everyone.
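To make that last answer concrete, here is a hedged sketch of an HPA with two metrics; the names and targets are hypothetical. Kubernetes computes a replica count for each metric and scales to the highest:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: echo-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: echo-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60       # replica count computed from CPU...
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "50"           # ...and from throughput; the larger wins
```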