This is the SIG Autoscaling update, and we're going to tell you about a lot of exciting things we've been doing since the last time we all got together. Let me take my mask off. OK. Why don't we go through some introductions to begin with. My name is Michael McCune. I work for Red Hat. I'm an engineer; we're all engineers who work in the SIG Autoscaling community. This is Joachim Bartosik from Google, Guy Templeton from Skyscanner, and David Morrison from Airbnb.

When, and if, we can get the slides going, we're going to talk to you about some changes that have been happening in the Horizontal Pod Autoscaler. We've got an update about the v2 API that's been released and the deprecation notices around v2beta1 and v2beta2. David's going to talk a little bit about the gRPC additions that have come into the cluster autoscaler recently, specifically the gRPC expander that's been written; there's also a gRPC cloud provider coming as well, and in fact it's already there. Then Joachim's going to talk about the changes that have been happening in the Vertical Pod Autoscaler. And Guy's going to bring it home and talk a little bit about the community and how you all can get involved: give us bug reports or pull requests, or tell us our documentation stinks and we should make it better. We need more testing. So I don't know, how are we doing on the tech situation here? I guess OK. Does anybody know any good jokes? We just can't win for losing here, you know?

All right. As many of you may or may not know, for several releases now the Horizontal Pod Autoscaler community has been working to release a v2 API. This work has been ongoing since before 1.22, I think. It started off with v2beta1, migrated to v2beta2 status, and just recently we've merged the PR to make v2 stable and put out the deprecation notices for the previous versions. So if you're using v2beta1 (and I imagine most people are not) you should be aware that we're coming up on 1.25, which will be the end of life for v2beta1. v2beta2 is also in deprecation right now, but I think it will exist until 1.26, and then it will be removed as well.

For the most part, if you've been using the Horizontal Pod Autoscaler through the v2beta1 or v2beta2 APIs, you won't have to change much. The serialization format has not changed from v2beta2 to v2. There have been some changes to the programmatic interface, so if you're writing code that uses the HPA types as a library, you'll probably want to look at the release notes on the PR; a couple of functions changed names, but that was all it looked like to me. For the most part, it should be a seamless transition from v2beta1 or v2beta2 to the v2 API.

Let's just see if this is going to end. So yeah, if you're using the Horizontal Pod Autoscaler, you shouldn't really have to change too much. Oh, thanks, yeah, that actually helps. Let's go back to it. So: no changes to the serialization format. And for all of you taking notes out there, if you're really curious about this, go to the kubernetes/kubernetes repo and look for pull request 102534. That's where the merge happened, and you can read the release notes there. That's pretty much the end of what I had to say.
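As a rough illustration of what that migration looks like in practice: manifests only need their apiVersion bumped from autoscaling/v2beta2 to autoscaling/v2, and Go code against client-go is mostly a method-group swap. The sketch below is illustrative rather than taken from the talk; the namespace, HPA name, and kubeconfig handling are made up, and it assumes a client-go release that already ships the AutoscalingV2 client (roughly Kubernetes 1.23 or newer).

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the local kubeconfig (illustrative only).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Before: client.AutoscalingV2beta2().HorizontalPodAutoscalers(...)
	// After graduation, the same object is read through the stable
	// AutoscalingV2 group; the serialized spec itself is unchanged.
	hpa, err := client.AutoscalingV2().HorizontalPodAutoscalers("default").
		Get(context.TODO(), "my-app", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: maxReplicas=%d\n", hpa.Name, hpa.Spec.MaxReplicas)
}
```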
And unfortunately for you all, I only had a couple of slides with a little bit of text on them. David and Joachim have a lot of slides with more text and graphics, and I'm going to hand it over to David soon. But you're going to have to kind of imagine what he's talking about until... well, maybe? Just in time. So with that, I will hand it over to David to talk about... oh, come on, don't. If we have to present it for you, we'll just keep going. OK. All right, don't touch. So I'm going to hand it over to David. He's going to tell you all about the cluster autoscaler and gRPC and all that cool stuff. Take it away.

Thanks. Like he said, my slides have a lot of text on them, which is not my normal deal, so I was kind of excited to not have to show you walls of text. But here we are. I'm David Morrison. I'm a staff software engineer at Airbnb. I work on the Compute Infrastructure team, and specifically I do a lot with scheduling, autoscaling, and cluster efficiency for all of our Kubernetes clusters. Today I want to talk to you about the custom expander interface that we contributed to the cluster autoscaler. A bunch of this work was actually done by one of my colleagues, Evan Cheng, who unfortunately wasn't able to be here. Joint effort; all of us SIG Autoscaling folks, it's all good. So yeah, let's crack on.

Just as a quick reminder, what is an expander? This is a code snippet from the cluster autoscaler code, the scale-up function. The first thing the cluster autoscaler does is look for all of the unschedulable pods, the things that can't fit anywhere on the cluster. Then it looks at all of the different node groups it has available. A node group, if you're using AWS for example, might be an auto scaling group. As a reminder, one of the cluster autoscaler's requirements is that all of the nodes in a node group have to be identical from an autoscaling perspective: they all have to have the same amount of CPU, the same resources, and if you're using things like pod topology spread, they all have to be in the same AZ, et cetera. Then it takes all of the node groups that can accommodate the unschedulable pods and passes them off to the expander; that's the line highlighted in blue, the call to the best-option function. The expander does its thing, computes one or more node groups that it wants to scale up, and hands that back to the cluster autoscaler to do the hard work of actually adding new nodes. So we're going to be focusing today on that best-option function.
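To make that concrete, the expander interface inside the cluster autoscaler looks roughly like the sketch below. This is paraphrased from memory rather than copied from the source, so treat the exact type and field names as approximate; in the real code the option carries an actual node-group handle and richer per-node scheduling information.

```go
// Paraphrased sketch of the cluster autoscaler's expander interface; names
// are approximate rather than a verbatim copy of the real source.
package expander

import apiv1 "k8s.io/api/core/v1"

// Option is one candidate scale-up: a node group, how many nodes to add,
// and which pending pods that choice would help schedule.
type Option struct {
	NodeGroupID string
	NodeCount   int
	Pods        []*apiv1.Pod
}

// Strategy is what every expander (random, most-pods, least-waste, price,
// priority, and now grpc) implements: given the viable options and a sample
// node for each group, pick the best one or more to scale up.
type Strategy interface {
	BestOption(options []Option, nodeInfo map[string]*apiv1.Node) *Option
}
```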
So again, just as a reminder, what types of expanders does the cluster autoscaler have available? The default is random, which does what it says on the tin: it picks a node group at random and scales it up. Most-pods and least-waste are kind of complementary to each other. Most-pods tries to pick the node group that will schedule the most unschedulable pods; least-waste does the converse and picks the node group that, once scaled up, will have the least unused resources (CPU, memory, et cetera). If you're on GKE, you can use the price expander to pick the cheapest node group, which is great if you're on GKE. And then the one we've been using at Airbnb up until now is the priority expander; I think this is the one a lot of people use in practice. You specify a prioritized list of your node groups, and the expander will pick the highest-priority node group that can accommodate your unschedulable pods.

So the one we're going to talk about today, the new one, is the custom gRPC expander. Let's talk a little bit about why we wanted to write a new one of these things. We looked at what was previously available in the cluster autoscaler, and none of the expanders quite did what we wanted. We wanted to be able to change things dynamically, on the fly. You can kind of do that with the priority one, since you can update the priority ConfigMap and it'll take that into account, but we really wanted more complicated logic. So then we talked about just building a new expander and contributing it upstream. We really don't want to be running a fork of the cluster autoscaler; we'd like to stay with what's available to everyone else. But we couldn't come up with something that would both solve our business needs and be appropriate to upstream. So we finally settled on this gRPC expander.

This has a couple of benefits. The first is that it allows us to encode our business-specific scaling logic. It lets us do things like take our AWS contract into account, which might change as we renegotiate. We're also looking at running more spot instances, and spot depends a lot on the time of day, on the price, and on what else is running in our cluster, so it's hard to generalize that. The other thing we were concerned about is that the cluster autoscaler releases roughly in lockstep with mainline Kubernetes, and we needed more flexibility: if our traffic patterns change, or our contract changes, or anything else changes, we want to be able to update our scaling logic. So what we settled on is building an interface that lets us encapsulate all of our business-specific logic in a separate service, which just talks to the cluster autoscaler over gRPC. That's what we did.

This is the one and only diagram in my section of the talk, so I hope you like it. On the left, you've got a node in your cluster, and the cluster autoscaler is running on it somewhere. What we built is a gRPC client inside the cluster autoscaler that conforms to the expander interface, so it's got that best-options call. What it does is translate the expander parameters into a protobuf and pass that over the network to some other service, sitting somewhere else in your cluster, that acts as a gRPC server. That service takes all of the options the cluster autoscaler provided, runs whatever business logic you want, and returns its choice back to the cluster autoscaler, where the gRPC client translates it back into cluster autoscaler lingo, and things go on from there.

Let's look briefly at what the interface looks like. This is the protobuf interface, and it's pretty straightforward. The service has one RPC, BestOptions, which takes a BestOptionsRequest and returns a BestOptionsResponse. The request has all of the node group options that are available, plus a node info map that describes what the nodes look like in each of those options. The response just returns the one or more node groups that it would like to scale up.
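To give a feel for the shape of that service, here is a compilable sketch of a BestOptions handler using the same toy logic David describes next: pick the node group with the longest name. The request and response types here are stand-ins I've defined locally; in the real implementation they come from the cluster autoscaler's generated gRPC/protobuf package, and the handler would be registered on a standard grpc.Server behind TLS rather than called directly.

```go
package main

import (
	"context"
	"fmt"
)

// Stand-ins for the generated protobuf types. The real ones live in the
// cluster autoscaler's gRPC expander protos and carry more detail (the node
// info map, debug strings, the pending pods, and so on).
type Option struct {
	NodeGroupID string
	NodeCount   int32
}

type BestOptionsRequest struct {
	Options []Option
}

type BestOptionsResponse struct {
	Options []Option
}

// longestNameExpander implements the single BestOptions RPC described in the
// talk. A real server would embed the generated server struct and be
// registered on a grpc.Server that the cluster autoscaler dials over TLS.
type longestNameExpander struct{}

func (e *longestNameExpander) BestOptions(ctx context.Context, req *BestOptionsRequest) (*BestOptionsResponse, error) {
	if len(req.Options) == 0 {
		return &BestOptionsResponse{}, nil
	}
	// Toy business logic: pick the node group with the longest name.
	best := req.Options[0]
	for _, opt := range req.Options[1:] {
		if len(opt.NodeGroupID) > len(best.NodeGroupID) {
			best = opt
		}
	}
	return &BestOptionsResponse{Options: []Option{best}}, nil
}

func main() {
	// Tiny smoke test in place of the real gRPC plumbing.
	e := &longestNameExpander{}
	resp, _ := e.BestOptions(context.Background(), &BestOptionsRequest{
		Options: []Option{{NodeGroupID: "spot-small"}, {NodeGroupID: "on-demand-xlarge"}},
	})
	fmt.Println("chose:", resp.Options[0].NodeGroupID)
}
```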
So here's some really simple expander code. The goal here is to show you what it looks like to write one of these things; it's not actually that hard. You do your standard boilerplate to set up your gRPC server: you create a new expander instance, you register it, and you start listening for requests from the cluster autoscaler. And here is an example of what the BestOptions function inside your custom expander might look like. This is a really dumb BestOptions function, and I don't recommend you actually use it; it just picks the node group that has the longest name. I guess that might be useful for somebody. But it's pretty straightforward: it takes in the request, returns a response, and you're done.

The only other slightly tricky thing, and it's not actually that tricky, is configuring the cluster autoscaler to talk to your expander service. You just have to pass three command-line arguments to your cluster autoscaler invocation. The first one is --expander. Another feature that landed in the cluster autoscaler fairly recently is that you can specify multiple expanders, and we recommend doing that when you're using the custom expander. So here we're saying gRPC is your first expander, and then if something goes wrong (a network partition, your expander service crashing for whatever reason) it falls back onto the priority expander. It's definitely worth having a fallback there. The other two arguments just tell the cluster autoscaler how to talk to the expander: the gRPC expander URL flag tells it where to send requests, and the gRPC expander cert flag points at the TLS certificates used to encrypt that communication. So it's pretty straightforward. If you've done anything like this before, the gRPC cloud provider has a similar pattern, and if you've done anything with admission controllers in Kubernetes, it's all the same sort of pattern. Nothing too exciting or complicated here.

I'm just going to finish up with a bunch of links; you can download the slides from the Sched site. The first is a link to our design proposal. The second is a link to the actual pull request that got upstreamed. Inside there, there's a README that's basically a text version of what I just said, and there's also some example code, so if this is something you're interested in, that example code is a good jumping-off point for you. I don't think it's gone live yet, but we have also written a blog post about the expander work, as well as a few of the other contributions we've made to the cluster autoscaler. I think that's going to go live on our blog either today or tomorrow, not sure exactly when, and I'll update the slides once it's live; for right now, that's just a generic link to our engineering blog. Hopefully you can read that and get some interesting information out of it as well. That's all I have to say about expanders. I'm going to hand it off now to Joachim to talk about the VPA.

Thanks. So I'm going to give you updates about the VPA. I have a few things to talk about. First, I'll talk about what the VPA does. Then I'll talk about the enhancements we've introduced. And finally, I'll talk a little bit about our releases. So what does the VPA do? We do three things.
And corresponding to those, we have three components and three modes of operation. The first thing we always do when you create a VPA object is generate recommendations and record them in the VPA object; the component responsible for that is the recommender. The second thing is that you might want to apply those recommendations. For that we have the admission controller, which applies recommendations to pods when they start. It operates only in the Auto and Initial modes, so you can turn it off. And finally we have the updater component, which evicts your pods so that, when they are recreated by their controller, the admission controller can apply the recommendations. You turn that on by choosing the Auto or Recreate mode, if you want your recommendations applied automatically.

Now, on to the first improvement: alternative recommender support. Why did we introduce this enhancement? Because it's very hard to write one algorithm that generates very good recommendations for all possible workloads. For example, many workloads have a weekly usage pattern, and for those, the default eight-day window of data that we look at is enough. But other workloads might have longer patterns, monthly for example, and for those an eight-day window is too short. On the other hand, we don't want to make the window very long for everyone, because then our reaction would be slow. So it would be good to choose different windows for different workloads. Similarly, workloads that answer user queries might want to react very quickly to a load spike, while other workloads don't need to react quickly and can spread out their work and conserve resources.

So to allow that, we now let you choose from multiple recommenders. You run multiple recommenders, and then you choose which one to use in your VPA object, simply by specifying the name of the recommender you want to use for that object. As you can see here, multiple recommenders will read the object, but only one actually writes the recommendation. To choose from multiple recommenders, though, you need to actually run multiple recommenders in your cluster, and to do that, right now, you have to implement your own recommender. I hope that will change, but for now you have to write some code. When you do, it's very important to remember that only one recommender should recognize any given name, because if two recommenders try to write to the same VPA object, the recommendation will flap very quickly and it won't work very well. On the other hand, one recommender may recognize multiple names. For example, the default recommender writes the recommendation either if no recommender name is specified, because that was the behavior before, or if you explicitly specify that the default recommender should write it.
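As a concrete illustration of that selection, here's a minimal sketch of a VPA object that pins itself to a named recommender. It's built as plain unstructured data so it doesn't depend on the VPA Go types, and the recommenders field name and the recommender name itself are my recollection of the enhancement rather than something quoted in the talk, so check them against the VPA version you run.

```go
package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

func main() {
	// A VPA object that asks a specific, non-default recommender to own it.
	// Field names are my best recollection of the alternative-recommender
	// enhancement; verify against your VPA release before relying on them.
	vpa := map[string]interface{}{
		"apiVersion": "autoscaling.k8s.io/v1",
		"kind":       "VerticalPodAutoscaler",
		"metadata":   map[string]interface{}{"name": "my-app-vpa"},
		"spec": map[string]interface{}{
			"targetRef": map[string]interface{}{
				"apiVersion": "apps/v1",
				"kind":       "Deployment",
				"name":       "my-app",
			},
			// Only the recommender that recognizes this name should write
			// a recommendation into the object's status.
			"recommenders": []interface{}{
				map[string]interface{}{"name": "monthly-pattern-recommender"},
			},
			"updatePolicy": map[string]interface{}{"updateMode": "Auto"},
		},
	}

	out, err := yaml.Marshal(vpa)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```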
OK, the other enhancement we introduced is per-VPA-object minReplicas. By default, the VPA will evict your pods only if there are at least two running pods in your controller. We do that because if there is only one pod running, it's pretty likely we would disrupt your workload by evicting that only pod. But that's not good behavior for everyone: if your controller runs only one pod by design, then we will never apply recommendations, and that might not be what you want.

Before, there was a flag you could set to change this behavior for all VPA objects in your cluster. But again, that might not be what you want, because you might be running different kinds of workloads: for some, you want to wait for multiple pods to appear, so that at least some keep working; for others, you want to evict the only pod in order to apply recommendations. The other problem was that sometimes you're not the one controlling your cluster, somebody else is, and then you couldn't set that flag. So this has changed from a global updater flag to a per-object setting. Using it is pretty simple; this time you don't even have to write any code. You just specify how many pods you want running in your controller before we start evicting them, and it will work.
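As a sketch of what that override looks like on the wire, the snippet below builds a merge patch that sets it on an existing VPA object. The placement of minReplicas under spec.updatePolicy is my recollection of the enhancement rather than something spelled out in the talk, so verify it against your VPA release.

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// A merge patch you could apply to an existing VPA object to opt a
	// single-replica workload into eviction. minReplicas: 1 lets the updater
	// evict a pod even when it is the only one running, instead of the
	// global default of two.
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"updatePolicy": map[string]interface{}{
				"minReplicas": 1,
			},
		},
	}
	b, err := json.Marshal(patch)
	if err != nil {
		panic(err)
	}
	// e.g. kubectl patch vpa my-app-vpa --type=merge -p "$(this JSON)"
	fmt.Println(string(b))
}
```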
Finally, we used to do releases ad hoc, and that resulted in some pretty long gaps between releases. Going forward, we would like to do a release every time there is a Kubernetes release, so three times a year. If you would like to learn more, I've linked to the enhancement proposals for the features I talked about. Also, if you want to help, we have some ideas. I would like to make it possible to pass parameters to recommenders when you specify them. It would be nice to be able to expose multiple recommendations in one VPA object, so that you can preview what different recommenders would recommend before you actually make the switch. And it would be good to make the default recommender more configurable, so that you can run multiple instances of it as different recommenders without writing your own code. And with that, I'm handing over. Thank you.

So, you've heard a lot about the new features and new extensibility that we've shipped over the last year and a bit. However, this is where we need your help. You've heard ideas for future improvements to the VPA, and the cluster autoscaler is improving its extensibility; people can write their own gRPC expanders now. So we need your help. We own a lot of projects (you've heard about the HPA, the cluster autoscaler, the VPA) and we only have a small number of maintainers. We can't do all that we want to do with the number of engineers and contributors we currently have. So we want your feature requests, implementations, and bug reports: what's not working for you, what you would like to see work better. We also need your help with bug triage and response; we don't have enough contributors at the moment to triage all the bugs that get reported, and we'd really love to improve that. The same goes for infrastructure improvements, to improve the testing around all of our components and make it easier for end users to contribute. And we definitely want to expand our maintainer base as well. At the moment, despite the new extensibility with gRPC expanders and potentially gRPC cloud providers, most of the big cloud providers are baked into the cluster autoscaler, so we need maintainers for those cloud provider implementations. A lot of the big cloud providers do support that, but we need more people to improve the responsiveness and improve how often we're releasing the cluster autoscaler. And as you've already heard, we want to improve the extensibility so that you're not dependent on us cutting releases: if you want to bake in business logic, whether it's an expander or a cloud provider implementation, you can do that out of band as well.

We've got links here to our community charter, if you want to get involved. We've also got a mailing list; it's not massively active, but joining it also lets you edit our meeting agenda doc and put things on the agenda. And we're fairly active on the SIG Autoscaling Slack channel. There's also a SIG Autoscaling API channel for discussion about things like the HPA v2 API and potential future improvements to the API side of things. We'd love to have you involved. Any questions?

[Inaudible question from the audience.] That's awesome. Can you say that out loud so everyone can hear it? Sorry. So yeah, my question was: you can use the VPA and HPA together, but you have limitations due to the VPA. I was just wondering if there's a roadmap on that, some dates. So, there is some work on making the VPA and HPA work together, but it's very early as well, so I wouldn't hold my breath for it. Any other questions?

Thank you for a great presentation. The cluster autoscaler now has custom expanders; is that a pattern you're looking into for the VPA as well? So, I haven't thought about it yet, but if you have ideas, feel free to come to the SIG and we can talk about it. Who else?

The minReplicas feature in the VPA, the new one, is for disruption purposes, right? And if it is, can't we just use the pod disruption budget for that value? OK, so the VPA also respects pod disruption budgets, but there are extra safeguards in the VPA: if there is only one pod running in your controller, the VPA won't evict that only pod, even if the pod disruption budget would allow it. Does that answer your question? OK, I mean, if I have a pod disruption budget of, like, three, then the only place this helps is if there's only one pod? Can we get a little community effort here? Yes: if you have a pod disruption budget that says you want at least three pods running, then this flag will not change anything for you. This is an extra check we do in the VPA, in addition to really following whatever restrictions the pod disruption budget sets. So you have both. But you could have a pod disruption budget that allows evicting all the pods, and even then, before this change, the VPA wouldn't do it, because we had an extra check inside the VPA that said, hey, there is only one pod, and we wouldn't even try to evict it. OK?

So we have a couple of online questions as well. The most voted one right now asks: is there a predictive autoscaling algorithm? In my opinion, predictive autoscaling is a very complex topic, and we do not have anything like that currently. But does anyone else want to speak to predictive autoscaling? It involves a lot of machine learning and a lot of mistakes and those kinds of things. So no, we do not have a predictive autoscaler included in any of this right now. I've seen many attempts over the years by people trying to use data science to analyze the usage patterns in their clusters and build models they could use to predict how they're going to autoscale. In my opinion, you'd probably be better off just looking at the traffic logs for whatever you're doing: we know that Friday at 9 p.m. we get hit, so we already know the prediction; Friday at 9 p.m., add some more nodes. Doing things in a more dynamic way is a little more complicated. Did you want to speak to that? Yeah, go ahead. I've actually tried to write one of those predictive autoscalers in a past life.
I guess I'll say there's a reason I've switched to using the cluster autoscaler and just watching the number of unschedulable pods: it generally seems to work better, in my opinion. Cool, thanks.

So there are a couple more questions here. One question is: where can we find the slides? I don't think they're on the Sched page yet, but they will be. So look on the Sched page for this presentation; you can download them there, and they have all the links in them.

There's one other really cryptic question, and maybe this is a good one to end on, because it seems like kind of an existential question. It just says: what about node auto scaling? What about it? Yes. I'm not sure how deep to go; there is a cluster autoscaler, and it does node auto scaling. Maybe this person was asking about some sort of vertical node scaling or something. Something like Karpenter. Yeah, right, exactly. Something like Karpenter, where you've got different autoscaling primitives and whatnot, and it's selecting machines in a different way than the Kubernetes cluster autoscaler does. So, what about node autoscaling? I think, as a SIG, we think node autoscaling is pretty cool. We like it. We'd like to see more of it. We'd like to see the community coming and telling us what they want to see in it. Whoever asked this question: if you have suggestions, open an issue on the kubernetes/autoscaler repo. And yeah, maybe that's a good place to end it.