Thank you for turning up at nearly half five on the second day of the conference. We appreciate it. We are going to go through some of the SIG updates from the six months since the last KubeCon, as well as some highlights of the improvement work we have been doing. We will give a quick introduction to the SIG for those who are not familiar: the sub-projects we are responsible for, what they do, the problems they cover, et cetera. We will cover the adoption of Karpenter, one of the big things that has happened since the last KubeCon, some updates on Cluster Autoscaler, and a quick update on the VPA. Finally, we will talk about future work the SIG would like to do. This is also your opportunity to get involved in the SIG, and we will point out areas where you can help.

In terms of the areas covered by the SIG, this is where SIG naming can get a bit confusing, because you might look at SIG Autoscaling and SIG Scalability and wonder what the difference is. SIG Autoscaling is focused on the scaling of clusters, nodes, and workloads. We have a few different sub-projects intended to address these different dimensions, these different problem spaces. For the scaling of clusters and nodes, Cluster Autoscaler has historically been the project the SIG owns that covers this. We are also in the process of adopting Karpenter, a project you might have heard mentioned if you were in the previous talk in this room; it has been talked about at past KubeCons as well. It is another project aiming to solve the problem of scaling clusters and nodes. For scaling workloads, we have horizontal scaling, vertical scaling, and multi-dimensional scaling, as well as a couple of other projects: the Balancer and the addon-resizer. I'll give you a quick overview of each of them.

Cluster Autoscaler is responsible for monitoring for unschedulable pods and provisioning nodes in response. I'll cover how Cluster Autoscaler and Karpenter differ in a bit more detail when talking about the adoption. It is also responsible for removing underutilised nodes while respecting PodDisruptionBudgets and other constraints. Everyone is focused on making their clusters more sustainable and making their clusters cost less, and that is how Cluster Autoscaler aims to help: your workloads scale down at times of low traffic, your nodes become underutilised, and Cluster Autoscaler takes care of that for you. It performs scheduling simulations based on the declared requests of the unschedulable pods. It looks at all the unschedulable pods currently in the cluster, figures out what resources those pods need in order to be satisfied, including any scheduling constraints on them, and then provisions nodes to meet those requirements.

Karpenter works quite similarly in some ways and differently in others, but it also monitors for unschedulable pods and provisions nodes in response. It is slightly different in the way it performs scale-down of underutilised nodes: it can also look at a node and decide that, if it moved all of its pods onto other nodes, or onto a different node that it brings up, it could remove that node. If it thinks, "I can replace this one big node with a smaller, cheaper node," it is capable of doing that. It supports the standard scheduling constraints for node selection as well.

In terms of workload scaling, the Horizontal Pod Autoscaler is generally the one people are most familiar with. It increases and decreases the desired replica count to achieve its targets.
You can see an example on the slide where the target utilisation of the service is lower than the utilisation of all the current pods; in that case, the HPA would scale the desired replica count up. It can scale on a number of different dimensions. There are resource metrics, CPU and memory; by default those are summed across the pod, and we'll cover that in a bit more detail when we talk about future work we'd like to do. It can also look at custom metrics that you can expose through a number of different methods, things like QPS going into pods. Finally, there are external metrics. These are not per-pod metrics: if a workload is consuming from a queue in your cloud provider of choice, for instance, you can choose to scale on that, so if you see the queue length increasing, you increase the number of replicas. You can also have configurable scaling behaviours, for example fast scale-up and slow scale-down, and you can configure that per workload. There's a small example HPA manifest a little further on, after the Balancer.

Vertical Pod Autoscaler. This aims to right-size your pods in terms of their CPU and memory requests. It looks at the historical resource usage of a pod and says, "that pod has always only used 25% of the CPU it has requested, so I'm going to decrease the amount of CPU that workload requests." It's based on resource data — CPU usage, memory usage, and OOM events as well. If you have something like a Prometheus instance that you keep having to resize, it will also watch for those OOM events and take them into account. It recommends pod sizes that keep the real usage well within the requested capacity. You can run it in what is effectively a dry-run mode, so you can look at the recommendations the VPA would make and evaluate whether you want to apply them.

Multidimensional Pod Autoscaler. This allows the combination of HPA and VPA scaling on a single workload. Historically, running the HPA and the VPA against the same resource metrics was recommended against, because it could result in odd edge cases where you ended up with a lot of very small replicas: the HPA would scale a workload out because it was using a lot of CPU, and then the VPA, once per-pod CPU started falling, would start scaling the resources of those pods down. It's also designed from the ground up to be extensible, so it will allow users to insert their own recommender. If you want to encapsulate some business logic, because you have a private pricing plan or something like that, you will be able to do that: you write your own recommender and have the Multidimensional Pod Autoscaler call it. You might have heard it referred to during a talk yesterday.

Balancer. This is intended to solve problems like: if you're running a cluster across three zones in a region, how do you ensure that pods are equally spread across those zones and remain balanced as you scale up and down? There are other business problems too. If you want to be cost-effective by using preemptible or spot instances, but you still want some of your pods on non-preemptible instances in case you start losing that capacity, how do you do that? How do you consume node types where you have a negotiated rate or discount first? And how do you do all of that while the other forms of autoscaling I've already mentioned keep working, and keep Cluster Autoscaler working well alongside it?
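Going back to the HPA for a moment, here is the small example I mentioned: a minimal sketch combining a resource metric with an external queue-length metric and asymmetric scaling behaviour. The workload, metric names, and numbers are placeholders, and the external metric assumes you have a metrics adapter exposing it.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa                  # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker                    # placeholder workload
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60      # by default, utilisation is computed against the pod's summed requests
  - type: External
    external:
      metric:
        name: queue_length          # placeholder: exposed via a custom/external metrics adapter
      target:
        type: AverageValue
        averageValue: "30"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react quickly to rising load
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # scale down slowly once load drops
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
```

The behavior block is what gives you the "fast up, slow down" pattern per workload; the HPA takes the highest replica count suggested by any of the configured metrics.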
Finally, the addon-resizer. This is the simplest project we have in the SIG by quite some way. It vertically scales a singleton pod proportionally to the scale of the cluster, and it's useful for components whose resource needs scale linearly, or potentially exponentially, with the size of the cluster — things like metrics-server, et cetera. You can use either the number of nodes or the number of containers as the metric that drives the scaling. So, yeah, those are the projects we own.

I mentioned that the SIG is now taking ownership of Karpenter. The SIG has been discussing this with the Karpenter maintainers for quite some time. It's an alternative approach to cluster and node autoscaling from Cluster Autoscaler. It has been developed in the open from the beginning, under AWS's stewardship to this point. It is deliberately vendor-neutral, however: it's been designed from the start to allow provider implementations to be developed as required. Over the past year we've been discussing how we adopt it into the SIG without creating confusion for end users. We don't want end users turning up and going, "I don't understand which of these projects I would choose — why do you have two different solutions to the same problem?" We also want to follow the principle of least surprise: where the two projects have the same behaviours, we want to make it as easy as possible for people to configure them the same way and have them work the same across both. It currently has an AWS implementation, unsurprisingly, but also an Azure implementation. That was announced on Monday, and the repo was made public yesterday, so you can go and check out Azure's implementation as well. And we've now agreed on the adoption of the project. This is the core repo: the library consumed by those implementations from AWS, Azure, and potentially other cloud providers going forward. That repo is in the process of being migrated over to Kubernetes infrastructure and SIG governance. The one thing to note here is that it is not replacing Cluster Autoscaler. Both projects will continue to be maintained by the SIG; it is an alternative approach to addressing the same problem.

There are some differences in their approaches, but there are also areas where the two projects have different mechanisms and annotations to achieve the same thing. Cluster Autoscaler is node-group focused: when it scales things up and down, it looks at a node group in its internal implementation and assumes that if it scales the group up, it will get a new node that looks like the existing nodes in that group. The majority of cloud providers are baked in — we currently have 27 cloud provider implementations in tree — as well as an external gRPC cloud provider to allow out-of-tree extension. There was a talk earlier today about how that had been used with Virtual Kubelet, but it also allows that encapsulation of business logic if you want to create a cloud provider implementation with your own logic around which nodes to provision. It also has highly configurable scaling behaviour via thresholds and scan intervals, configured primarily through flags.
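As a rough illustration of those knobs — a sketch only, since flag defaults change between releases and the provider, image tag, and values here are placeholders — the scale-down behaviour is typically tuned with flags on the Cluster Autoscaler deployment:

```yaml
# Fragment of a cluster-autoscaler Deployment spec, for illustration only.
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0   # placeholder tag
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws                       # one of the in-tree providers, or "externalgrpc" for the external provider
  - --scan-interval=10s                        # how often pending pods and node utilisation are re-evaluated
  - --scale-down-utilization-threshold=0.5     # nodes below 50% utilisation become scale-down candidates
  - --scale-down-unneeded-time=10m             # how long a node must stay unneeded before removal
  - --max-graceful-termination-sec=600         # honour pod graceful termination while draining
  - --balance-similar-node-groups=true         # keep similar node groups at similar sizes
```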
Karpenter, however, creates specific nodes. You create the custom resources that drive it, it uses those to decide what nodes to create within the constraints you provide, and it creates specific nodes rather than manipulating a node group. As I said, the core has just been released as a library to be consumed by cloud provider implementations, so we won't be publishing images of the Karpenter core; it is a library that is then consumed by those creating implementations. As I said, AWS and Azure already have those implementations.

It is also able to perform consolidation on scale-down. If all of your pods are currently on a large, expensive node, and your workload has since scaled down or been terminated, those pods can be moved onto spare capacity on existing nodes in the cluster; and if there are still a few pods that couldn't be moved, it will evaluate whether it can place them all on a smaller, cheaper node and, if so, do that for you. It also has the concept of drift detection. Those of you responsible for managing clusters might have to roll out AMI updates every month, for example; Karpenter can detect that a new version of that AMI has been released and replace the affected nodes in the background. It can also do this via a TTL: if, say, you want to rotate your nodes every 24 hours just in case, you can do that as well. There's a sketch of a NodePool configured along these lines at the end of this Karpenter section.

The next steps for the adoption: we are working on bringing the existing Karpenter processes under Kubernetes governance. Karpenter has had public office hours for a long time and has been run completely in the open, but we need to migrate that into the existing Kubernetes SIG setup to make sure we meet all the requirements of the wider Kubernetes governance. We have also agreed on work to continue standardising across the node autoscaling projects; there is a link to a doc that Jonathan from AWS has put together and agreed with the SIG, setting out the short-, medium-, and longer-term work we want to do with the Karpenter community and the SIG Autoscaling community to move that forward.
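Before moving on to the Cluster Autoscaler update, here is a rough sketch of a Karpenter NodePool configured for the consolidation and TTL-style rotation described above. The field names follow the v1beta1 API as it stands around this adoption and may change in later releases; the NodeClass reference is provider-specific and the values are placeholders.

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      nodeClassRef:
        name: default                        # provider-specific NodeClass, e.g. an EC2NodeClass on AWS
  disruption:
    consolidationPolicy: WhenUnderutilized   # repack pods and replace big nodes with smaller, cheaper ones
    expireAfter: 24h                         # TTL-style rotation: nodes are replaced after 24 hours
  limits:
    cpu: "1000"                              # cap the total capacity this NodePool may create
```

Drift replacement is then triggered when a node no longer matches these specs, for example after the image referenced by the provider's NodeClass changes.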
Okay, thank you. So, as I said, Cluster Autoscaler is not going anywhere; it keeps being developed while we are adopting Karpenter, and new features are coming there. For example, one of the long-standing problems of Cluster Autoscaler was the performance of node draining. In a cluster, some nodes may over time become less utilised than they used to be, and Cluster Autoscaler will try to compact them, meaning it moves the existing pods to other nodes and deletes the just-emptied nodes. For quite a long time this was not a parallel process: Cluster Autoscaler deleted one node at a time, and even though it tried hard to be as fast as possible, the limitation of a single thread was hurting larger clusters. In 1.26 we finally introduced parallel drain to accelerate this. Cluster Autoscaler can now delete multiple nodes at a time and do it quite quickly, while maintaining the graceful termination period for the pods and making sure that all the pods have a place to go. So far this has been optional behaviour, but with 1.28 it becomes the default. This will hopefully close a long-standing gap in Cluster Autoscaler performance, and your cluster will be drained more quickly, saving you money. In 1.27 we also added multiple improvements to this process, so if you tested it in 1.26, it should behave much more robustly and quickly in the newer releases.

Ongoing work is also about handling DaemonSets properly. While draining nodes, it is important that we remove pods in a reasonable order. On a node there might be DaemonSet pods running that provide, for example, log-shipping capability. It would be very bad if we deleted those DaemonSet pods before deleting the actual workload pods: the logs written in the very last moments of the pod lifecycle could be lost. We made some improvements there, and this is landing in 1.29 — node draining will be done based on priority, so we will make sure that every bit of logging your application emits while being shut down is correctly pushed and you will not lose any data.

One big endeavour we started in SIG Autoscaling is provisioning requests. It is an API to ask Cluster Autoscaler, or actually any other autoscaler that might be in place, for example Karpenter, to ensure that there is space for some set of pods. Currently, Cluster Autoscaler, like Karpenter, works with pending pods, so in order to trigger a scale-up you need to create the pods in the first place. That might be fine for serving workloads that are okay with getting half of the pods they need from a scale-up, but it is bad for things like machine learning, which requires all of its pods to be present in order to start its computation. Having only a fraction of them causes more harm than good: the workload will not start, and you will be paying your cloud provider for those half-provisioned resources, which raises your monthly bill. With provisioning requests we are trying to address this issue. We established an API through which you tell us how many pods, and of what shape, you would like to provision. You create a ProvisioningRequest, you put in a pod template, you put in a count, and it is Cluster Autoscaler's responsibility to let you know when the cloud provider gives the green light that the nodes needed for these pods are coming. You create the provisioning request and you wait; only when the cloud provider says "okay, I have capacity, I will give you these 200 of the most expensive GPUs" does Cluster Autoscaler change the status of the provisioning request, and then you know you can start your batch workload, because the capacity will be there.

There is a question of how, and what, we will actually be provisioning, and this is defined by a thing called the provisioning class. Right now there are three classes that exist or are coming shortly. The first one just asks Cluster Autoscaler whether it currently has the capacity for a particular set of pods; Cluster Autoscaler will answer yes or no based on what is already there, without trying to scale anything up. Then we will have a generic scale-up class that will do its best to give you some level of atomicity and cost control with the scale-up. However, this type of provisioning is not integrated with any of the goodies your cloud provider might have, for example the ability to ask directly whether it could give you 100 nodes or not. So why do we have it? Because, as Guy said, we have 27 cloud providers on board, and it will take time for them to integrate with provisioning requests and modify their integrations so that this is handled in the best possible way. In the meantime, we will do our best to give you some level of atomicity and some level of control over provisioning.
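As a sketch of how that API is shaping up — the group, version, class name, and field names here reflect the proposal around the time of this talk and may shift, and the PodTemplate name is a placeholder — a capacity-check request for 200 identically shaped pods might look something like this:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: ProvisioningRequest
metadata:
  name: training-job-capacity
spec:
  provisioningClassName: check-capacity.autoscaling.x-k8s.io   # "is there room right now?" class
  podSets:
  - count: 200                         # how many pods of this shape are needed at once
    podTemplateRef:
      name: training-worker-template   # a PodTemplate in the same namespace describing the pod shape
```

Cluster Autoscaler then updates the request's status, and the workload controller waits for the capacity to be confirmed before actually creating its pods.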
For Google Cloud, we already have this type of functionality, where you can ask for resources and get them atomically. We are trailblazing this path with Google. However, we will be more than happy to accept contributions from any cloud we are integrated with, so that users of Cluster Autoscaler everywhere get batch- and AI/ML-friendly behaviour in their cluster autoscaling. And because we now have Karpenter on board, hopefully this functionality will also come there.

Batch workloads are not the only use case, either. It is also useful for making scale-up faster. If you already know that you will be creating 10,000 pods for whatever reason, you can issue a provisioning request while creating the pods; the provisioning request lets the autoscaler know that pods are coming in that quantity, and Cluster Autoscaler can do a single scale-up instead of a step function of scale-ups, which would take more time.

There are also a couple of minor improvements coming to Cluster Autoscaler. It will no longer insist on computing a full scale-up for every pending pod when that calculation would take too long. Often Cluster Autoscaler is bombarded with a huge number of pods, all of the computation and simulation takes too long, the liveness probe fails, and because of that Cluster Autoscaler is killed and restarted, getting into an unpleasant loop. Now we have more control over execution and we limit how much computation is done in a single loop, so that we don't break under pressure. Maybe the very best decision will not always be made, but we will keep moving forward, and Cluster Autoscaler will keep working even with huge numbers of pods coming in.

Okay. The Vertical Pod Autoscaler finally got its 1.0.0 release after a couple of years. New features include control over how we bump memory requests after we notice out-of-memory events, and the ability to specify that you would like CPU recommendations in integer increments, so that you only ever get a whole number of CPUs. By default, the Vertical Pod Autoscaler tries to give you a number as close to the actual usage as possible; sometimes that results in fractional values, which are fine for most users, but there are use cases where you want an integer number of CPUs. We have a couple of additional efforts in progress. One of them is controlling VPA eviction behaviour based on the scaling direction — for example, only scale up and never scale down, always consuming a little more but suffering fewer disruptions. That didn't make it into 1.0.0, so it should get into the very next release. The second thing, which we have been dreaming about for quite a long time, is using in-place pod updates. Previously, whenever we wanted to change a pod's requests we had to restart the pod, which was okay for some workloads, but some workloads cannot tolerate it and it was very disruptive. With in-place pod resize support finally getting into Kubernetes, we hope to implement it in the Vertical Pod Autoscaler as well.

Okay, so future work. If you want to help us, or you are interested in what is coming next: container resource metrics for HPA scaling should shortly move to GA; we should finally have scaling up from zero in the HPA based on external metrics; we hope to land the in-place update support in the VPA; and we want to finally land a Multidimensional Pod Autoscaler implementation so that it is available for you to try.
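To tie the VPA pieces together, here is a minimal sketch of a VerticalPodAutoscaler running in recommendation-only mode — the dry-run style evaluation mentioned earlier — with the workload name and bounds as placeholders:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa               # placeholder name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                 # placeholder workload
  updatePolicy:
    updateMode: "Off"            # recommendation only: inspect status.recommendation, nothing is evicted
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledResources: ["cpu", "memory"]
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "4"
        memory: 8Gi
```

Once you are happy with the recommendations you can switch updateMode to "Auto"; the newer 1.0 features, such as integer CPU recommendations, are additional knobs layered on top of this basic shape.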
Okay, thanks. Are there any questions? Before we get to questions: we have a weekly SIG meeting every Monday at 8 a.m. Chicago time, which if I remember rightly is around 4 p.m. European time, and you are more than welcome to come. Let us know about your problems with autoscaling and your ideas — we are there every week to help you with your autoscaling issues. Thank you. And we have a mic over there for questions.

Hello. I have a question about the difference between Cluster Autoscaler and Karpenter. Is there any clear difference in use cases between them — how do we know when to use which one? We are using Cluster Autoscaler currently; should we move to Karpenter or stick with Cluster Autoscaler? How do we decide?

First question: what cloud are you on? — GCP. — Right now there is no implementation of Karpenter for GCP, so Cluster Autoscaler is your way to proceed. In future maybe there will be a Karpenter implementation for GCP, but right now there isn't one. Karpenter currently has two integrations: AWS, which has been there from the very beginning because the project was started by AWS, and the just-announced Azure implementation. So if you are not on either of those clouds, Cluster Autoscaler is the way to go. If you are on one of those clouds, then you have to consider how you want to manage your nodes and what functionality you want. Do you want node groups, with the behaviour that has been in Kubernetes for a while, where you have groups and scale the groups up? Or do you want completely customised nodes and more control over what is being created and when? Then maybe Karpenter is the choice. However, we are trying to make these projects converge to some extent, so that it becomes easy to switch from one to the other, assuming they both provide an implementation for your cloud. We are aligning the APIs, we are aligning annotations, we are aligning the naming — what do we call scale-down: is it compaction, defragmentation, or something else. So we are trying to make it easier for you to understand the differences, and easier to switch between them to better fit your needs at a given time. Eventually it is a choice of how you want to manage things; hopefully it is not a choice of one being worse or better than the other. We have both the newly adopted project and the old one, we will be developing both, and it is for you to choose how you want to manage things — that is the main difference. In principle, both projects do more or less the same thing: they create nodes for pending pods and delete nodes that are not used. — Right, thanks. — You're welcome.

So, with Cluster Autoscaler and, I think, Karpenter, in order to determine how to scale up they need to run simulations of the kube-scheduler. That gets complicated when you are running a custom scheduler, or you have extended the kube-scheduler with custom configuration or custom plugins. Do you have recommendations on how to use Cluster Autoscaler, and maybe Karpenter, with a custom scheduler? — It depends on how custom your custom scheduler is. If it is doing completely different things from the regular scheduler, then you have a problem, because right now we cannot support completely arbitrary logic and placement. We do simulations to be sure that, after we create a node, the pods will actually go there.
If you have completely custom logic, then I do not immediately see how we could support it. We cannot talk to a custom scheduler, because the API, first of all, would be complicated, and we do so many simulations on scale-up and scale-down that the communication between two processes, possibly running on different nodes, would kill the performance on bigger clusters. So unfortunately, for now the answer is: try to stay as close to the regular kube-scheduler as possible. You can have different node-ranking functions, but try to keep the basic constraints on whether a pod fits on a node or not aligned with the kube-scheduler. If you stay as close as possible, then it probably will not matter that you have a custom scheduler running together with Karpenter or Cluster Autoscaler. If you stray too far, then one component will think differently from the other and you will get into various kinds of problems.

Thank you both so much for the talk. In-place vertical pod autoscaling is something we're very excited about. Can you talk a little more about the status of that project and where support could be helpful? — We have a KEP that defines what we will be doing, and we have some people working on it, so it depends on when those people finish. If you have a strong interest, you are more than welcome to come to the SIG and help us with implementing it. — Just to follow on from that: the wider work on the node side is progressing, but that part is not ours — our work is to take advantage of it. I think the actual in-place node update feature is still alpha, the last I heard, so you would need a cluster that supports that behaviour to take advantage of the VPA implementation of it. — Gotcha. Thank you.

Big fan of your work, first off. My question covers the HPA and the VPA, and also falls in line with the multi-dimensional autoscaler. Do you have any use cases where people have used both of them on metrics like CPU and memory? I noticed the GitHub page says not to use the VPA with the HPA if you are scaling off those resource metrics. Right now we are at a stage where we want to optimise resource usage and not over-provision, so would there be a case where you would put the VPA in dry-run mode to get an idea of the resource requests to use and leave the HPA on, or does the HPA skew the metrics the VPA looks at?

So the problem with running the HPA and the VPA on the same set of metrics is that by default the HPA uses a percentage target. You specify that you want your pods to be utilised up to, let's say, 50%: if utilisation goes above that, you create more pods; if it goes below, you reduce the number of pods. The VPA changes the size of the pods, so now that 50% means something slightly different. And if by any chance it grows the pods too much — for example, beyond the scalability limit of your application — then you may always be below 50%, even though your pods are trying very, very hard to process the requests. What happens if you are always below the target? Your deployment shrinks, and it will shrink down to its minimum size, because it will always appear to need fewer pods. So in order to run them together, you should use something that does not couple the VPA and the HPA that closely. If you want to use CPU, you could try using an absolute value target. The absolute value is then largely independent of the size of the pods, because as long as the pods' requests are bigger than the absolute value, it is okay. But you need to be very careful with that, because there are a couple of edge cases there.
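As a rough illustration of that decoupling — a sketch only, with the workload name and numbers as placeholders — an HPA targeting an absolute per-pod CPU value rather than a utilisation percentage looks like this:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa              # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                # placeholder workload
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: AverageValue      # absolute per-pod usage, independent of whatever requests the VPA sets
        averageValue: 500m
```

Because the target no longer references the request, a VPA resizing the pod's CPU request does not silently change what "scaled enough" means, though as the answer below notes, a workload-level metric such as QPS is usually the safer pairing.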
We generally recommend using slightly different metrics for the HPA — for example, the number of incoming queries per second, which of course correlates strongly with CPU usage but is more independent of it — and then the strange interactions between the HPA and the VPA are much less likely to happen. — Thank you so much.

Thank you for the work and thank you for the talk. I have a question from an end-user perspective: how should they think about whether to use this autoscaling versus some of the serverless offerings from the cloud, like GKE Autopilot or Fargate? — I cannot speak for Fargate, but I can speak for GKE Autopilot. GKE Autopilot uses Cluster Autoscaler underneath, so it solves the problem of configuring Cluster Autoscaler for you. If you want more control over your cluster, you use classic GKE; if you want Google to handle your autoscaling for you, you can use Autopilot. Eventually, both GKE and Autopilot will run more or less the same autoscaling code. Autoscaling will happen for you either way, but with Autopilot you have one problem less, and with classic GKE you have one more knob to control. — My own experience is more on the AWS side. With Fargate, you are always giving up some control for those serverless offerings, and certainly in our use case we need some of the control we would be giving up, so for most of our workloads Fargate isn't suitable. There are certainly workloads — and I've talked to other users — that would be able to use those offerings. There are always going to be trade-offs, though. It is a matter of evaluating, for your workloads, whether you can tolerate the limitations of a given serverless offering, and whether it will actually result in cost savings. — Thank you.

It looks like we don't have any more questions. If you want to ask us questions privately, we will be hanging around for a moment. Otherwise, thank you for joining, and hopefully you will have a great last day of KubeCon tomorrow.