Hi everyone, welcome to our session on group-less autoscaling with Karpenter. My name is Prateek Gogia and I'm here with Ellis Tarn. We both work in the EKS scalability team at AWS. We have been working on the Karpenter project for the last few months, and today we will be talking about the current autoscaling landscape in Kubernetes, how Karpenter works, and then a quick demo.

There are multiple ways a pod can be created in a cluster. Users can use a CLI tool like kubectl, there could be a cron job running, or there could be an autoscaling tool such as HPA, KEDA, or Knative adding these pods into the cluster. Once these pods are added, they need resources to be able to run. These resources can be added either by manual effort or by an autoscaler. Capacity planning is hard, and manual scaling can be slow, error-prone, and cause significant delays compared to an autoscaler. Autoscaling has clear benefits over manual scaling, and today Cluster Autoscaler has become the de facto standard for node autoscaling in Kubernetes.

So what are some of the benefits of Cluster Autoscaler? As pods scale up, load increases in the cluster. Cluster Autoscaler can request more capacity from the underlying provider, and when the load reduces, it can also request that the provider terminate these instances, which in turn helps users reduce their infrastructure cost and removes the burden of capacity planning. There are other advantages: Cluster Autoscaler is vendor-neutral, so it has support for all the major cloud providers; the approach is battle-tested and widely adopted; and it works great for cluster sizes up to 1,000 nodes.

So let's take a look at how Cluster Autoscaler works. Cluster Autoscaler is built around the concept of node groups, which is an Auto Scaling group, or ASG, in the case of AWS. It looks for pending pods and sets the desired count of instances on a node group based on the number of pending pods. Once the underlying provider provisions the capacity based on the desired count, the kube-scheduler places these pods onto the new nodes being created. The node group plays a critical role here by maintaining the desired number of nodes requested by the autoscaler.

So let's try to understand what a node group is. A node group is a logical grouping of nodes which can be autoscaled and managed together. Cluster admins can create multiple node groups, specifying instance type options and some other configuration, and also set a minimum and a maximum node count. Node groups are not self-scaling; an autoscaler scales them by setting the desired count.

There are some challenges with this approach, though. Cluster Autoscaler assumes that the instance types are all identical within a given group, so if you want to use multiple instance types, you need to create separate groups. Furthermore, if you need to spread pods across AZs, multiply the number of groups by each AZ. Some of the other challenges: if there is an error creating a node in a node group, Cluster Autoscaler will not know until it times out, and this causes delays for an application while a pod stays pending. And there are users creating clusters larger than 1,000 nodes today. There have been some efforts from the community to address these challenges, but unfortunately they are either deprecated or not maintained. One such effort is Cerebral, which was created because Cluster Autoscaler was slow.
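To make the node-group proliferation concrete, here is a hypothetical eksctl-style configuration: covering just two instance types across two availability zones for a single scheduling profile already takes four groups, and every extra dimension (capacity type, more zones, more sizes) multiplies that count. Names and values are illustrative.

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo            # hypothetical cluster
  region: us-east-2
nodeGroups:
  # One group per (instance type x availability zone) combination.
  - name: m5-large-2a
    instanceType: m5.large
    availabilityZones: ["us-east-2a"]
    minSize: 0
    maxSize: 10
  - name: m5-large-2b
    instanceType: m5.large
    availabilityZones: ["us-east-2b"]
    minSize: 0
    maxSize: 10
  - name: c5-xlarge-2a
    instanceType: c5.xlarge
    availabilityZones: ["us-east-2a"]
    minSize: 0
    maxSize: 10
  - name: c5-xlarge-2b
    instanceType: c5.xlarge
    availabilityZones: ["us-east-2b"]
    minSize: 0
    maxSize: 10
```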
Another community effort is Escalator, which is batch-workload specific and works on a selected set of node groups in tandem with Cluster Autoscaler to fix some of the challenges we just discussed. Another interesting approach is from Zalando, which tries to improve and build upon Cluster Autoscaler. They improved how Cluster Autoscaler generates template nodes for groups to decide what kind of node to expect when scaling up, they added more reliable backoff logic when nodes fail to scale up, and they made some additional changes, some of which have been upstreamed into Cluster Autoscaler to make it more robust and production-ready. But one thing to observe here is that all of these different approaches are still working around node groups. Keeping these challenges in mind, I would now like to hand it over to Ellis to talk about whether we can do any better in terms of autoscaling.

Thanks, Prateek. So we spent a fair amount of time in this node autoscaling space trying to understand the root cause of all these issues. Some of them were just code bugs in our cloud provider implementation — things that we could fix, and things that we did fix. Sometimes it was just customer configuration, since these things are really hard to get right; it's really easy to misconfigure node autoscaling, and that's one of the things we decided we really wanted to make easier. And finally, part of it was just architectural. By nature of having node groups, you take on some assumptions, and that was where we decided a lot of the inefficiencies were coming from.

So we really started to play with this idea: what would happen if you removed the group? What if we provisioned capacity directly? The idea is that when some pending pods come in, you can compute the resource requirements of all those pods just like the cluster autoscaler might. But instead of simulating against an existing set of static templates of nodes that could be created, you just ask EC2 for exactly the instance that you want. We know all of the instance types, so we can simulate those pods and pick an m5.large or any instance that makes sense for them, whether or not it's been seen before. And the other node properties, like the availability zone and the operating system, we can compute dynamically as well. So we've shifted from this static-template world to a dynamically generated template world, which reduces the configuration complexity pretty significantly. Once we've decided what instance we want, we just tell our cloud provider: please give us this instance.

Of course, one of the value props of groups is that they track nodes, so you can roll out a change or delete them all at once. So we decided to track nodes using Kubernetes labels instead, which we thought was a bit more Kubernetes-friendly and native to the other tooling that you're using.

The result of all this is a pretty significantly reduced configuration burden. We've seen customers deploying over 50 ASGs in production, which is the cross product of instance type, capacity type, availability zone, and scheduling properties — it can blow up really quickly. You just don't need any of those anymore; it's all done dynamically. Also, the API load is significantly reduced, since for every single one of those ASGs you have to keep asking the cloud provider: what is the current status of this group?
What is its desired count? Those API requests really stack up; you can run into API limits and the whole system grinds to a halt. So instead, we just make calls when we're creating and deleting capacity, and otherwise we rely on the Kubernetes API server, which can support much higher throughput. It also reduces the simulation complexity: we no longer have to look at such a large number of templates. We can collapse the simulation space to just instance types, because we can prescribe the availability zone and the scheduling properties, which really simplifies things. For example, instead of looking at all the possible groups and asking which one has this zone, we can just tell our cloud provider to run an instance in this zone.

So we ended up with this architecture, which should seem fairly familiar if you know the cluster autoscaler. We start at the top left, where you have pods that were created and haven't been scheduled yet. The kube-scheduler gets the first crack at it: it looks at the existing capacity and the pending pods and tries to find places to schedule those pods. If for some reason the resources aren't available to schedule them, the kube-scheduler sets a status condition on the pod marking it unschedulable. And that's when we kick in. We have a process called the allocator that looks at pods with that unschedulable status condition, and then we attempt to create new capacity that can fit those pods. We try to make sure that capacity is packed as efficiently as possible and that we can bring it up as quickly as possible, and then that capacity joins the cluster. Your existing capacity plus your provisioned capacity is now your new cluster state. Over time you can get holes in the capacity — inefficient utilization — so we have another controller that looks globally at the cluster and attempts to optimize it over time. So where we end up is two controllers: one is a fast-acting, latency-sensitive controller, the allocator, and the other is a slow-acting, cost-sensitive controller called the reallocator. Together, these work to schedule pods as quickly as possible and to optimize the resource utilization of your cluster.

So, there are a couple of ways to launch capacity directly — well, there are many ways to launch capacity in EC2, at least. Previously we were relying on an ASG: we gave it a template and an instance count, and it was responsible for launching all of those instances. Now we've removed the ASG, so we want to use EC2's direct capacity launch APIs. There are two of these: one is called Fleet and the other is RunInstances. They're roughly the same; we could actually use either of them, and I imagine other cloud providers might have multiple options as well — we know that every cloud provider has the ability to run instances in general. But for our use case and for our cloud provider, we selected the Fleet API, because we think it gives us the most flexibility. For example, when deciding what node we want to create for a given set of pods, we don't actually limit ourselves to one availability zone and one instance type. This is specific to our cloud provider implementation — you could have another implementation that selects a specific instance type in a specific availability zone — but we can actually get some additional value by being flexible.
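To make that flexibility concrete, the shape of such a fleet request is roughly the following, sketched in the YAML form accepted by the AWS CLI v2's `--cli-input-yaml` for `ec2 create-fleet`. The launch template name and subnet IDs are illustrative, and this is not the exact request Karpenter builds.

```yaml
# Rough shape of an EC2 CreateFleet request of type "instant" (values illustrative).
Type: instant
TargetCapacitySpecification:
  TotalTargetCapacity: 1               # we still want exactly one node
  DefaultTargetCapacityType: on-demand
LaunchTemplateConfigs:
  - LaunchTemplateSpecification:
      LaunchTemplateName: karpenter-demo   # hypothetical launch template
      Version: "$Latest"
    Overrides:
      # Any of these (instance type, subnet/zone) combinations is acceptable;
      # EC2 picks the cheapest one that is actually available.
      - InstanceType: m5.large
        SubnetId: subnet-aaaa1111          # us-east-2a
      - InstanceType: m5a.large
        SubnetId: subnet-bbbb2222          # us-east-2b
      - InstanceType: m4.large
        SubnetId: subnet-aaaa1111
```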
So we tell EC2 that these pods will fit on any of these instance types and, given their topology constraints, they'll run in any of these availability zones. By giving it that flexibility, EC2 can actually pick the cheapest instances given the constraints — the more flexibility you give EC2, the better a job it can do picking the right instance type for you. It's almost like a RunInstances request that's been expanded and made more flexible, but at the end of the day you are still getting one node that those pods get packed onto. So we get massively increased node flexibility here, and we get this really nice failover behavior if a particular instance type is out of stock.

It also pretty significantly reduces node provisioning latency. Systems like Auto Scaling groups and other group systems carry a little bit of overhead: there's some reconciliation loop happening inside the implementation, and maybe it's really fast, or maybe it's not. We found that with ASGs it was adding about 30 seconds from the time you increment the node count to the time the node is brought up. So by removing that and asking EC2 to create the fleet directly, we get instances 30 seconds faster, which is a pretty big deal.

Scheduling is also critical to the provisioning workflow. Like I mentioned earlier, we work in tandem with the kube-scheduler: we give the kube-scheduler the first shot, and when it fails, we kick in afterwards. But one of the corollaries to this is the realization that provisioning decisions are scheduling decisions. By picking a new node for pods to schedule onto, you are making a scheduling decision whether you like it or not — it's unavoidable. And one of the things we were able to do differently, because we're creating the capacity directly, is actually create the node objects ourselves. We make the create-instance request, we get an instance ID back, we're able to create the node object immediately, and then bind the pods immediately. So we've enforced a scheduling decision outside of the loop of the kube-scheduler. That's a fairly interesting pivot from the way the cluster autoscaler works, which is to bring up the new capacity and then wait for the kube-scheduler to do the scheduling. We decided that if we're already making a scheduling decision, we might as well enforce it, and there are some benefits to that. For example, because we've already bound the pod, the second the node comes online it can start pulling the image — you already know the binding decision, so you don't have to wait for the node to become ready or for the kube-scheduler to schedule it. So that was a nice dual optimization: we enforce that the decision we made actually happens in the cluster, and we also get about five seconds of latency improvement, which is quite nice.

The other thing we realized is that the cluster autoscaler is limited in its version compatibility. It copies the kube-scheduler code directly into its code base and uses it, and because of this there's a recommendation that the kube-scheduler version in the cluster autoscaler's code base must be the same as the one you have running in your control plane.
So this creates a tight version coupling between those two things, because they're not enforcing the binding decision. By making the binding decision ourselves, we free ourselves from this version compatibility: the kube-scheduler is its own system with its own set of responsibilities, and the Karpenter allocator is the same — it has its own responsibilities. So you no longer have this tight coupling, and versions of Karpenter can work across many versions of Kubernetes.

Another corollary here is scheduling API conformance. By nature of writing a scheduler — or writing logic that decides which new nodes to provision — we have to support all the common things like resource requests, selectors, affinity, and topology. This is just a burden we take on as part of working in this problem space. And specifically, the resource requests can get a little complicated. I'm sure you're familiar with running GPUs or other custom resource types: your cloud provider effectively needs to know how to launch instances that provide those particular resource types. So there's an ongoing burden on the cloud providers to ensure that the particular properties of the instances the pods are requesting can actually be launched. We built a bunch of machinery that's vendor-neutral and supports node selectors and all of the non-vendor-specific scheduling constraints, but there is somewhat of an ongoing maintenance burden as new types of resources enter the scene.

We also operate based on this concept of well-known labels. For example, this is how a pod communicates that it needs to run in a particular zone or on a particular instance type. If, in your pod's node selector, you use topology.kubernetes.io/zone and state a zone, that's a signal to Karpenter that the node it provisions must be constrained to carry a label for that zone. So we use these well-known labels to apply additional constraints. There's also a set of defaults, so you don't have to define these in all cases; but if you do need to apply additional constraints, you do that via node selectors and well-known labels.

Once we've sorted out our scheduling, you end up with groups of pods. Imagine you have 100 pods that need to be scheduled, and some of them have one node selector and some have another. You end up with what we're calling constraint groups: pods that can be scheduled together according to their scheduling constraints. But we still have another problem, which is: how do we pick the instance that can fit all of those pods? We sum up all of the resource requests of the pods and then we do what's called bin packing. This is a classical computer science NP-complete problem, and there are a couple of variants. The one this maps to is called online bin packing, which means we're looking at a subset of the overall pods in the cluster and attempting to pack them into the existing capacity. There's a bunch of research in this space and we're actually super excited about all the innovation we can do here.
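To make the well-known labels and constraint groups concrete: a pod that must land in a particular zone on a particular instance type expresses that through its node selector, using the standard upstream label keys (the pod itself is illustrative). Pods with compatible constraints form a constraint group, and their resource requests are summed before bin packing picks the instance size.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: zonal-worker               # illustrative name
spec:
  nodeSelector:
    # Well-known labels: signals that the provisioned node must carry these labels.
    topology.kubernetes.io/zone: us-east-2a
    node.kubernetes.io/instance-type: m5.xlarge
  containers:
    - name: app
      image: public.ecr.aws/eks-distro/kubernetes/pause:3.2   # illustrative image
      resources:
        requests:
          cpu: "1"        # requests are summed per constraint group
          memory: 1Gi     # before the bin-packing step sizes the node
```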
The algorithm we've picked right now is called first-fit descending. What that basically means is we sort the pods by size and try to pack them biggest to smallest. First-fit descending has a nice effect: as you go down the descending list, if you ever can't fit a pod you set it aside and go smaller and smaller, and the smaller pods tend to fit into the gaps created by starting with the largest ones. There are actually some great analogues here to real container shipping and how containers are packed in that world. But the main idea is that you take a set of pods that can all be scheduled on a set of nodes together — they have the same scheduling constraints — and the result is a set of nodes that can fit those pods. You might ask: how do you order the pods? Because it's not a single-dimensional problem. There are a couple of approaches: you can prioritize CPU, or memory, or use the Euclidean norm, which is the square root of the sum of the squares. We found that the Euclidean norm is a best-effort way of weighting memory and CPU roughly equally when sorting, and it tends to work fairly well for bin packing. But like I said, we're really excited to innovate in this space and figure out better algorithms — we've talked a little about genetic algorithms, which could be really interesting, and there's a whole field of research here.

We're not just responsible for bringing up nodes, we're also responsible for terminating them — what goes up must come down. So we have another controller, or rather it's part of the reallocator controller, that looks at nodes that don't have any pods scheduled to them anymore and applies a TTL. If those nodes don't get any pods scheduled to them for a period of time, which defaults to five minutes, they are eventually scaled down. We also respond to other events that might cause instances to be terminated. For example, when EC2 notices that your virtual machine is unhealthy for some reason, it sends an event; we respond to those, drain the node as quickly as possible, and try to heal the situation. Similarly, there are a couple of other signals called spot rebalance and spot termination. Rebalance is a heads-up from Spot that your node is going to be preempted, so you can do your best to drain and reschedule everything. Termination happens after the rebalance. You're not guaranteed to get a rebalance recommendation, but you typically do; you are guaranteed to get a termination signal, and that termination signal means you have two minutes to drain. So ideally we've managed to drain the node and reschedule everything by the rebalance, but if things happened too quickly or something is getting gummed up, it has to be done before that node termination.

The other thing we've realized is that you can apply a node TTL on a long timeframe. For example, there's some value in saying: I want this node to live for 90 days. This is super useful in the upgrade case.
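As a rough sketch of how these two time-to-live behaviors might be expressed in the provisioning configuration — the field names here are illustrative and have varied across Karpenter versions:

```yaml
spec:
  ttlSecondsAfterEmpty: 300        # scale a node down ~5 minutes after its last pod leaves
  ttlSecondsUntilExpired: 7776000  # replace every node after ~90 days, e.g. for upgrades
```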
As long as our allocator is creating nodes on the latest Kubernetes version, and as long as the nodes only live for 90 days, your cluster is basically always up to date, given the Kubernetes three-month release cycle. There are some other compliance use cases where you might care about how long a node lives — maybe you want to roll your fleet every month or so. But this is another feature we built in that we think will be super useful for upgrades, for weird memory leaks, and for compliance cases. And of course, you can choose not to use the TTL, and your nodes will live forever.

Previously I mentioned that we will only drain nodes that are empty — that don't have any pods on them. But you could also imagine that over time most of the pods on a node get deleted, but not all of them. You can end up with what we call fragmentation, where you have nodes that are really poorly utilized, and this is where the reallocator process kicks in. In addition to the really easy case where we just scale down an empty node, we can look at poorly utilized nodes and ask the question: is there a better selection of nodes for this cluster that we could replace the current nodes with, and that would pack these pods more efficiently? This is actually a really challenging problem, because pods are immutable — they can't be moved — so you have to be really careful about deleting a pod, waiting for the new one to be created, and then enforcing your repacking. This is still very much in the design phase; we're still trying to figure out how to do it. But we're committed to the idea that you shouldn't be stuck with the initial state that things were provisioned in — you should have another controller that's always monitoring and always trying to improve the situation. So we're exploring different algorithms here.

Also, one of the most important pieces of this problem is one that doesn't come up very often, but we realized is absolutely critical: how quickly are these nodes launched? There's a whole bunch of steps, all listed here, in the critical path of a node coming online, and right now in AWS that's about two minutes — in other clouds I think it's actually fairly common for it to be about two minutes as well. When it's this slow, you end up with a use case where you actually want to over-provision: you can't wait those two minutes, so you'd like extra nodes in the cluster so that pods can be scheduled more rapidly. We've decided to really attack this problem — we think we can get it down to maybe 50, 30, or even 15 seconds if we do a ton of work — and we're really excited to go after this latency, which we think will mitigate the need for over-provisioning. Of course, there are going to be use cases where you do need to over-provision, which we recommend doing at the pod level. And in the very worst case, if you can't over-provision at the pod level and you can't tolerate the latency at whatever we've optimized it to at that point in time, we recommend that you just statically provision that capacity. We're always open to new ideas, and we've actually had some discussions about a feature for leaving headroom, but that's not fully baked yet.

I also want to mention quickly that we do all of this configuration using Kubernetes custom resources, which is the modern, standard way to write controllers in Kubernetes. We found that using command-line arguments and a single global configuration is kind of untenable, so all of these configuration parameters — the taints, the labels, the cluster that new nodes connect to — are encoded in a custom resource.
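A minimal sketch of what such a custom resource might look like — the kind, API version, and field names here are illustrative and not the exact schema of the version shown later in the demo:

```yaml
apiVersion: karpenter.sh/v1alpha1          # illustrative API group/version
kind: Provisioner
metadata:
  name: default
spec:
  cluster:
    name: karpenter-demo                   # the cluster new nodes join
    endpoint: https://EXAMPLE.gr7.us-east-2.eks.amazonaws.com
  labels:
    team: platform                         # applied to every node it creates
  taints:
    - key: example.com/dedicated           # hypothetical taint for dedicated nodes
      value: "true"
      effect: NoSchedule
  instanceTypes:                           # optional: limit what may be launched
    - m5.large
    - m5.xlarge
  zones:
    - us-east-2a
    - us-east-2b
```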
This project is, of course, open source and vendor-neutral. It's currently incubating in the AWS Labs GitHub organization, and we're excited to figure out where it's going to go after that. We have a roadmap that you should come check out, with our 0.2, 0.3, and 0.4 releases planned; we're planning to execute at about the rate that Kubernetes does, releasing as Kubernetes releases. We also have a biweekly working group where people come from SIG Autoscaling, from our customer base, or really just out of curiosity, and we talk about the project, what we could do to make it better, prioritize features, and do design reviews. So if you're interested, we'd love to see you drop by — there's a QR code here. And now I'll hand it off to Prateek, who's got a demo for us.

Thanks, Ellis. So let's go ahead and see some of the concepts that Ellis explained in action. I have already provisioned an EKS cluster; it's running a single node, and I have also deployed Karpenter onto this one node. So at the moment we have Karpenter running, some of the components Karpenter needs, and some of the system pods. As a next step, let's keep a watch on all the nodes in this cluster.

At this point we are ready to configure Karpenter to be able to talk to the API server. These are the configurations we need to provide to Karpenter for that, and the other configurations I have here are optional: we provide a list of instance types to limit which instance types Karpenter can use when requesting capacity, we limit the zones in which Karpenter can create capacity, and a TTL-seconds value tells Karpenter how long to wait after it detects an underutilized node before terminating it. So let's go ahead and create the CRD object which Karpenter can watch; using all this information, Karpenter will be able to talk to the API server.

At this point we are ready to add some workload into the cluster. But since we have this one node tainted and it's only running system-critical components, no other workload can be scheduled on it, so we will need additional capacity added to the cluster. Let's go ahead and add a deployment. This deployment contains a single container, the replicas are currently set to zero, and it uses a pause image with a CPU request of one core. Before we start to scale up, let's keep a watch on the pods that get created as part of this, and let's also tail the Karpenter logs to see the actions Karpenter is taking.

Now we are ready to scale up our workload, starting with a single replica. Once we create the replica, this pod is going to go into the Pending state because there is not enough capacity in the cluster, and Karpenter will request more capacity from the underlying provider, which is AWS in this case. So as we scaled up, the pod went into the Pending state, Karpenter detected that there is one pod pending, requested a node from EC2, and bound this one pod to that node.
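For reference, the deployment being scaled here looks roughly like this — the name and image path are illustrative; the important parts are the one-core CPU request and the replica count we change during the demo:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate                    # illustrative name
spec:
  replicas: 0                      # scaled up and down during the demo
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2   # pause image; does no real work
          resources:
            requests:
              cpu: "1"             # one core per pod drives the capacity decisions
```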
So let's give it a few seconds — this is the node which is coming up, and in another few seconds we should be able to see the version, instance type, zone, and architecture. Here it is: we are running on an xlarge in us-east-2c on amd64. As soon as the node comes up, the container starts getting deployed and will be running in a few seconds. So in about 49 seconds we had the container running on this node, which was requested by Karpenter because there was a pending pod.

Let's try to add two more pods to the same deployment and see what happens. Once we add the two additional pods, because this xlarge node had sufficient capacity, these pods went straight into the Running state and were never seen by Karpenter. Let's take it a step further and deploy some additional pods. Once we add four more pods, Karpenter detects all four and requests another instance from EC2 for them, and this is the EC2 instance which is coming up. Similarly, in the next few seconds, as the kubelet comes up, we should see the version, instance type, zone, and architecture, and these pods should go into the ContainerCreating state. At this point the image for these pods is being pulled, and once the image is on the node they should go into the Running state. The difference to note here is that this time Karpenter selected a 2xlarge, because we needed four cores for the four pods we were creating, and the zone it selected was us-east-2b. So at this point we have all seven pods running.

Let's also go ahead and check how bin packing works in Karpenter. This time I'm going to add 100 pods to the same deployment. Once these pods go into the Pending state, Karpenter detects that there are 55 pending pods and tries to bind all of them onto the same node; then it detects another 35 pods and tries to bind those 35 onto a second node it requested. The way this works is that Karpenter runs a reconciler loop every 10 seconds checking for pending pods. In the first reconciliation loop it detected 55 pods, and in the second it found 35. For the 55 pods it requested the larger instance type, a 24xlarge, and for the 35 it requested a 12xlarge. We can verify this: the node ending in 97 is a 12xlarge and the node ending in 174.139 is a 24xlarge. As these instances come up, we can see the pods getting into ContainerCreating, and some of them are already going into the Running state. Let's give it a few seconds and all of these pods should go into the Running state. We can quickly check how many are running out of 100 — we have 95 — and after a few more seconds we reach 100. So we have 100 pods, and we can watch the pods again. It took about two minutes for all of these pods to be created across two nodes, a 12xlarge and a 24xlarge.

Next, let's see how scale-down works. Let's say the load has decreased and the replica count is now set to zero. At this point, all the pods are being terminated across all four instances that we just created. Once all the pods are completely removed, Karpenter will detect that these nodes are underutilized, wait for 10 seconds — the TTL seconds we configured earlier — and then remove these nodes by draining and terminating them. So the nodes start terminating one by one; two of them have been removed, and we are left with two more nodes, the 12xlarge and the 24xlarge. Now all the pods have been terminated, Karpenter has already removed three nodes, and we are just waiting for the last node to be removed from the cluster. These nodes are also deleted at the EC2 level, so they are being terminated in AWS as well. So now we are back to where we started, with a single node and no workload at all.
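The last part of the demo repurposes this deployment to request ARM capacity. The only change needed is a node selector on the standard architecture well-known label — roughly this pod-template excerpt from the deployment above (image path illustrative):

```yaml
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64   # well-known label; a matching ARM node will be provisioned
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2   # multi-arch pause image (illustrative)
          resources:
            requests:
              cpu: "1"
```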
Let's say, as a use case, you want to schedule some of your pods on a different node architecture — say you want to use arm64. Let's repurpose our deployment and try to run its pods on arm64 nodes. Keep in mind that we never created any Auto Scaling group or node group for a particular architecture or instance type. Let's go ahead and create two replicas for these pods, and they should come up on an arm64-architecture node. They both went into Pending as expected, Karpenter detected two pods which need capacity, requested a node from EC2, and bound the two pods to that node — and in this case the architecture it requested was arm64. In another few seconds, once the kubelet is up, we should be able to see the version, instance type, and the zone in which this instance is being created, which is an ARM-based node, and these two pods should get created on that node. So it took about 56 seconds, the instance type it selected was a t4g.xlarge, and the zone it picked was us-east-2c. We can see that the containers are in the ContainerCreating state while the image is pulled in the background, and then both pods are running on an ARM-architecture node. So that's all I had for the demo. Thanks for joining our session.