Welcome. We're going to get started soon. So let's get started. My name is Carlos Santana. I don't play the guitar. Today we're going to talk about our experiences working with end users that are migrating to Karpenter, moving from node groups, the traditional way of autoscaling, to the new CNCF project Karpenter, and how we've seen end users migrate using Argo Workflows. We even color-coordinated. I'll take the Argo Workflows part; I've been working with Argo Workflows for the last year and a half, and with Argo CD for longer than that. Then Raj will introduce himself and talk about Karpenter. He's wearing the blue shirt. Carlos is tall.

Thank you for coming to this talk. My name is Rajdeep Saha. I'm a principal solutions architect for containers and serverless at AWS. I've been engaged with Karpenter from the early days and have helped multiple customers migrate. All right, so who here is using Cluster Autoscaler to scale their Kubernetes clusters? A lot of folks. Cluster Autoscaler is great, and of course it is a CNCF project, part of SIG Autoscaling. However, those of you on a platform team know that with Cluster Autoscaler you need to create node groups. This situation is quite common when I talk to end users: there is a platform team, and you create a node group that can run certain types of virtual machines, because you have to specify what kind of worker nodes can run in that node group. Now an application team tries to deploy a different kind of workload. With all the GenAI hype, perhaps it's a GPU workload, but there is no node group for GPU worker nodes, so that pod cannot be scheduled. So you, the platform team, need to log in and create another node group for GPU instances; that node group can provision GPU worker nodes, and then the pod will get scheduled. This is one of the challenges our end users face.

Karpenter is the new generation of cluster autoscaling. It is also a CNCF project: AWS started it, we donated it to the CNCF, and now it's part of SIG Autoscaling. Think of Karpenter as Cluster Autoscaler on steroids, without the side effects. With Karpenter, you do not need to maintain any node groups. Take the same scenario: we have a c5.2xlarge virtual machine running, and a team deploys a GPU workload. You don't need to go and create a new node group. Karpenter, based on the workload, will automatically provision a GPU instance, and then the kube-scheduler will schedule the pod onto that node. Karpenter provisions appropriate instances based on your pod specification, and it is also faster than Cluster Autoscaler.

I'm certain a lot of you are new to Karpenter, because it's a relatively new project, so here is how the flow compares with Cluster Autoscaler. You have the Horizontal Pod Autoscaler scaling the pods, and at a certain point all your existing worker nodes run out of capacity, so pods go pending and unschedulable. That triggers Cluster Autoscaler: Cluster Autoscaler talks to the auto scaling group, and the auto scaling group provisions EC2 worker nodes. With Karpenter, this part is skipped. You do not need to create or maintain any auto scaling group. Karpenter talks directly to the virtual machine fleet API and provisions worker nodes.
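To make that trigger concrete, here is the kind of pod spec that would sit pending with no matching node group, and that Karpenter can act on directly. A minimal sketch: the image name is hypothetical, and nvidia.com/gpu is the extended resource exposed by the NVIDIA device plugin.

```yaml
# A GPU workload with no matching GPU node group: with Cluster Autoscaler
# alone this pod stays Pending until someone creates a GPU node group;
# Karpenter reads the requirement and provisions a GPU-capable instance.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  containers:
    - name: inference
      image: my-registry/inference:latest   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1   # needs a node exposing GPU devices
```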
And you can control what kind of virtual machines and what kind of AMI you want using two YAML files. We call them the NodePool and the NodeClass. Also note that none of the concepts I'm talking about are cloud-specific; you can extend this and use it with any cloud provider. So let's take a look at a sample NodePool YAML file. On the right we have the YAML, and you can see the kind is NodePool. You have a lot of flexibility in specifying what kind of instance type you need. For example, you could say instance category in c, m, r, or t; instance size not in nano, micro, small, or medium; and instance hypervisor nitro. You also have the option to skip all of it, and Karpenter will automatically determine the best instance type from all the instance types available to it. You can also limit how many instances this NodePool can provision; you can see the limits at the bottom. As soon as the collective number of virtual CPUs reaches 100, this NodePool will stop provisioning instances. One of the advantages of Karpenter is that it always prioritizes cost, so it will always dynamically determine the best instance type based on your workload as well as the cost. Going one step further, not only can you specify the instance types, you also have availability zone flexibility. You could say, for example, spin up my instances in us-west-2a or us-west-2b. Or you can leave it completely blank, and Karpenter will determine the availability zone with the highest capacity.

Now the important part: Karpenter works with Kubernetes scheduling. It works with node selectors, node affinity, taints and tolerations, and topology spread. If we take a look at an example, we see the NodePool on the left and a workload specification file on the right. On the NodePool, you can see I'm specifying annotations, labels, and taints. You can use all of these, because any virtual machine this NodePool provisions will have those annotations, labels, and taints attached. On the right, you can see I'm scheduling a deployment that uses a node selector with the label team: team-a that this NodePool defines. So you can use labels like this to schedule pods for different applications. And like I said before, none of the YAMLs I'm showing here are cloud-provider specific, so this works nicely with Argo CD: all it sees is YAML, it brings in the YAML, and it applies it to the cluster.

So we hear this a lot: Karpenter is great, but this is a lot of talk about Karpenter; what about real-world scenarios? Carlos and I work with a lot of large customers, and they already have Cluster Autoscaler running multi-tenant, with multiple applications. Rarely is there a switch where one day they hit the red button and everything goes to Karpenter. You want to roll it out gradually, so apps gradually move from Cluster Autoscaler to Karpenter; both can coexist, serving separate applications. That's why we created this talk: we wanted to show you how you can migrate with these real-world considerations in mind. And the process also works for a complete move in one go. This is how the process goes: if you're migrating from Cluster Autoscaler to Karpenter, you already have node groups, so you can extract the labels, taints, capacity type, and architecture and put them in the Karpenter NodePool.
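Pulling those pieces together, the NodePool just described looks roughly like this, together with a deployment that targets it. A sketch assuming the Karpenter v1 API and its AWS well-known labels; the pool name, label, taint, and image are illustrative.

```yaml
# NodePool with instance-type flexibility, zone flexibility, a vCPU limit,
# and labels/taints stamped onto every node it provisions.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: team-a
spec:
  template:
    metadata:
      labels:
        team: team-a              # every node from this pool gets this label
    spec:
      taints:
        - key: team
          value: team-a
          effect: NoSchedule      # only pods tolerating this taint land here
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r", "t"]
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["nano", "micro", "small", "medium"]
        - key: karpenter.k8s.aws/instance-hypervisor
          operator: In
          values: ["nitro"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a", "us-west-2b"]
  limits:
    cpu: 100                      # stop provisioning at 100 pooled vCPUs
---
# The workload side: a deployment pinned to the pool with a node selector,
# tolerating the pool's taint.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: team-a-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: team-a-app
  template:
    metadata:
      labels:
        app: team-a-app
    spec:
      nodeSelector:
        team: team-a
      tolerations:
        - key: team
          value: team-a
          effect: NoSchedule
      containers:
        - name: app
          image: my-registry/app:latest   # hypothetical image
```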
And our script actually helps you do all that. Then you extract the security groups, subnets, IAM role, and tags and put them in the Karpenter NodeClass. Remember, those are the two YAML files that Karpenter uses. So let's take a look at the flow. Say your cluster is running Cluster Autoscaler. You have the node group that is running Cluster Autoscaler and CoreDNS, and then you have the application node group, tied to an auto scaling group, running m5.large instances with the application pods inside. And even though I'm showing an auto scaling group, if you are not using auto scaling groups, this works with node groups without them as well. Now you define the NodePool and NodeClass, and then you install Karpenter; you can see at the top, on the node group, we have the Karpenter pods running now. You define the NodePool and NodeClass with the information you have from your node group, and then you set the desired and minimum size to zero. This is the all-in-one-go version. As soon as you do that, all the pods on the existing node group get evicted. Don't get scared: this respects pod disruption budgets, so if you have one set, it is not going to evict everything in one go. At that point, since the node group's maximum, minimum, and desired sizes are set to zero, Karpenter sees a lot of pending, unschedulable pods, provisions a worker node, and all the pods get scheduled. For those of you who noticed, I was a little bit cheeky there: instead of Karpenter provisioning two m5.xlarge instances, it provisioned one m5.2xlarge, because Karpenter is always considering the most cost-optimized way to accommodate those pods. Karpenter will always make that adjustment. And after everything is migrated, you can delete the node group and the auto scaling group.

Now what if you want to do a gradual node group migration? That's fine as well. Instead of setting everything to zero, you set desired, min, and max to a reduced number. In that case, some pods get evicted; again, they all respect pod disruption budgets. They go pending, Karpenter provisions a new worker node, and the kube-scheduler schedules them. And when everything is migrated, you can delete the auto scaling group and the node group. All right, so I went through the workflow; now Carlos is going to show how it actually looks with an actual demo. Carlos, good luck with the demo gods.

Demo gods. I've seen all the folks successfully working with the Wi-Fi. So let's switch to one of the scenarios that we use for this talk. The idea is, as a platform team (let's see if Argo Workflows is up and running), maybe you're working with Argo Workflows, so you're familiar with the workflow UI and you create templates. But in the situations we see with end users, they have organizations and teams that are using these clusters, so it's not in their power to just say, next week we're going to migrate every cluster; say they have hundreds, and we have folks that have a thousand clusters, and move them all right away to Karpenter. As a platform team or DevOps team, you want to create workflow templates so you can have a self-service portal: developers, or that team or organization, can come to Argo Workflows and create a workflow from your templates to migrate their node groups.
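For reference, the NodeClass half of that extraction, on AWS, lands in an EC2NodeClass along these lines. A sketch assuming the karpenter.k8s.aws/v1 API; the cluster name, role, discovery tags, and AMI alias are illustrative.

```yaml
# EC2NodeClass carrying the values extracted from the old node group:
# IAM role, subnets, security groups, and tags.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest              # which AMI family/version to track
  role: KarpenterNodeRole-my-cluster    # IAM role taken from the node group
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  tags:
    team: team-a                        # tags copied from the node group
```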
And sometimes there are multiple teams in one cluster. One team is ready to move; another team maybe wants to hold off. So you, as the platform team, need to provide self-service assets and automation. At the end of the day this is automation, but you want to make it self-service, so that anyone, when they're ready, can migrate with no downtime. And that's the most important thing, because some folks don't want to migrate: they're afraid of downtime while migrating to a different autoscaler. So in this case I have two workflows. One is to migrate to Karpenter, but in the GitHub repo you can see I also have a different one that is a rollback. Because there's always a question, the question that always comes up, right? We say you can use this to migrate, and the next question the end user asks is: what if something goes wrong? How do I roll back? So we wrote an example of how you can go from Karpenter directly back to the state you were in before you migrated, without downtime.

So here's an example of that workflow. I didn't use Hera (Hera is another way of creating Argo Workflows); I just used plain YAML, and I didn't use a DAG. So let's kick it off. I submit it and select the entry point, which is migrate. Maybe this lives in Argo Workflows, or maybe in an IDP, an internal developer platform, something like Backstage or something else, where people come in and give you the inputs for the form that you want to launch. In this case, I'll select one of the clusters, if you're doing one cluster at a time, and then for Cluster Autoscaler you choose whether to reduce it to zero, because you don't want to have waste, or keep it running because there might be other teams in the cluster still using it. In this case, I'll take it down, and then submit.

And then I'll show a little bit of the YAML. I'm using a combination here: I'm using the Python SDK to talk to the cloud provider to get the node groups and extract the information that Raj was explaining. So this first step in the workflow produces an output, and the output is a list of the node groups it found in the cluster. In this case, I have ten teams, each one with their own node group. And then I did a fan-out here. So you can see (let me see how I close this) these are the fan-outs. The demo gods are working on my side. So they're migrating: I'm extracting all the metadata of these node groups, and I want to create equivalent Karpenter NodePools and NodeClasses. And this is a one-to-one migration, because a lot of end users don't want to risk it, right? These are production clusters that are handling our taxes and public sector finance. They just want to move to the new technology first and then optimize afterwards, taking advantage of consolidating resources and picking the right instances. But in this case, it's an example of a safe way of migrating. So this function here (let's see if we can find the logs, they should be here), and these tasks are running on Karpenter; the workflow pods themselves are running on Karpenter-provisioned nodes. So in this case, it was able to create a NodeClass. I extracted information like the workload identity, the IAM role, the security groups associated with that node group (that's for networking), the subnets that are available for that specific node group, and the tags on it. And then I created an equivalent NodePool.
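The shape of that fan-out, for reference: one step lists the node groups and a downstream step runs once per node group. A sketch of the pattern, not the actual workflow from the repo; the template names, images, and cluster name are all illustrative.

```yaml
# One step lists the node groups (here via boto3) and emits JSON; the next
# step fans out over that list with withParam, one migration per node group.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: migrate-to-karpenter-
spec:
  entrypoint: migrate
  templates:
    - name: migrate
      steps:
        - - name: list-nodegroups
            template: list-nodegroups
        - - name: migrate-nodegroup
            template: migrate-nodegroup
            arguments:
              parameters:
                - name: nodegroup
                  value: "{{item}}"
            withParam: "{{steps.list-nodegroups.outputs.result}}"  # fan-out

    - name: list-nodegroups
      script:
        image: my-registry/python-boto3:latest  # hypothetical image with boto3
        command: [python]
        source: |
          # Print the node group names as JSON so withParam can iterate them.
          import json, boto3
          eks = boto3.client("eks")
          groups = eks.list_nodegroups(clusterName="my-cluster")["nodegroups"]
          print(json.dumps(groups))

    - name: migrate-nodegroup
      inputs:
        parameters:
          - name: nodegroup
      container:
        image: my-registry/migrator:latest      # hypothetical migration step
        args: ["--nodegroup", "{{inputs.parameters.nodegroup}}"]
```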
On the NodeClass, it says EC2NodeClass here, but a different cloud provider will have its own node class. And the second object that Raj was mentioning is the NodePool. In this case, I used the NodePool to save the information of the node group, so if I migrate back, I can pick up the annotations and recreate that node group if I want to roll back. The rest is the metadata about it and the specs: things like, I was using amd64, Nitro, and the c5.large, like Raj was explaining; in this case it was a c5.large. And then the taints. The taints here were for a scenario where, in this case, team six wanted to have a node group with certain characteristics, and maybe they have a chargeback, as in, we're paying for those nodes, so they have a taint, meaning that only their workloads land there. And all that information is extracted. In this case, I'm doing an apply directly into the same cluster, but I could apply it to a remote cluster. And this doesn't require Argo CD; that's one of the things we did deliberately, because some end users are not using Argo CD and they just want to migrate. But if you're using Argo CD, you can put these YAML files in Git and then have Argo CD reconcile them. This step here is scaling down the autoscaler, and then the last one (let's see if we can see the logs in here) is basically taking down the node group.

So if we look at the animation we had here: the left side was the node groups, and the right side is Karpenter. So eventually everything in this cluster will move; they're still moving. I had around, I don't know, 100 nodes using node groups, because I disabled the autoscaler, and they're moving into Karpenter. And you can see on the right side that if you use consolidation, Karpenter only deploys the minimum number of nodes to satisfy the pods that are pending; you can see here that I've got 29 nodes running. All the information is in the GitHub repo if you want to recreate it, along with the assets to do that.
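That consolidation behavior is driven by the NodePool's disruption settings. A sketch assuming the Karpenter v1 API; the annotation key, pool name, and instance type mirror the one-to-one migration described above and are illustrative.

```yaml
# A one-to-one NodePool for team six: consolidation packs pods onto the
# fewest nodes, and an annotation stashes the original node group name so
# a rollback workflow could recreate it later.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: team-6
  annotations:
    migration/original-nodegroup: team-6-ng   # hypothetical rollback breadcrumb
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # repack onto fewer nodes
    consolidateAfter: 1m
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: team-6
      taints:
        - key: team
          value: team-6
          effect: NoSchedule   # chargeback: only team six workloads land here
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c5.large"]   # one-to-one copy of the old node group
```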
But the key aspect is that Argo Workflows, we have seen, is becoming kind of the automation tool for platform engineering. It has been used a lot for AI and ML, but with Kubernetes turning 10 years old this year, in 2024, it's reaching the maturity where APIs are actually being deprecated or removed. So we have seen Argo Workflows also being used for cluster upgrades: upgrading APIs, upgrading the whole cluster, whether it's a managed Kubernetes cluster in a cloud provider or your own. So Argo Workflows is becoming a tool not just for deploying net-new apps, but also for migrating clusters in place, for end users that want to do it in a safe way. And since it's Kubernetes, and the platform engineering trend right now is to build all your platforms on top of Kubernetes APIs, it plays well with other tools that use Kubernetes to build platforms in addition to running containers.

So that's the demo. It looks like it worked. We can go back to the slides. Let's see where we were. And this is the QR code for the GitHub repo; it's under the GitHub organization, and there are other examples there. The whole example was deployed through Argo CD, and in this case I think we're using Terraform, so a terraform apply will create kind of a small cluster. The control plane pieces, Argo Workflows, Karpenter, all the controllers, are running on ephemeral Fargate nodes, and everything else is running on node groups. And then you can run the workflow to migrate. So give us feedback, and check out the GitHub repo. Want to say a last word? We'll be here all week. Yes, we'll be at the AWS booth, so come by. We have more Karpenter demos, and if you have more questions on Karpenter, we're happy to answer them there as well. Yeah.