I think we're a couple of minutes ahead of schedule, but we'll get started and have some extra minutes for questions if there are any. So the next talk is by Roland and Apoorva from AWS, and they'll be talking to us about constructing Kubeflow supernodes with Karpenter. Welcome them.

All right, welcome everyone. How's everyone doing? How many people have had a Chicago pizza already? That's it? Lou Malnati's? That's not real pizza, New York pizza is real pizza. No, I'm just kidding. Had to go there, sorry. My name's Roland Barcia. I'm the worldwide director in the AWS specialist organization leading all things containers and serverless. I have the pleasure and the honor to be joined by Apoorva; he's going to answer all the questions that you have. He's one of our lead specialists helping customers around the world build Kubernetes applications that run machine learning models with technologies like Kubeflow and others like Ray. I'm going to get us started talking about some concepts around Karpenter, then hand it over to Apoorva to talk a little bit about Kubeflow, and he's going to show a demo of running some of these machine learning models using Karpenter to scale.

So how many people here are running machine learning models on Kubernetes? Let's see, how many plan to? All right, good. So there are a lot of challenges in this space, right? We went from a technology that really started around building and running microservices and stateless applications in Kubernetes to now running a lot of stateful applications that deal with large volumes of data. And when we're talking about scaling requirements for machine learning, it's not just CPU and memory, right? You have to deal with storage, the thing backing the data, how to make that resilient, how to make that scalable. You have to think about networking performance; sometimes the data isn't resident or close, it needs to be streamed in, and you need to run streaming technology on top of the cluster or alongside it. And then there are those always-available GPUs at your disposal, right? Always accessible. Well, they're a scarce commodity these days, and people are trying to figure out how to do things like slicing GPUs or sharing GPUs. So you have to worry about all of these things. And you have a lot of different characteristics, right? I think the multi-tenancy talk was in the room next door. Everyone wants to share clusters. Sometimes they do, sometimes they don't, but they definitely want shared automation and shared resources.

And so of our AWS customers using Kubernetes, about half of them really struggled with a lot of these scaling requirements when using Cluster Autoscaler. The reason they're struggling is that they're now getting very specific about requirements, right? They're thinking about things like: I want to switch instance types for these workloads; for these I want CPU-intensive instances; for these I want instances that have GPUs. There are all these new machine types with machine learning or GenAI characteristics for running inference. There's also the notion of cost optimization. When the year started, before the GenAI moment, everyone was thinking about cost optimization. How do I use spare capacity, things like Spot in AWS (other clouds have their own notion of spare capacity)? How do I use specialized architectures like Graviton and ARM-based instances?
So all of these things came into play, and then we had the GenAI moment. We have tons of people now thinking about machine learning, running existing large language models, and asking how to do that on Kubernetes. There were also a lot of challenges around multi-AZ availability, lots of different challenges. When we're talking about Kubernetes, we're really talking about pods that get scheduled on nodes. You have pending pods, and before, you had something like Cluster Autoscaler working with something like an Auto Scaling group in AWS, or whatever the equivalent cloud component is. What we want to do now is consolidate that orchestration into a single system with Karpenter. Just a note: SIG Autoscaling has agreed to add Karpenter to the SIG, and AWS is in the process of donating Karpenter to the CNCF. So this is something we're working on from a community perspective.

So let's talk a little bit more about Karpenter. Karpenter is really trying to solve this problem: when I run out of capacity on my existing nodes, how do I get a node with the best instance type up as soon as possible? Karpenter is designed around getting that node up, but with very dynamic requirements, and I'll give an example on the next slide. There's a thread in the foreground that's saying: I have a bunch of pods, they're pending, I need them to be scheduled, and I can't find a node to run them on. So I'm going to look at a rule and get the best instance with the best cost and performance possible, right? That's the goal. Then there's another thread in the background doing optimization, things like node consolidation: looking at nodes to see if there's a lot of empty capacity and optimizing things. So there's a secondary thread in the background doing optimization, and I'll show some of that in a moment.

Hopefully you can see the text there. Thinking about Karpenter, and to leave time for Apoorva to talk about Kubeflow and the demo, there are a lot of cool things you can do. This is what's called a provisioner; it's the main construct you create when you're using Karpenter. You can do things like: hey, give me a provisioner that has a pool of different machine types, so you see C5, M5, R5, different values. You could go and say, hey, I want machine types that have GPUs, and if those aren't available, use something else like Trainium or another machine type. You can get very dynamic with the behavior and say, I just want a machine of that type. You can have other kinds of rules; the second rule there says, hey, I want to exclude small instance sizes. So I don't want any nano, micro, small, or large; you can reverse that and target the smaller sizes, but here you're saying launch me xlarge and above. The other thing is capacity type. We have Reserved Instances, On-Demand Instances, and Spot Instances, which give you great savings in a lot of scenarios, up to 90%, where you're using spare capacity in exchange for maybe some SLA. You might have workloads like that, and you're able to say: give me On-Demand or Spot, and it'll grab the best-priced instance, like Spot.
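To make that concrete, here's a minimal sketch of a provisioner along those lines, using the v1alpha5 API that was current at the time; the name and exact values are illustrative, not the ones from the slide:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default                  # illustrative name
spec:
  requirements:
    # A pool of instance families to draw from
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values: ["c5", "m5", "r5"]
    # Exclude the small sizes, effectively "xlarge and above"
    - key: karpenter.k8s.aws/instance-size
      operator: NotIn
      values: ["nano", "micro", "small", "medium", "large"]
    # Allow both Spot and On-Demand; Karpenter picks the cheapest option available
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  # References an AWSNodeTemplate with subnets, security groups, AMI, etc.
  providerRef:
    name: default
```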
And if Spot isn't available, it has cool things to handle Spot disruption. So if AWS or your cloud reclaims the spare capacity, it lets you handle the event the way you want, checkpoint your data, et cetera, and have another instance come up and take over. It has cool things around architecture too. If you have multi-architecture builds, AMD64 and ARM for example (in AWS we have something called Graviton, which are specialized ARM-based instances), it can detect that you have a multi-arch build, and you can ask for the machine type that matches the image. So there's a lot of cool stuff you can do from a provisioner perspective.

The other cool thing is that it uses standard Kubernetes pod mechanisms: node selectors, node affinity, taints and tolerations, topology spread. It respects all those rules. So you can have workloads that run in certain AZs and target them. You can target certain types of processors, like GPUs or Graviton, and certain architectures, but you're building a Kubernetes object that respects the normal Kubernetes rules. And it's a CRD, so it can be part of your GitOps process; some of the talks you might have heard with Argo, with Christina and Carlos over there, if you want to talk to them, really cool stuff there.

And then the optimization thread runs in the background. This is a cool graphic that Raj over there created, where you have a bunch of nodes with pods, and you can see some of the nodes are pretty empty. If you have this feature called consolidation set to true, this runs in the background, and you can also time it with a TTL in seconds after a node goes empty, to consolidate the nodes and save money. This is something a lot of our customers really like. It can also do things like: if I'm on an xlarge instance, I want to go down to a large, and it can have behaviors like that to save money. (You can do some of these things with the node autoscaler, but with a lot more configuration.) You can scale down, you can scale up to run machine learning. So Karpenter is a key part of this architecture, and we're really excited it's going to be donated to the CNCF. And now I'm going to hand it over to Apoorva to talk a bit about Kubeflow and then show the demo. Thanks, Apoorva.

All right, so I think a lot of people might be familiar with this, but just in case you're not, this is what Kubeflow looks like at a high level. Obviously there's a lot more to it than what's depicted here; this, by the way, comes from a Medium blog post by Mikhail Brees on Kubeflow. It has notebooks. It has components you can run for training your models. It now has a feature store and hyperparameter tuning. You can run pipelines; under the hood, those run on Argo Workflows. You can track your experiments. You can serve models using KServe (Seldon, which was the session before this, is another thing you can serve models with) and other tools. So it's a really popular MLOps platform for Kubernetes, partly because if you run Kubeflow, you can run it across clouds, on-prem, et cetera.

And so for the demo, this is the architecture I'll be talking about. As a platform engineer, I want to make sure my data scientists can run their experiments using Jupyter notebooks, but I also want to optimize for cost, right? Karpenter gives me the ability to do that, through things like consolidation and empty-node termination. If there's nothing running on a node, I don't want to use that node anymore. Just terminate it, get rid of it; I don't want to pay for it.
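As a hedged sketch, those two cost levers look like this on the same v1alpha5 API (the two fields are mutually exclusive, so one is shown commented out):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Background optimization: replace under-utilized nodes with cheaper
  # ones (e.g. go from xlarge down to large) and drain nodes whose pods
  # fit elsewhere.
  consolidation:
    enabled: true
  # Mutually exclusive alternative: terminate a node this many seconds
  # after the last pod leaves it.
  # ttlSecondsAfterEmpty: 30
```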
So when you're running Kubeflow and you want to enable Karpenter with it, you're going to need a couple of things alongside Kubeflow. Number one, deploy the Karpenter Helm chart so you get the controller running on the cluster. And then if you want GPU instance types, you need something like the GPU operator. Each GPU vendor has its own daemon sets that run to recognize the accelerated hardware and register it with the Kubernetes control plane, so for NVIDIA you need something like the NVIDIA GPU operator. Data scientists come in and log into their Jupyter notebooks; they don't necessarily need to know what's running under the hood. Everything on the right side of the picture is what the platform operators have to deploy on the cluster. We'll talk about Data on EKS a little later, but basically you can write Terraform blueprints, or CloudFormation, or whatever your infrastructure-as-code tooling of choice is, and deploy all of that. And then you'd have these Karpenter CRDs, provisioners and node templates, which match the types of instances those CRDs are responsible for provisioning. All right? So let me jump to the demo real quick.

So here I have a Kubernetes cluster with EKS, and I have two nodes in the cluster. Right now these two nodes run Kubeflow, and with Kubeflow you get Istio; these are all the different components that are deployed when you deploy vanilla Kubeflow. I also have the NVIDIA GPU operator on it; the GPU operator has daemon sets that register the nodes that have GPUs. And I also have Karpenter, with the Karpenter controllers running on it. All right?

Now let's look at the config map for the GPU operator. When you deploy the GPU operator, you can pass in an optional config map that tells the GPU operator how GPUs should be shared, if you want them to be shared. Here I have a configuration for NVIDIA A100s to use MIG, which is multi-instance GPU sharing. For this demo I won't go into the details of that, but I wanted you to see it's there. Then I have a default configuration for NVIDIA A10Gs, which basically says: don't share my GPUs if they're A10Gs. And then another one for NVIDIA A10Gs to use time slicing. In that scenario, if a node has a specific node label, which I'll show later, I want the GPU operator to create four replicas for each GPU. So each physical GPU is advertised as four schedulable GPUs, and my pods can essentially use four GPUs out of that single GPU instance.
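As a sketch of what such a sharing config map can look like for the NVIDIA GPU operator's device plugin (the config map name and the profile keys like nvidia-a10g-ts are assumptions based on what's described here; the schema follows the operator's time-slicing format):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config      # hypothetical name
  namespace: gpu-operator
data:
  # Default profile: no sharing
  nvidia-a10g: |-
    version: v1
  # Time-slicing profile: advertise each physical GPU as 4 schedulable GPUs
  nvidia-a10g-ts: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

The operator picks a profile per node based on the nvidia.com/device-plugin.config node label, which is exactly the hook the time-slicing provisioner uses next.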
And how that maps to Karpenter is what we'll show next. So here I have three different provisioners. One is a default provisioner: if there's nothing configured as far as constraints go, I want Karpenter to use the default provisioner, which basically says get a node from any of these instance categories. It can be Spot or On-Demand; it can be AMD64 architecture. It's my default provisioner, the one I want Karpenter to fall back to if none of the constraints map to a particular provisioner.

Next is a GPU default provisioner. It looks similar to the default one (use Spot or On-Demand), but here I'm only passing in the instance category G and the G5 instance types. In AWS, that maps to NVIDIA A10Gs. And next is a provisioner for NVIDIA time slicing. Again, it matches what we saw before with the GPU provisioner: Spot and On-Demand, instance category G with the G5 instance types. But here I have an additional constraint, and this is the key. It says that any time a pod has a spec that maps, through a node selector or node affinity, to the node label nvidia.com/device-plugin.config with the value nvidia-a10g-ts, match it to the time-slicing configuration I provided to the GPU operator. This is how Karpenter knows it needs to use this provisioner to do time slicing. And whenever this provisioner is triggered and a node gets launched, Karpenter applies that label with that value on the node, and that is what the GPU operator sees in order to apply the GPU-sharing configuration.

So let's see how that works. Here I have the Kubeflow UI, and I'm going to spin up a notebook; spinning up a notebook is a common use case for data scientists. This particular notebook is going to use four GPUs. So I select an image that has the CUDA driver built in, ask for four GPUs of NVIDIA type, and launch it. As soon as I launch it, Karpenter sees that the notebook pod is pending, because there are no GPUs available in my cluster right now. So it launches an instance that meets the requirement of four GPUs. If you look at the details of the node once it's created, you can already see the GPU operator running on it. Karpenter brings up a node with four GPUs and schedules that pod on that instance. As a data scientist, when I come in and check how many GPUs I have by running the nvidia-smi command, I see four GPUs. In this scenario, the configuration that applied is the GPU default: no time slicing, no MIG, nothing.

Next, I'm going to create a notebook with one GPU: same as before, pick an image with CUDA in it, one GPU, NVIDIA vendor. But in this scenario, I'm going to use an advanced configuration, an affinity config. I'll go into the details a little later, but this is where I'm telling this pod to use time slicing. Again, Karpenter sees there's a pending pod requiring one GPU. If we check the Karpenter logs, we can see it has provisioned a g5.2xlarge, which is a node with one GPU. It figured out there's Spot capacity available, so it used that. And you can see that because we passed an affinity config for the time-slicing configuration, it triggered the node creation using the GPU TS provisioner, the one we're using for time slicing, with one GPU.

So now let's see what needs to happen in the pod spec for the notebook: how do you configure a notebook pod to use that affinity config? This is where we get into the weeds a little bit. If you're not familiar with how to configure this, I encourage you to check out the Kubeflow documentation, but in a nutshell, you basically pass a config map with the required pod specification for your notebook pods.
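A hedged sketch of the node affinity that such a config map injects into the notebook pod (explained next; the exact spawner config map schema is in the Kubeflow docs as mentioned, and the label value is the profile key assumed above):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Selects the time-slicing provisioner, which in turn stamps
            # this label on the node it launches, so the GPU operator
            # applies the matching sharing profile.
            - key: nvidia.com/device-plugin.config
              operator: In
              values: ["nvidia-a10g-ts"]
```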
And this is where I'm passing in that affinity config, a node affinity where I want that label to be applied: nvidia.com/device-plugin.config, with that nvidia-a10g-ts value. This is how Karpenter knows there's a pod coming in with this particular constraint. So now that I have an instance with one GPU and time slicing enabled, we can see that the NVIDIA GPU operator has recognized that it needs to apply the time-slicing configuration, which means it creates four replicas of the single GPU for my pods to consume. That's what the GPU operator does in the background: it applies those replicas on the instance. And now I can schedule four pods, each requiring a single GPU, on this single-GPU instance. So all of these, GPU 1, 2, 3, 4, each is a notebook pod requiring one GPU, and because time slicing is enabled, all four of those GPU pods get launched on an instance with a single physical GPU.

Next, I'm going to show how something like this can apply to a pipeline as well. If you're running Kubeflow Pipelines in your cluster, with a standard pip install of the KFP SDK, I have two tasks that basically contain the components. One is a GPU task and the other is a CPU task. In the GPU task, I run the provided example for the vector-add calculation; it's just an example to test GPUs, with a random amount of sleep in it. The CPU task is the same thing with a sleep, just to emulate CPU and GPU work. I also have a parallel task, where I want to spawn four pods, each requiring one GPU of NVIDIA type. And then I compile all of that into a pipeline. If we look at the pipeline code, it's a small CPU task, followed by a single GPU task, followed by a large CPU task. In the large CPU task, I have a requirement of 16 CPUs, just a random number, but I want Karpenter to provision an instance that can accommodate this large task. Then there's the parallel task followed by another medium task. So it's basically a chain of activities, but for each task, Karpenter figures out which node it needs to provision to accomplish that particular task, and provisions it on demand.

So if we run this pipeline now and let it go through to completion, we're going to see nodes come up and down. It spins up nodes as each activity is triggered. The first CPU task gets triggered and, because there's capacity available on my existing nodes, Karpenter doesn't provision any additional EC2 instance for it. Next is the GPU task. Now again, I don't have any GPU capacity available in the cluster, so Karpenter provisions a GPU instance and makes it available. Skipping forward a bit in the interest of time, Karpenter keeps bringing up nodes and taking them down as the activities complete within the pipeline. And at the end, when the pipeline is finished and there's nothing else running on those nodes, Karpenter has taken them away. You're not paying for those instances anymore.
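To make the mechanics concrete: each pipeline task ultimately compiles down to a pod, and its resource requests are all Karpenter sees when picking an instance. A hypothetical sketch of roughly what the 16-CPU task reduces to at the pod level (name, image, and command are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: large-cpu-task           # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: task
      image: python:3.10         # placeholder image
      command: ["python", "-c", "import time; time.sleep(60)"]
      resources:
        requests:
          cpu: "16"              # Karpenter must find or launch a node
        limits:                  # with at least 16 vCPUs free
          cpu: "16"
```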
With that, I'm going to hand it back to Roland for our conclusion. Yeah, so I think we have a couple of minutes left. Just really quickly: we have a project called Data on EKS, with a bunch of different infrastructure-as-code patterns, where you can download patterns like this one and others for machine learning, streaming, Spark processing, and other such use cases that we've built on Kubernetes. It's a project with infrastructure-as-code templates, some best practices, some performance benchmarks, and some reference architectures. It's out in the open; you can take a look at the various areas, with AI/ML being at the forefront. We have a lot of different links in the material we share around Karpenter, as well as around running different GPU instances, JupyterHub platforms, and other machine learning environments. So here are some links for you to look at. You can rate the talk here if you liked it; if not, you don't have to. And that's our LinkedIn if you want to ask us deeper questions. I think we have really a minute and 45 seconds on the big clock over there, which the table is blocking for us short people. Yep, so a minute for questions; if not, we can chat on the side. We'll be at the AWS booth tomorrow during KubeCrawl, and during the day as well, if you want to ask more questions about some of the machine learning work that we're doing on EKS. Thank you.

Thank you very much for the great presentation. It looks like there's already a question. Go ahead.

I'm really curious what the upgrade story is for Karpenter when doing control plane upgrades, and how nodes get refreshed during that.

Yeah, we actually publish some best practices around that. We're also doing some work around Karpenter with EKS; we can talk offline if you want to get into roadmap-y stuff. But we do have patterns to help you upgrade, and some best practices. Raj, raise your hand over there; he runs my Karpenter focus group and he has a lot of the best practices, so ask him for more details on where to find the best practices for control plane upgrades and upgrading the framework as well, if I understood the question. Yep. Thank you.

Hello, this is a very good topic. During the demo, I saw an example with multiple GPUs, and one for a MIG instance. I'm wondering, when we provision, do we need to pre-configure it, or does Karpenter support scheduling it automatically?

No, Karpenter does not understand MIG today, so we're working with a few workarounds for the moment. We're really waiting for something upstream, something like DRA (dynamic resource allocation), to be the mechanism we can use going forward. But for now, it's a matter of scheduling a low-priority GPU pod, letting the GPU operator do its thing, and then having additional pods get scheduled that can use the MIG units.

OK, sounds good. Just one more question. I saw the consolidation feature; that's very cool, moving pods around from one node to another. But for the node itself, does it support something like a vertical pod autoscaler?

So it has the ability to see that an xlarge node can be shrunk down to a large node, if that's what you mean at the node level. OK. Yes, thank you.

All right, I think we're out of time, so thank you, everyone. Thanks, everyone. Thank you. Thank you.