Good morning. Welcome to our session. My name is Olivier Tardieu. I'm a principal research scientist and manager at IBM, and today I'm joined by Abhishek Malvankar, who is a senior software developer at IBM Research. The two of us are very excited to tell you about the lessons we learned trying to use dynamic resource allocation to improve GPU utilization on our Kubernetes clusters.

In this talk, I'll start by briefly covering the motivation: the inference service, the workload that led us on this journey toward dynamic resource allocation. I'll talk about multi-instance GPUs, and then we'll dive into dynamic resource allocation. I'll talk about what it changes from the perspective of Kubernetes users and cluster owners, and I'll introduce InstaSlice, a piece of code we're contributing to the community that is, paradoxically, trying to make sure that nothing changes; that will become clear in a second. After that, Abhishek will walk you through not just what DRA is doing, but how it's working, its mechanics, and what we've been trying to do to improve on the out-of-the-box experience with scheduling latencies and placement strategies.

Over the last couple of years at IBM Research, we've been developing a large inference service that today spans about 100 NVIDIA GPUs. We use this service as a platform for experimentation with models, with serving technology, and so on. It serves a constantly varying, evolving collection of machine learning models, some of them very big, with hundreds of millions of parameters, and some of them relatively small by today's standards, with just, let's say, a couple million parameters. We've already talked twice this week about this inference service; here are the talks on the slide. If you want more specifics, more details about this workload and this service, you can find a lot more statistics in those talks.

For the purposes of today's talk, what matters is that some of these models are small: they only require a fraction of a GPU to run. So, of course, in order to maximize GPU utilization, we like to pack many models onto the same GPU. There are different ways to go after that, but this is KubeCon, so today we're going to talk about Kubernetes-native mechanisms for doing it. How can I run multiple containers or multiple pods on a single GPU?

For that, I need to introduce multi-instance GPUs (MIG), a feature of NVIDIA data-center-class GPUs like A100 or H100 GPUs. Essentially, the idea is that I can divide one GPU into a number of slots, seven today, and, if I want, use these seven slots, or slices, of my GPU as if they were independent devices. But I can also group them in pairs, or in groups of three or four or seven, so that I get a GPU size that is exactly right for my workload. The nice thing about MIG is that I can mix and match: I can have a two-slot instance with a one-slot instance, another one-slot instance, and a three-slot instance, or I can do something completely different. What's not so convenient about MIG, though, is that this is not just a virtual partition of the chip. It's really about deciding what to do with physical areas of the chip.
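For reference, here is the MIG profile menu on an A100-40GB; the profile names and counts come from NVIDIA's published specs, while the YAML framing is just for illustration:

```yaml
# A100-40GB MIG profiles, named <slots>g.<memory>gb.
# Each profile occupies a contiguous group of slots on the chip.
profiles:
  - name: 1g.5gb    # 1 compute slice, 5 GB memory; up to 7 per GPU
  - name: 2g.10gb   # 2 compute slices, 10 GB memory; up to 3 per GPU
  - name: 3g.20gb   # 3 compute slices, 20 GB memory; up to 2 per GPU
  - name: 4g.20gb   # 4 compute slices, 20 GB memory; at most 1 per GPU
  - name: 7g.40gb   # the whole GPU as a single MIG instance
```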
What that means, for instance, is that if I just want a single slot and I choose one in the middle or one at the end of the GPU, it makes no difference to the workload running on that slot: it gets the same amount of memory, the same amount of compute, and so on. But it dramatically changes what I can do later with the remainder of the GPU, because it introduces different fragmentation patterns.

Now, understanding these kinds of resource allocation constraints, and how to best make use of them, is not quite something that Kubernetes resource management and the scheduler can do today, which is why, or at least one of the reasons why, the community has been working on dynamic resource allocation. We're going to talk about that. But before I get there: if you, like me, like Tetris, you can think of MIG as a puzzle. A GPU is this weirdly shaped thing, and you're trying to cover it with these weirdly shaped pieces. You can play this game at home with your kids.

So how does this help us? As I said before, if we can fit a model in a small slice, we can increase density, we can run more models on the GPU, and everything is good. Of course, as you might guess, performance is a different story, because as I reduce the size of the slice, I also reduce the amount of compute and the amount of memory I can use for caching, and therefore I limit the throughput I can get out of the model. As a result, the optimal slice for a particular model depends not only on the model size, parameters, and bit width, but also on the load. And load varies across days, because model popularity changes, and it also varies throughout the day. So if we want to do a good job of maximizing density while preserving our service level agreements or objectives, we have to dynamically vary the size of the slice for small models. You could think of doing this as horizontal scaling, by having fewer or more slices, but that is wasteful because you have to duplicate the model weights; in this case, vertical scaling gives better performance, and that's what we'd like to have. Which is why we need a mechanism to dynamically resize, dynamically slice GPUs, making changes for one particular model without necessarily affecting other models.

So can you do this today with Kubernetes, with mainstream, stable releases? Kind of, but not exactly. What you can do is deploy the NVIDIA GPU operator to your clusters; I'm sure many of you are familiar with it. The NVIDIA GPU operator comes with MIG support, with something called the MIG manager. What it lets you do is let an admin go and label nodes in your cluster to select the profile layout for your GPUs. When you set this label, or when you deploy the operator, the operator ends up actually slicing the GPUs, configuring them according to the specification you've provided, for instance, in this case, building this kind of distribution of slices on the GPU. Then the nodes start advertising matching resources (we call these extended resources because they go beyond just CPU and memory), and you can request one of those slices for a particular pod by just including a resource request or a resource limit in your pod spec. So this is great.
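Concretely, the flow looks roughly like the sketch below. The `nvidia.com/mig.config` node label and the `nvidia.com/mig-1g.5gb` extended resource are the GPU operator's conventions; the node name, pod name, and image are placeholders:

```yaml
# Admin action: ask the MIG manager to slice every GPU on a node into
# seven 1g.5gb instances (run out of band, not part of this manifest):
#   kubectl label node <node> nvidia.com/mig.config=all-1g.5gb --overwrite
#
# User action: once the node advertises the matching extended resource,
# a pod requests a slice like any other resource.
apiVersion: v1
kind: Pod
metadata:
  name: small-model                 # placeholder name
spec:
  containers:
    - name: server
      image: registry.example.com/inference-server:latest   # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1  # one 1g.5gb MIG slice
```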
Again, what's great about this is that it just works. It's stable, and all the slices are available at the same time because this is a proper partitioning of the GPU. It even supports dynamic reconfiguration, because an admin can come in and change the label on the nodes, and the GPU operator will do the rest, propagating and making the configuration changes on the GPUs in your cluster.

What's not so great about it for our use case is that there is no partial reconfiguration, no incremental reconfiguration, and it's slow. By no partial reconfiguration, I mean that if I change this label, all the workloads on my GPU get evicted and have to start again. I cannot make small, limited changes to my GPUs, such as merging the two green partitions into a blue one. It's also not incremental: if a single pod comes to my cluster and I need to allocate a slice on a GPU for that pod, I have to decide ahead of time what to do with the rest of the GPU. I have to guess what the next workload is going to need, which, in general, I might get wrong. And finally, when I do this, I have to wait: the way the GPU operator works, a reconfiguration takes a minute or two before I can actually observe and use the changes.

So this is where dynamic resource allocation enters the picture, or one of the reasons it does. I'm sure many of you saw the keynote from Patrick and from Kevin yesterday. DRA is an attempt by the community to standardize access to specialized resources, where everything is trickier than what we're used to. The resource descriptions themselves are tricky: in the case of MIG, for instance, they combine valid pairs of memory footprint and compute footprint. There are custom satisfiability rules: if I want to decide whether a GPU can satisfy a resource request or not, I need a deep understanding of how MIG works and what the conflicts are between the different layouts of partitions on the GPU. There is also custom initialization and cleanup, because once I decide that, yes, a GPU can satisfy a request, I still have to go configure the GPU so that it exposes exactly the device I've logically given to my pod. And DRA is about much more than that: it's also about sharing, not just sharing by partitioning, but sharing by having two pods use the same slice, and things like that. It's not just for GPUs, either. DRA comes with the concept of resource claims, which follows from persistent volumes and persistent volume claims. So it's about storage, it's about networking, it's about topology. There's a lot of potential there; again, Kevin and Patrick have a list of ten or twelve use cases you can apply this to. This is great.

The last thing to mention, though, is that this is an alpha feature. It has been an alpha feature for a while, and it is going to remain an alpha feature for a while longer, because this is really important and the community wants to get it right. As a disclaimer, I should also say that we at IBM are eager to deploy DRA on lots of clusters. We are working to get there and to facilitate that, but until the specification stabilizes, we are not going to do it just yet.

So how do you use DRA? This is how you use the NVIDIA GPU operator; I've shown this before. The way you use DRA is you comment out these three red lines of YAML and you replace them with the green stuff.
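To give a flavor of the red-versus-green difference, here is a sketch of the same request in both styles, written against the alpha `resource.k8s.io/v1alpha2` API and the NVIDIA DRA driver's claim parameter kinds as we understand them; the exact API group, class name, and field names may differ between driver versions:

```yaml
# Red: the one-line extended resource request (shown commented out).
#   resources:
#     limits:
#       nvidia.com/mig-1g.5gb: 1
#
# Green: the DRA equivalent, spread across several objects.
apiVersion: gpu.resource.nvidia.com/v1alpha1   # driver CRD; version assumed
kind: MigDeviceClaimParameters
metadata:
  name: mig-1g.5gb
spec:
  profile: 1g.5gb                  # note the dot in the profile name; get it wrong and nothing runs
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: mig-slice-template
spec:
  spec:
    resourceClassName: mig.nvidia.com          # class name assumed
    parametersRef:
      apiGroup: gpu.resource.nvidia.com
      kind: MigDeviceClaimParameters
      name: mig-1g.5gb
---
apiVersion: v1
kind: Pod
metadata:
  name: small-model
spec:
  resourceClaims:
    - name: mig-slice
      source:
        resourceClaimTemplateName: mig-slice-template
  containers:
    - name: server
      image: registry.example.com/inference-server:latest   # placeholder image
      resources:
        claims:
          - name: mig-slice        # consume the claim instead of an extended resource
```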
I know what you're going to say, but before we get there, in fairness to DRA, there's a reason for this abundance of green. Again, DRA is expressive, flexible, powerful, and we have to pay that price somewhere. Having said that: ouch, right? I don't want to have to explain to my users that the YAMLs they've been happy with for the last few years have to change, and that they have to update every YAML, and every YAML generation tool, to produce the green instead of the red. I don't want to tell them, if you look at these slides, that sometimes you need a dot between 1g and 5gb and sometimes you need a hyphen, and that if you get this wrong, nothing is going to work. I really don't want to do that.

That is the first reason we've been working on InstaSlice. If you think of DRA as medicine that Kubernetes needs, think of InstaSlice as the sugar coating that makes the medicine go down. InstaSlice is a bunch of things. Abhishek will talk about the more interesting, maybe more advanced features, but to start with, it's all about making you forget the previous slide, making you believe that nothing has to change. The way we do this concretely is that we've implemented a pod admission controller, a mutating webhook, that looks at pod specs, and when it sees a MIG resource request in a pod spec, it essentially rewrites it into the equivalent claims. This is available today; it's open source; you can download it and run it in your cluster. You should give it a star: not because it's rocket science, and not because you think deploying a pod admission controller on your cluster is a great idea (there are hundreds of ways that can kill your cluster and your productivity), but because together we can send a clear message to the community that these kinds of user experience questions, these migration questions, are not a second-order concern. This is really something that the DRA proposal has to embrace: how are we going to help our users migrate from the old world to the new world? Is it something for NVIDIA to solve? How do we do it? That's something we have to discuss, but it is something we need to do. And ideally, I don't want to maintain this code for the rest of eternity.

Before I hand over, I would really like to do a demo, but I think I'm running out of time, so you can go and watch it online after the talk on YouTube. Let me just tell you briefly what you'll find there: essentially everything I've talked about so far. The demo shows how you can create a Kubernetes cluster and, on this cluster, deploy and use the NVIDIA GPU operator to slice your GPUs ahead of time. Here, I'm configuring my GPUs so that each is sliced into seven slices of the same type. I can then deploy the workload, the thing I have on the right, onto the cluster. It's going to work halfway: for one of the pods I can satisfy the request, but for the other one I cannot, because the kind of slice it's asking for is not available in the cluster. Which is why we want DRA in the first place.
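The demo workload is along these lines (a sketch, not the exact demo YAML; the names, image, and second profile are our illustration). With every GPU pre-sliced into 1g.5gb instances, the first pod schedules and the second stays pending:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-a        # satisfiable: 1g.5gb slices exist on every GPU
spec:
  containers:
    - name: server
      image: registry.example.com/inference-server:latest   # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: model-b        # stuck pending: no 3g.20gb slice exists after all-1g slicing
spec:
  containers:
    - name: server
      image: registry.example.com/inference-server:latest   # placeholder image
      resources:
        limits:
          nvidia.com/mig-3g.20gb: 1
```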
What I do next in the demo is reconfigure the GPU operator, so that you see how to do that, but more importantly, I show how you can instead use DRA, which lets me deploy the same workload on the cluster as is. And if I do that, if you've been following carefully, what's going to happen? Well, nothing, because DRA doesn't understand extended resources. But finally, I can combine DRA and InstaSlice, and eureka, I get the best of both worlds. I can use the same workloads as before, I can deploy them on my cluster, they will now run, and instead of having to create GPU slices ahead of time, the combination of InstaSlice and DRA creates exactly the slices you need, exactly when you need them. That, I think, concludes my part of the talk. I'll try to bring back the PowerPoint here. Thank you.

Thank you, Olivier. So Olivier took us on a journey from the device plugin way, the GPU operator way, of doing things, down to DRA, which is certainly the new way of doing things. So let's go over, maybe as a refresher, some concepts of DRA's initial implementation, which worked in tandem with the scheduler; the interaction happened via a new object called the PodSchedulingContext. DRA has, or had, two modes, and we'll go through them.

First, wait for first consumer mode. We consider this a lazy mode: if you do not know a priori what resources your workload or pod needs, then this is a useful mode. Let's try to understand that with a use case. Consider technologies like CXL: these help you attach resources that are not local to the node, on the fly, based on what a pod or workload requires. In those scenarios, wait for first consumer plays a good role and is the mode to work with.

The second mode is immediate mode. This is a much simpler, eager mode. It does a soft allocation, so to speak, and this allocation happens before the pod arrives in the system, at least in the system we designed. We'll learn a bit more about soft allocation in the demo. The reason immediate mode works for us is that most of our workloads in this cluster are inference workloads, and all they ask for is a GPU or a MIG slice. These are node-local resources, we know them a priori, we can inspect them, and we know what the pod is requesting. That's why, for our use case, immediate mode flies.

Okay, so we saw two DRA modes. They both have overheads, so let's compare them. In wait for first consumer mode, the overhead comes from multiple interactions between the controller and the scheduler via the PodSchedulingContext. In immediate mode, we interact with the PodSchedulingContext just once. As illustrated in the workflow chart, the first picture shows wait for first consumer mode, which does a back-and-forth between the scheduler and the DRA controller to find a suitable node, while the second picture, immediate mode, simply consumes a placement that already exists in the system. So, in comparison, immediate mode has less overhead. To enable immediate mode, though, placement and resource checks become key to producing valid allocations.
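In claim terms, the mode was a single field on the claim spec in the alpha `resource.k8s.io/v1alpha2` API; a minimal sketch, with the class and claim names as placeholders:

```yaml
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: lazy-claim                         # placeholder name
spec:
  resourceClassName: mig.nvidia.com        # class name assumed
  allocationMode: WaitForFirstConsumer     # allocate only once a pod needs it
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: eager-claim                        # placeholder name
spec:
  resourceClassName: mig.nvidia.com
  allocationMode: Immediate                # allocate as soon as the claim is created
```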
Since we've established that placement is important, let's first define what we mean by it. For our use case, placement means finding a node and a GPU and then provisioning a MIG slice on that GPU. With MIG-enabled hardware, fragmentation happens easily due to the constraints the hardware imposes: the hardware does not allow vertically overlapping slices. Conversely, some sort of smart placement can help with fragmentation and optimize GPU utilization. Let's try to understand this with an example. Assume you have a queue somewhere, and two workloads requesting a 3g slice and a 4g slice, in that particular order. Given the hardware-supported placements, placing the 4g slice ahead of the 3g slice almost always makes sense. That is the right-to-left placement we define in our system. To achieve such placement, you typically need an external entity that provides these better placement decisions and works with the existing queuing system in the target environment.

Hence, connecting the dots between our learnings and the use case, we now present another feature of InstaSlice: placement, which works with the DRA controller via immediate claims. InstaSlice's selection algorithm at this point is first fit. Placement of workloads happens right to left almost all the time for our use cases, since we are going after packing optimization. A future feature of InstaSlice would be to enable autoscaling, possibly by watching pending claims and using MachineSets, an API provided by the Cluster API project.

The figure below shows the design and the interactions. At a high level: the InstaSlice controller sends placement information along with claims to the DRA controller; the DRA controller records the placements in its NAS object via immediate allocation; the kube-scheduler reads the PodSchedulingContext to bind pods; and finally the workload starts running. InstaSlice also ties the lifecycle of a claim to its workload, meaning that when the workload terminates or completes, InstaSlice cleans up the claim to make room for the next claim in the system.

The whole setup has been running on OpenShift for us, and we share a few pointers on the installation process. We would like to thank Vitaly from Red Hat for helping with the enablement. We would also like to share a few gotchas we encountered while building the system. To summarize, we faced hiccups with consistency, orchestration, and performance in our setup; we list them here, so please reach out if you need more details.

Okay, now it's demo time. Welcome to the InstaSlice demo on OpenShift with DRA enabled. Let's quickly look at the cluster console URL. Once the cluster is set up, we also want to show which operators we installed to enable DRA. First, the NFD operator, which is used to label accelerator-enabled nodes; then the GPU operator, which installs the GPU binaries on the target nodes in the cluster. Next we see the number of nodes in the cluster: for this simple setup, we have a single-node OpenShift cluster that acts as control plane, master, and worker. We see that the InstaSlice controller is running in the instaslice-system namespace, and we also need the NVIDIA DRA driver. Now let's look at the NAS object for a clean-slate cluster. (Trust me, this works; this is a video, so it's a playback speed issue.) We see the NAS object here, and its status is ready.
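As a rough sketch, written from memory of the NVIDIA DRA driver's v1alpha1 CRD (field names and layout may differ), a clean-slate NAS object looks something like this:

```yaml
apiVersion: nas.gpu.resource.nvidia.com/v1alpha1   # API group assumed
kind: NodeAllocationState
metadata:
  name: worker-0                  # one NAS object per node; placeholder name
  namespace: nvidia-dra-driver    # namespace assumed
spec:
  allocatableDevices:             # what the node offers: full GPUs and MIG profiles
    - gpu:
        uuid: GPU-xxxxxxxx        # placeholder UUID
        productName: NVIDIA-A100-PCIE-40GB
    - mig:
        profile: 1g.5gb
        parentProductName: NVIDIA-A100-PCIE-40GB
  # allocatedClaims and preparedClaims sections appear here once claims exist
status: Ready                     # reported as ready on a healthy node (exact shape assumed)
```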
What we also see is the MIG partitioning exposed by the NAS object for an A100 40 GB device. The interesting thing here is that this device is managed by the NVIDIA DRA driver, and this is, of course, the node allocation state object, or NAS. Let's see if the cluster has any resource claims. No, we don't have any, so we start from a clean slate.

We submit a first kind of resource claim, a wait for first consumer claim, and it creates a bunch of objects: specifically, a GPU claim parameter that tries to acquire the GPU, and a MIG device claim parameter that will realize the MIG slice on the acquired GPU. Now remember, this mode is lazy. So let's check the status of the wait for first consumer claim. With this command we see its status, and no surprises there: the state is pending, the reason being that there is no pod in the system. It's lazy; it does nothing beyond creating the claim. Let's verify this by re-inspecting the NAS object: at this point it is simply ready, still a clean slate. Hence we've confirmed that wait for first consumer is lazy, matching our understanding. Now let's delete the claim and start again from a clean-slate system.

This time we submit something else: a regular pod, a workload, and a bunch of things happen. The InstaSlice webhook mutates the pod, creates the desired resource claims, and also provides placements. So what kind of claims did the InstaSlice controller create for this pod submission? We now see a different kind of claim, created by the InstaSlice controller, which is immediate and in the allocated state. What about the placements? You can see them in the label section: all the placement information is carried in the labels, and it is consumed by the DRA controller to make the MIG placement on the target GPU.

Let's look at the NAS object now. Did it change? The NAS object now has two new sections: allocated claims, which we call the intent, and prepared claims, which we call the realization of the intent. Both of them are in sync, which simply means that the partition has been created on the target GPU. Let's view the logs: what does the workload see? No surprises there: the workload sees the correct GPU with the correct MIG slices.

Now let's do something interesting: submit a few more workloads and watch incremental MIG creation in process. If we compare this with the device plugin world, this would not be possible without a full reconfiguration; but DRA enables it. Let's re-inspect the NAS object and see what happened when we submitted more workloads. The prepared claims section now has a lot more content, meaning a lot more claims. Usually the prepared claims section plays a catch-up game with the allocated claims section, and soon the two are in sync. Checking the status of the resource claims, we see they have moved from allocated to reserved, meaning pods in the system are consuming those immediate resource claims.
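As a hedged sketch of what such an InstaSlice-created claim could look like: the immediate mode and the parameter kinds are from the demo, but the label keys carrying the placement are hypothetical, invented here for illustration:

```yaml
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: model-a-mig-slice                          # hypothetical generated name
  labels:
    # hypothetical label keys encoding InstaSlice's placement decision
    instaslice.example.com/gpu-uuid: GPU-xxxxxxxx  # which physical GPU
    instaslice.example.com/placement-start: "4"    # starting slot, packing right to left
    instaslice.example.com/placement-size: "3"     # a 3g profile spans three slots
spec:
  allocationMode: Immediate                        # allocated before the pod is scheduled
  resourceClassName: mig.nvidia.com                # class name assumed
  parametersRef:
    apiGroup: gpu.resource.nvidia.com
    kind: MigDeviceClaimParameters
    name: mig-3g.20gb
```

With that, the DRA controller has everything it needs to prepare exactly that slice on exactly that GPU, which is what the allocated and prepared claim sections of the NAS reflect.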
Now let's do something even more interesting: incrementally delete pods, not claims; we are deleting workloads. The InstaSlice controller maintains a relationship between the workload and its claim, so it cleans up the claim to make room for newer claims. If we look at the resource claims now, none exist, because there are no workloads in the system. If we look at the NAS object, it is back to a clean slate. So InstaSlice manages the claim lifecycle along with the workload, and in this demo we showed how InstaSlice can influence the decisions of the DRA controller.

To summarize: DRA is the future, and it is still in progress. We have seen and learned that InstaSlice implements the extended resources API on top of DRA, improves scheduling latency and placement by adopting immediate mode, and provides right-to-left placement. For the initial DRA implementation, one key requirement we observed is exposing an API that external controllers like InstaSlice can consume to influence DRA placement. I hope you enjoyed this presentation. Thank you for listening.

Before we take questions, just a couple of comments on the demos you've seen. We have two demos: the first one uses plain Kubernetes and the plain upstream DRA driver from NVIDIA, and you can find all the details and the scripts behind it in the InstaSlice repository, so you can not just replay the demo at home, you can recreate it at home, assuming you have A100 GPUs in your basement. Call me if you do. The second demo is an OpenShift demo; it shows more advanced features that require changes to the DRA driver. These are things we're working on publishing with our colleagues at Red Hat, and they should be available shortly. Thanks.

Yuen Turn from NVIDIA, Kevin's colleague. It's great to see the improvements and enhancements around MIG; very nice work. I'm just wondering, can you comment on and compare dynamic MIG slicing with other existing GPU sharing mechanisms, like time slicing and the multi-process service, and do you foresee any use cases for them in your practice? Thank you.

Yes, of course. I don't want to give the misimpression that we think MIG is the one and only way to do slicing or sharing for inference. We chose to focus on MIG in this talk because it's really the poster child for DRA, but we're also considering time slicing and MPS for inference, and we probably even think that, in the end, time slicing is most likely more valuable than MIG for inference.

Okay, great. Looking forward to seeing more use cases. Thank you.

Yeah, great presentation as well, thanks a lot. I'm curious: your clusters are fixed size. Do you have any interoperability with node claims?

Currently, in our scenario, we are working with fixed-size clusters. I'm not sure if you're alluding to autoscaling.

Yeah, I was thinking about Karpenter and node claims. Is there any interoperability between that and the GPU claims?

The thing there is, I don't think Karpenter is there yet in understanding dynamic claim creation and how the information gets propagated. We were in talks with one of the Karpenter maintainers yesterday, and there is work to do on understanding MIG. So right now, I think neither we nor they are ready, but we are in talks to make a plan.

It seems like there's some confusion: you're claiming a GPU on the one hand, and the autoscaler is claiming a node on the other hand. How does that interoperate, I guess?
Yeah, we need to figure out a strategy there. Currently, we don't have one.

Okay, yeah. Thanks a lot.

This is also something where the next revision of DRA, with structured parameters, should make claims less opaque and therefore more amenable to the autoscaler understanding them.

Oh, yeah, good point. Thanks a lot.

Hello, thank you for the talk. It was very informative. I wanted to ask: did I understand correctly that InstaSlice is actually doing an injection, you might say, a mutating webhook that produces the information you showed for the NVIDIA driver, which is, yeah, very long and hard to understand?

Yes, that's correct. That's what it does at this point in time, and it's specific to NVIDIA. It essentially has a table that says: this is what the slice is called if you're using extended resources, this is the corresponding profile name for the claims, this is the equivalent GPU parameter that you need if you're using DRA. So it should be possible to extend. But as I joked in the talk, I'm not convinced that, long term, I want to convince my Red Hat colleagues to deploy pod admission controllers on all their clusters, so I think we need better solutions. At least it's a stopgap solution, and the point is: we need this, right? We need this to be supported by the framework out of the box.

Yeah, it's a good idea, definitely, especially for someone who doesn't want to look into every tiny detail. But in the event that you do have to do some optimization on the NVIDIA side, does InstaSlice allow it, like with some sort of annotation?

Yeah, from a code perspective, we are trying to open source the placement logic, and there is an interface: you can hook your custom policies on top of that interface and have whatever custom placements you want. That's in progress. We haven't released it yet because the new DRA version came in just a day or two ago, so we are still in flux deciding.

Okay, thank you.

Hi, this is Aluk from Red Hat. Thank you for the great talk. I saw you have webhooks, and the scheduler is now watching for new resources. So first, I want to understand the impact of the webhooks on this architecture; and second, whether we need to extend the scheduler to watch for new resources apart from pods.

So, maybe a quick clarification first. DRA relies on extensions of the scheduler so that it understands claims, which it does not unless you enable the dynamic resource allocation feature on your scheduler. So there's definitely an underlying change in the scheduler that we leverage here. As for the first part: the InstaSlice webhook is not changing anything in the scheduler. It sits in the admission path, before the scheduler sees the pod, and makes the necessary changes to your pod spec, essentially.

Okay, great. And then the second part? Go ahead.

On the change to the scheduler: the PodSchedulingContext is a new object that the scheduler consumes to get the placements. That change is baked into the initial version of DRA, an alpha feature upstream. So that's the change to the scheduler.

All right. And then the second part is: you mentioned Cluster API. I didn't get the context.

So, autoscaling. Here we are bouncing around ideas. Currently the autoscaler works on pending pods, but we believe autoscaling could also work on pending claims.
Once you know the claims, you can bunch them together and ask for a node using the Cluster API project, and then somehow stuff those claims onto that node so that pods are attracted to it. That could be one autoscaling policy. The reason we say this is that, if you go the traditional autoscaling way, you would have to rebuild the binary for your own custom scheduler, among other things, so it's a long path. There are multiple solutions; it depends what you want to explore.

All right, thank you very much.

It looks like there are no more questions. Thanks again. Thank you.