These slides may be familiar to you if you were in the keynote this morning. I could go through them in two and a half minutes, but I think you are all here to get more information about dynamic resource allocation, so we will do a lot more talking and go into more detail than we could in the keynote. My name is Patrick Ohly. I work for Intel. I'm one of the key architects behind this new feature that you have been hearing about. This is Kevin Klues from NVIDIA. He's one of our biggest users of it and has also contributed a lot of the content, including the example drivers. He will talk about how to use this feature with GPUs, which is what you all care about if you're running AI workloads.

So let's talk a little bit about what DRA does and how it's designed. It started out as an attempt to overcome the limitations of the device plugin interface. It took inspiration from volume management: if you're familiar with volume handling in Kubernetes, this new API is very similar. We have something called a resource class that can be created by an administrator. Users create resource claims. These resource claims get matched to specific hardware, and then in a pod and in a container we reference this resource claim to give the container access to that allocated hardware. We designed it so that a resource can be local to a node, which is what you could already do with the device plugin interface, but it can also be network attached. The API allows flexible sharing between pods and containers, simply because there is a standalone object that you reference. And we also added a new concept for defining parameters for your resource claim. These parameters affect scheduling: you may request something of a certain size, but you can also, in the same object, define parameters that configure your hardware. That becomes relevant when we have a complex piece of hardware that needs to be initialized in a certain way, in a vendor-specific format.

I've already mentioned vendors once now. The key point really is that DRA in Kubernetes is just a framework. It enables hardware vendors to extend Kubernetes by writing DRA drivers, and these DRA drivers are then responsible for the hardware and also for the user-facing interface: how to specify parameters depends on the hardware that you are asking for.

It has been in alpha for a while, and we've been trying to get it to beta. We got some feedback from the community, and to clarify why that has been controversial, let's look a little bit at the background: how it works at the moment, or how it was originally meant to work. The original idea was that we wanted to keep logic and knowledge about the resources out of Kubernetes. It seemed like a daunting task to teach Kubernetes how to handle arbitrary hardware, so we tried to hide that behind an interface that the vendors implement. The DRA driver controller basically takes the parameters that it gets from the user, it has its own way of tracking resource availability on a node, and it does all the matching. And the Kubernetes scheduler gets involved: the two need to coordinate among themselves, because the scheduler is the part that knows about CPU and RAM, the traditional resources, and the DRA vendor driver knows about resource availability for the more complex hardware. That was the idea originally. The problem with that is that we have to have some kind of communication protocol between the scheduler and the driver.
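Before getting into that protocol, here is a minimal sketch in Go of how the user-facing objects mentioned above, class, claim, pod, and container, reference each other. These are deliberately simplified stand-in structs rather than the real resource.k8s.io or core/v1 types, and the names gpu.example.com, my-gpu, and my-gpu-params are invented for illustration.

```go
// Simplified, illustrative structs (not the real Kubernetes API types) showing
// how the DRA objects reference each other.
package main

import "fmt"

// ResourceClass is created by a cluster administrator and names the DRA driver
// that handles claims of this class.
type ResourceClass struct {
	Name       string
	DriverName string
}

// ResourceClaim is created by a user; it points at a class and, optionally, at
// a vendor-specific parameters object that influences scheduling and hardware setup.
type ResourceClaim struct {
	Name          string
	ClassName     string
	ParametersRef string // name of a vendor-defined parameters object (hypothetical)
}

// Pod declares the claims it uses; each container then references a claim by
// name to be granted access to the allocated hardware.
type Pod struct {
	Name       string
	Claims     []string // pod-level list of resource claim names
	Containers []Container
}

type Container struct {
	Name   string
	Claims []string // subset of the pod-level claims this container may use
}

func main() {
	class := ResourceClass{Name: "gpu.example.com", DriverName: "gpu.example.com"}
	claim := ResourceClaim{Name: "my-gpu", ClassName: class.Name, ParametersRef: "my-gpu-params"}
	pod := Pod{
		Name:   "training-job",
		Claims: []string{claim.Name},
		Containers: []Container{
			{Name: "trainer", Claims: []string{claim.Name}},
		},
	}
	fmt.Printf("%s -> claim %q -> class %q\n", pod.Name, claim.Name, claim.ClassName)
}
```

Because the claim is a standalone object, several containers or pods can point at the same claim name, which is where the flexible sharing described above comes from.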
That communication protocol is this pod scheduling context in the middle. It is a built-in type where the Kubernetes scheduler currently stores some information: the idea is that the scheduler lists some suitable nodes and stores them in the pod scheduling context. There is a more complete diagram two slides down, so I'll get to the details in a moment.

This concept is what we've been talking about at previous KubeCons. There is a talk from Kevin and Alexey, a colleague of mine, about building such a driver. We have a repository available that gives you a skeleton of an example driver that's actually functional. You can run it on your local development machine; it runs in kind. You bring up a cluster that simulates having some GPUs, so you can see all of this in action on your own machine. We do have a resource class and the resource claims that I already mentioned, so you can see how it works with this approach.

This is a more complete picture of the communication with the Kubernetes scheduler. It goes through the API server. That was one other idea that we had: deploying the DRA driver should be fairly simple. You just need to connect it to the API server, and that's all; no need to configure webhooks or anything that may be cluster specific. It could be standardized like that and run on any arbitrary Kubernetes distribution.

The drawback is that there is a lot of back and forth when scheduler and driver need to coordinate. It starts with the Kubernetes scheduler identifying a pod that needs to run. It looks at nodes that it finds suitable and dumps that information into the pod scheduling context. Then the DRA resource driver looks at that and replies back: okay, these nodes here don't work for me, try something else. There is some back and forth, and that is kind of where the name dynamic resource allocation comes from. Eventually, the Kubernetes scheduler can be relatively sure that it has identified a suitable node. It tells all the drivers, and it could be more than one: more than one driver can be listening to this pod scheduling context if you have multiple resources from different vendors; that also works. They all just need to agree on a node, then allocate the resource, which gets recorded in the resource claim status, and then eventually the Kubernetes scheduler knows: okay, now I can run my pod.

The advantage is that the pod never gets scheduled unless it's fairly certain that it really can run on a node. You don't get into a situation where it gets scheduled, is stuck on a node, and blocks resources, which is something that has sometimes happened in the past with the device plugin interface, where people tried to extend it and then found out at runtime on the node that the setup wasn't quite right.

The problem with this approach is that the communication between scheduler and DRA driver is not something that works for other Kubernetes components, most notably the cluster autoscaler. The cluster autoscaler operates purely in read-only mode. It sees what the current cluster state is and which pods are not running, and it tries to determine through simulation whether adding a node may help to get a pod running. But it can't ask the DRA driver; we don't have an interface for that yet.
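To make the cost of that negotiation loop easier to see, here is a toy Go simulation of the classic flow just described: the scheduler proposes candidate nodes through a PodSchedulingContext-like object, each driver marks the nodes it cannot serve, and the loop continues until everyone agrees. This illustrates only the shape of the protocol; it is not actual scheduler or driver code, and the types are reduced to a few fields.

```go
// Toy simulation of the "classic" DRA negotiation between scheduler and drivers.
package main

import "fmt"

type podSchedulingContext struct {
	PotentialNodes  []string
	UnsuitableNodes map[string][]string // driver name -> nodes it rejected
	SelectedNode    string
}

// driverFilter stands in for a DRA driver controller watching the context and
// writing back which of the proposed nodes it cannot allocate resources on.
type driverFilter func(nodes []string) (unsuitable []string)

func negotiate(candidates []string, drivers map[string]driverFilter) (string, bool) {
	ctx := podSchedulingContext{UnsuitableNodes: map[string][]string{}}
	for _, node := range candidates {
		ctx.PotentialNodes = []string{node} // scheduler proposes a node
		ok := true
		for name, filter := range drivers { // each driver replies via the API server
			if rejected := filter(ctx.PotentialNodes); len(rejected) > 0 {
				ctx.UnsuitableNodes[name] = append(ctx.UnsuitableNodes[name], rejected...)
				ok = false
			}
		}
		if ok { // all drivers agree: allocate and record in the claim status
			ctx.SelectedNode = node
			return node, true
		}
	}
	return "", false
}

func main() {
	drivers := map[string]driverFilter{
		"gpu.example.com": func(nodes []string) []string {
			var bad []string
			for _, n := range nodes {
				if n == "node-a" { // pretend node-a has no free GPU
					bad = append(bad, n)
				}
			}
			return bad
		},
	}
	node, ok := negotiate([]string{"node-a", "node-b"}, drivers)
	fmt.Println(node, ok) // node-b true
}
```

Every round trip in that loop is a write and a watch through the API server in the real system, which is why a read-only component like the cluster autoscaler cannot participate.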
An interface for the cluster autoscaler to ask the DRA driver was not part of the original design. There were some ideas about plugging vendor logic into the cluster autoscaler binary, but that would have implied recompiling the binary, which is often not an option, or doing remote procedure calls, which are also difficult from a performance perspective. So in the end, after discussions with the maintainers and the community, we came up with this idea of built-in parameters. It takes the same ideas: we have resource claim parameters, but they now have a built-in type in a format defined by Kubernetes. The corresponding resource information uses the same format, the same model, and now we have code in the Kubernetes scheduler itself which does the matching, tracks resource allocation, and assigns resources from nodes to a resource claim. There's no difference for the user. We still allow a vendor to define their own CRD for the parameters; they just then need to convert those parameters into the built-in resource claim parameters type, and the rest is handled by Kubernetes. So writing a DRA driver actually becomes simpler with this model, with the caveat that it needs to fit into what is supported by Kubernetes in terms of what you can express with the built-in model. We have some things defined right now in 1.30, but it's literally just getting started, and we know that more work is needed in that area.

And yeah, that was a recap of the old approach, and here's how it works with the new approach. It's basically just the task of a DRA resource driver to produce these resource claim parameters. The scheduler reads all of that information and doesn't need to communicate with anything. It can also do rapid pod scheduling now: it can look at one pod, assign resources, keep track of what it claimed, then look at the next pod, just like it does with the built-in, native resources. So that is another advantage. The named resources model is what we got into 1.30 as a starting point. And Kevin, I think you will take over and explain more.

Yeah, so before I jump into the details of some of this, I just want to set some expectations about this talk. This is a maintainer track talk, so we're not going through all of the details about how this whole mechanism works. If I just back up to this slide real quick: if you're very new to DRA and don't know anything about it, we're glossing over a lot of the details, but if you watch this talk from last year in Amsterdam, it gives a very good overview of what DRA is, how it works, and what these different abstractions we're talking about are: resource class, resource claim, class parameters, claim parameters, and so on. The purpose of this talk is really to highlight the state of the project and where we see it going in the future. So I just wanted to set expectations in case you feel a little bit lost; that's where this is coming from.

So as Patrick mentioned, the world of DRA in its original incarnation looked something like this, where you have the Kubernetes scheduler talking with the DRA resource driver, and the DRA resource driver is in full control of how these resources get allocated. And in fact, the API between these two is a very simple one. The Kubernetes scheduler looks at the constraints that it knows how to resolve: how much CPU do you need, how much memory do you need,
some of the built-in and even extended resource types that it needs to figure out what node a pod could land on. And if it sees a resource claim referenced in that pod, it looks up what resource driver it's associated with and just says: I don't know how to deal with resource claims. Hey driver, you figure out how to narrow down this list of nodes that I've come up with using the constraints that I know about, and tell me which nodes from that set you might be able to allocate resources for this claim on. And then this process goes back and forth until an actual node is found where you can allocate those resources. That's where this long loop comes from, where scheduling might be slowed down, and the interfaces for doing cluster autoscaling don't exist, because all of this logic is custom inside the resource driver and the built-in scheduler knows nothing about it.

So as Patrick mentioned, this move towards a model with built-in parameters basically allows these in-tree types to exist that your driver can advertise its resources to, and any vendor-specific logic around how you might select a specific resource, or configure that resource once it lands on the node, can be encoded in these in-tree types that the scheduler can actually look at and make decisions on. In this picture here, everything in green is a vendor-specific component and everything in blue is an in-tree type. You can see how the kubelet plugin advertises something to the kubelet so that it can write this in-tree type called a resource slice, which the Kubernetes scheduler can now pick up to figure out what resources are available, so that it can make its node selection decision without having this back and forth with the custom driver.

And as Patrick mentioned, in 1.30, for this built-in model, we came up with a reference implementation of something called named resources. If you're familiar with the existing device plugin API, what that API allows you to do is advertise a list of opaque strings back to the kubelet which represent the resources that you might be able to allocate. So in the world of GPUs, you might advertise that I have eight GPUs and the opaque strings that represent them are gpu-0, gpu-1, up through gpu-7, right? That's the only information you pass back to the kubelet. The kubelet then tracks that as the set of resources it's able to allocate, and it writes just a simple count of how many it has back into the node object in the API server. So you've passed back these eight strings, the kubelet has recorded them and said that there are eight of this resource type available on this node, and the scheduler makes all of its scheduling decisions based on that.

In this built-in resource model for DRA, we're basically doing the same thing, except instead of passing back just a string, we're able to pass back the name of a resource and a set of attributes attached to it. So in the example I'm showing here, the instance that I'm passing back is called gpu-0. That's similar to the opaque string I would have passed back in the traditional device plugin, but now I've got all these attributes on it: an index of zero associated with this device, a UUID with this long string, a product name of NVIDIA A100-SXM4, 40 gigabytes of memory, and so on.
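As a rough illustration of the difference, here is what such a named resource instance might look like when modeled in Go. The field names, the driver name, the node name, and the placeholder UUID are simplified assumptions rather than the exact schema of the real ResourceSlice API, where attributes are typed rather than a plain map.

```go
// Illustrative sketch of the "named resources" idea: instead of an opaque
// string like "gpu-0", the driver publishes an instance plus attributes that
// the scheduler can match against.
package main

import "fmt"

type namedResourceInstance struct {
	Name       string
	Attributes map[string]any // attribute name -> string, int, bool, or quantity-like value
}

type resourceSlice struct {
	NodeName   string
	DriverName string
	Instances  []namedResourceInstance
}

func main() {
	slice := resourceSlice{
		NodeName:   "node-1",
		DriverName: "gpu.nvidia.com", // illustrative driver name
		Instances: []namedResourceInstance{
			{
				Name: "gpu-0",
				Attributes: map[string]any{
					"index":        0,
					"uuid":         "GPU-0000-example-placeholder", // not a real UUID
					"product-name": "NVIDIA A100-SXM4-40GB",
					"memory":       "40Gi",
				},
			},
		},
	}
	fmt.Printf("%s publishes %d instance(s) on %s\n",
		slice.DriverName, len(slice.Instances), slice.NodeName)
}
```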
And so you can imagine how this is going to be used by the scheduler: instead of just being able to say, okay, I'm going to allocate some arbitrary GPU, you can actually do very precise selection on the type of GPU that you want, right? You can match against these attributes based on some selection criteria for what type of resource you're actually trying to get access to. People have worked around this today by using some combination of node labels and other mechanisms, where you basically have to have a homogeneous set of devices on a node. If you have that, then you can use labels on the nodes to try to get access to a specific type of GPU, because you know that if you land on that node, that's the only type of GPU available there, right? But with this model we can actually do precise device selection, which automatically enables you to have different types of GPUs on the same node at the same time.

And again, using the analogy of the existing device plugin API: if you're familiar with it, you'll know there's a ListAndWatch API call in the drivers written for it, which lets you stream this list of devices, as I mentioned before, as opaque strings back to the kubelet so that it can update the node with how many are available. In this new structured parameters model, it's a very similar thing, except instead of streaming back an opaque list, you're streaming back an entire model that looks like what I had on the last slide. You stream all of this back, and instead of just a count being made available to the scheduler, this entire object is made available to the scheduler so that it can do precise device selection based on that information. You send this once when the node starts up, and then if something goes wrong, like some resource becoming unhealthy, you can send an update on the stream for the things that might have fallen off the bus or gone unhealthy for some other reason.

So if we go back to the picture I had before, the kubelet plugin of the DRA driver basically satisfies this part of the picture, right? You've got your DRA resource driver, the kubelet plugin advertises its resources through the streaming API to the kubelet, the kubelet uses that to write a resource slice into the API server, and now the scheduler has the ability to look at this information when it's trying to make scheduling decisions for your pod.

On the other side of that is the notion that Patrick also mentioned of introducing this in-tree abstraction called the resource claim parameters object. In both the original DRA with opaque parameters and our built-in parameters model, we have the notion of allowing you, as a vendor writing your device driver, to introduce vendor-specific claim parameters. And that's what you're seeing here on the left. This is the claim parameters object that our GPU driver for DRA implements. It allows you to specify both selection criteria, meaning how can I get access to a specific type of GPU, as well as configuration criteria, meaning once this GPU has been allocated to a pod and it lands on a node, how do I want that GPU configured for use on that node? And so you can see it's divided into two sections for that.
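Since the slide itself isn't reproduced in this transcript, here is a rough Go sketch of what such a two-part claim parameters spec might look like. The field names, strategy strings, and interval values are illustrative stand-ins, not the actual CRD of the NVIDIA DRA driver; the upper half drives selection at scheduling time, the lower half drives configuration on the node.

```go
// Illustrative sketch of a vendor-specific claim parameters spec split into a
// selection part (evaluated before scheduling) and a configuration part
// (applied on the node after allocation). Hypothetical field names.
package main

import "fmt"

type gpuSelector struct {
	ProductNameContains string // e.g. "A100"
	MaxMemory           string // e.g. "40Gi"
}

type timeSlicingConfig struct {
	Interval string // e.g. "Short", "Medium", "Long"
}

type gpuSharing struct {
	Strategy          string             // e.g. "TimeSlicing" or "MPS"
	TimeSlicingConfig *timeSlicingConfig // only set for the TimeSlicing strategy
}

type gpuClaimParametersSpec struct {
	Count    int         // how many GPUs this claim asks for
	Selector gpuSelector // which GPUs are acceptable (scheduling-time selection)
	Sharing  *gpuSharing // how to configure the GPU once allocated (node-level)
}

func main() {
	spec := gpuClaimParametersSpec{
		Count:    1,
		Selector: gpuSelector{ProductNameContains: "A100", MaxMemory: "40Gi"},
		Sharing: &gpuSharing{
			Strategy:          "TimeSlicing",
			TimeSlicingConfig: &timeSlicingConfig{Interval: "Long"},
		},
	}
	fmt.Printf("want %d GPU(s) matching %+v, shared via %s\n",
		spec.Count, spec.Selector, spec.Sharing.Strategy)
}
```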
At the top of that spec, you can see that I want a count of one GPU that follows the selector expression, saying that I basically want a GPU that has A100 in its product name and has less than or equal to 40 gigabytes of memory. And once it lands on the node, I want it configured so that anyone who shares access to this GPU uses time slicing with a long time slice, as opposed to a short or a medium time slice.

So the only job of the controller piece of a DRA driver that you need to implement, as opposed to the opaque parameters model where the controller had to do all of the allocation logic because the scheduler didn't know how to, is to take this vendor-specific claim parameters object and translate it into the in-tree, generic representation so that the scheduler knows how to operate on it. The way it does this is that it generates the resource claim parameters object and puts a back reference in it to your vendor-specific claim parameters object, so that the scheduler knows how to find it and attach the other pieces of information to the resource claim parameters object. So there's an explicit generatedFrom section that refers back to your vendor-specific claim parameters object. There's also a generated CEL expression for selection, and the scheduler knows how to interpret that CEL expression.

If you're not familiar with CEL, how do you describe CEL? It's a simple expression language that can be evaluated in constant time. It's strongly typed in this case, so it always evaluates to a Boolean, and there is some validation that you are not calling something that is not defined in this particular context. It can also be extended: this attributes.string accessor is something that we are adding in this particular CEL environment to grant the expression access to the resource attributes. So basically what we have expressed over here can be translated into a CEL expression. And what Patrick mentioned with the attributes is that in the resource slice I showed an example of before, there was a set of attributes attached to the resource, and you can match on those here. That allows you to do precise selection of the GPU that happens to be available for what you're asking for.

And then any vendor-specific things that the resource claim parameters object itself doesn't care about, but that you need available on the node when you go to actually allocate these resources, you can tack onto the end in this vendor parameters section. From the perspective of the resource claim parameters object, this is just an unstructured type that's tacked on there, and you have to reinterpret it once the node itself starts to allocate resources for the claim.

So if we go back to this picture now, this is the top half of the picture: the vendor-specific controller sees one of these vendor-specific claim parameters objects created and generates an in-tree resource claim parameters object that the scheduler knows about. Putting this all back together, the scheduler now has all the information it needs to select the node containing the requested resources, alongside all of the other constraints it resolves for any other types of resources.
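To show what interpreting such a generated selector involves, here is a small, self-contained Go example that uses the cel-go library to compile and evaluate a CEL expression against a map of device attributes. This is not the exact CEL environment that DRA defines (which exposes typed accessors such as attributes.string); the attribute names and the plain map variable are assumptions made just for the demonstration.

```go
// Evaluate a selector-style CEL expression against device attributes using
// github.com/google/cel-go. The environment and attribute names here are
// simplified for illustration and differ from the real DRA CEL environment.
package main

import (
	"fmt"

	"github.com/google/cel-go/cel"
)

func main() {
	// Declare a single variable "attributes" as a map of string -> dyn.
	env, err := cel.NewEnv(
		cel.Variable("attributes", cel.MapType(cel.StringType, cel.DynType)),
	)
	if err != nil {
		panic(err)
	}

	// A selector similar in spirit to "product name contains A100 and
	// memory <= 40 GiB".
	ast, iss := env.Compile(
		`attributes["product-name"].contains("A100") && attributes["memory-gib"] <= 40`,
	)
	if iss.Err() != nil {
		panic(iss.Err())
	}

	prg, err := env.Program(ast)
	if err != nil {
		panic(err)
	}

	// Attributes as they might be published for one named resource instance.
	out, _, err := prg.Eval(map[string]any{
		"attributes": map[string]any{
			"product-name": "NVIDIA A100-SXM4-40GB",
			"memory-gib":   40,
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("selector matches:", out.Value()) // selector matches: true
}
```

That is, roughly, the kind of evaluation the transcript describes: the driver-generated expression is checked against the attributes published in the resource slice to decide whether a particular instance satisfies the claim.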
On one of the earlier slides we pointed out that we have an example DRA driver repo where you can go in and play around with some of this stuff. We haven't updated that for structured parameters yet, but we plan to once kind has an image released for 1.30. I mean, 1.30 is not even out yet, right? If you wanted to start playing around with this now, you'd have to compile against the master branch of Kubernetes, but once kind has a 1.30 image we will coincide that with having structured parameters implemented in the example driver. That said, I have already implemented structured parameters for our NVIDIA GPU driver, so if that's what you're interested in playing around with and seeing how it works, you can go to this branch and see what changes were made from the opaque version of the implementation.

And speaking of which, even with this very, very simple model for named resources: I have a document that I put together a few months ago called NVIDIA GPU use cases for dynamic resource allocation, where I outline 12 use cases that DRA is able to solve for GPUs that you can't do with the existing device plugin API. Even with this very simple named resources model, we already cover six of those 12 use cases, which is actually a pretty good number given how simple the model is. If you watch my talk from KubeCon last November, these are the same use cases I identify there, where I go into lots of detail about how DRA works and how specifically it maps to the different problems we're trying to solve with GPUs. So I encourage you to check out that talk if you're interested.

And these are those six, actually the full 12 use cases, showing what's supported and what's unsupported. I just want to quickly name each of them so you see the difference between what is supported and what isn't. The things that named resources gives us the ability to do: controlled GPU sharing, right? I can create a claim, I can have two different containers point to that claim, and now they get shared access to it in a very controlled way, because the claim is where the resource is bound. I can also get GPU selection via complex constraints; that's this whole selection mechanism that eventually gets translated into a CEL expression, and I can still do that with this resource model. You can have multiple GPU types per node: I can have an A100 and a T4 sitting next to each other, and as long as I put my selection criteria together appropriately, I can pick one versus the other. You can still do user-driven time slicing across a subset of GPUs; that's the vendor-specific information you tack on at the end, and at the node level you can still interpret it the same way you would have in the opaque parameters model, so there are no changes there. Same thing with MPS. MPS, as I mentioned this morning in the keynote, is a way of doing space partitioning instead of time partitioning, and you can still take advantage of the MPS support because, again, this is a node-level thing; we just tack it onto the vendor parameters at the end of our resource claim and it gets passed along to the node to actualize. And then the last one is dynamic swapping of the NVIDIA driver with the VFIO driver, depending on the intended use case of the GPU.
This is for being able to support virtual machines: KubeVirt or Kata Containers need this kind of capability, so that on a GPU-by-GPU basis you can say, hey, this GPU is going to be injected into a virtual machine? Cool, I want it governed by the VFIO driver rather than the native NVIDIA GPU driver sitting on the host.

These other six are unsupported at the moment. That said, by 1.31, one release later, we already have plans to support the majority of them within the next release cycle. We'll be able to support 10 out of these 12, where we can do dynamic allocation of MIG devices, MIG device alignment, and subdividing MIG devices, basically supporting partitioning of devices rather than only allocating out full GPUs. What is still unsupported, but we do have plans to support fairly soon and have designs in place for, even though we're still working out the details, is the ability to do custom policies to align multiple resources, such as GPUs and NICs or even two different GPUs, making sure that you have the optimal connection in terms of NVLink, NUMA alignment, and things like this.

The last one, though, where you can do very, very custom, application-specific policies for how GPUs are allocated across containers and pods, I don't see how we'll ever really be able to do with structured parameters. It's possible with the opaque parameters model, but I really don't see how we're ever going to do it with structured parameters, because this is the type of thing where you might have an application that says: I have two pods; if they can both fit on the same GPU on one node, give them access to that same GPU. If there's no single GPU where they could both run, but there are two GPUs on a node where they could each get one, give them each one. If that doesn't exist, but there are two nodes that each have a GPU that could satisfy one of the two, great, run them on those separate nodes, but also allocate an RDMA device that allows them to communicate very quickly with each other, right? With opaque parameters you can actually write a controller that encodes all of this logic and does what I just said, but there's no way we're going to be able to do that in a generic way. So we need to think hard about whether we want to support the use cases I just described or not, if we ever want to try and get rid of the opaque parameters model going forward.

One thing to point out is that, at least in this move from opaque parameters to the built-in parameters model we have today, from the perspective of a user of the GPU driver there are no changes needed to migrate from DRA with opaque parameters to structured parameters. So if you've already started playing around with this a little bit and have been experimenting with it, and you decide to try the built-in model and you're only doing things with one of those first six out of the 12 use cases, nothing should have to change from your perspective in migrating from 1.29 to 1.30. And I'll let Patrick finish up with this last slide.

Yeah, so the question that we discussed at the Kubernetes Contributor Summit really was: can we promote something to beta this year? Which is fairly aggressive. For 1.31, we decided in the end that yes, we can; we just need to figure out exactly what it is. What's the scope limit that we can achieve in 1.31?
Many of the enhancements that we still need for some additional use cases have to go in. Then we can finalize the API, and structured parameters will be the default mode. We keep the rest of the functionality, the core or traditional DRA, depending on what you want to call it, hidden behind a second feature gate. In Kubernetes 1.32, ideally we have some more feedback and we know that it works, and then we just flip the feature gate for structured parameters: that becomes beta in 1.32, hopefully. The rest will still be there, because we still see cases where it's useful; exploring additional controller logic is where we need the pod scheduling context. So that will still be available. Going beyond that, we need to think hard about whether we can still support it, or whether we arrive at a structured parameters model that is sufficiently capable that it works well enough for the majority of use cases, and then we might drop the rest of the code. The concern of the maintainers is clearly the complexity that we are now adding to Kubernetes, that this is not maintainable in the long run, because there are a lot of if/else cases in the scheduler now to support both of these models, and that just may be too complex for the future. But we are going to discuss that when we get to that point. Right now, the focus is 1.32. There's a lot of support in the community, because we all agree that Kubernetes needs to do something that makes AI training and inference work better natively, without the hacks that you might have to use with device plugins. So there's strong support from the community for structured parameters, and I'm hopeful that we can do it. And that's it. I think we do have time for questions, five minutes. There's a microphone over there, I see.

So I have a question about how this will work with network devices, where you can have, say, a device that can only take one connection at a time. Will the scheduler be able to know that it's the same device that is exposed on different nodes and ensure that it isn't used at the same time by different nodes?

So that would be a network-attached device that gets connected to a node. This is one of the use cases where we are currently still exploring how to use structured parameters. There's no concept yet for publishing that such a resource exists in the cluster in the first place; that's one of the missing pieces. And then there's the logic of, okay, how do we allocate it, and when do we enact any kind of cluster reconfiguration? Does that happen when a pod starts to run? That may work with the current model: you would have to have a DaemonSet that knows how to reconfigure that network-attached device. That might work. But yeah, it is a use case that has currently taken a bit of a back seat to just getting node-local resources to beta. But the core of the traditional DRA is still there, so if people want to explore such use cases, you can; you just have to use the traditional DRA with your own controller. And then, yeah, that's your limitation: you don't get cluster autoscaling, because we don't have a solution for that. But you can explore, you can do prototyping, you can come back with your use cases to the community and say, okay, this works perfectly for my network-attached device, let's figure out how to do the same thing with structured parameters. That's ideally the result of all these discussions.
Has this been designed with multi-tenancy in mind, so that you can use resource quotas?

We have a very simplistic idea for doing quotas at the moment. It's like volumes: you just count the resource claims per namespace. That would be fairly simple to support; I do have a PR pending for that. We didn't merge it yet because it wasn't clear whether it's sufficient. It probably isn't, because one claim can be for some very cheap hardware or it can be for a big GPU. We need these structured parameters to make the quota mechanism more capable and more specific. It is probably something that can be added, but for the sake of having a reasonably scoped feature set, we are not currently planning that for 1.32, to be honest. It's also worth pointing out that with the opaque parameters model, you could bring along your own custom way of doing quotas, because you have full control from within your controller over whether you want to allow the allocation to happen or not. Until the whole structured parameters discussion started, we were planning on shipping our own custom quota mechanism along with the Helm chart that deploys our driver, but we never got around to it because we've shifted our efforts more towards the structured parameters model. All right, thank you.

I have a quick one if there's time. What about scheduling? Once you have scheduled the pod and you have the resource slice, do you optimize like we do with persistent volumes, where it was scheduled perhaps with some labels on the nodes, like topology?

I'm not sure I get the question.

So if a deployment gets some resources, a GPU, and you need to reschedule it, do you optimize so that you don't need to go through the whole list of nodes and do the evaluation again, perhaps by recording in the resource slice which node it had previously?

So there is no rescheduling at the moment. That is not supported by Kubernetes itself, so we are bound by what Kubernetes supports here. Once a pod has been scheduled, it's bound to that node. It can only be deleted if it cannot run, and then we rely on an app controller to recreate the pod, and that pod will get scheduled again. There is a KEP in flight where someone is proposing to allow rescheduling: basically, the kubelet can mark a pod as not suitable for this node, try again elsewhere. But this is a new concept in Kubernetes; we don't have that yet.

All right, last question. I'm afraid we are at the end of the session.

You mentioned that the resources you define with dynamic resource allocation will be used for scheduling only. But is there a plan to also use them for setting a hard limit on the resources, like memory and CPU count, and to evict the pods if they exceed their limit, for example for GPUs or network?

So we don't really plan to touch anything with CPU or RAM at the moment. There are some experimental DRA drivers that add additional constraints about which CPUs may be usable for a pod, but that's only really piggybacking on this whole effort of having a custom API and then doing something special in their DRA driver on the node; it's not a main feature that we are trying to support. Okay. Well, thank you.