Hello, my name is Kevin Klues and I'm a Distinguished Engineer at NVIDIA. I'm here today to talk about a new Kubernetes feature called Dynamic Resource Allocation that will enable us to unlock the full potential of GPUs for AI workloads on Kubernetes. As a community, we've already built a solid foundation for GPU support in Kubernetes, and Dynamic Resource Allocation is the feature we've been waiting for to take us to the next level.

So what does GPU support in Kubernetes look like today? Well, it requires a mix of host-level components as well as Kubernetes-specific components to be deployed on each node of a cluster that contains GPUs. The host components consist of the NVIDIA Container Toolkit and the GPU driver itself, and the Kubernetes-specific components consist of those shown here, where the k8s-device-plugin is the one that ultimately makes the GPUs themselves visible to Kubernetes. With these in place, requesting access to GPUs takes the form of an extended resource called nvidia.com/gpu and a count of how many GPUs you want access to. Under the hood, the scheduler, the kubelet, and the k8s-device-plugin coordinate to find a node where GPUs are available, schedule the pod there, and inject the requested GPUs into the container that requested them.

If you need access to a specific type of GPU, a combination of node labels and node selectors can help direct the scheduler to a node containing the specific type of GPU you are looking for, in this case an A100 with 40 gigabytes of GPU memory. You also have the ability to request a fraction of a GPU through a technology called Multi-Instance GPU, or MIG for short. In the example shown here, we're requesting a MIG device that is one-eighth the size of a full A100 GPU, using the canonical naming convention of mig-1g.5gb.

In the last year or so, we also added support for sharing access to GPUs through oversubscription. On a per-node basis, cluster admins can configure the number of replicas they wish to oversubscribe each GPU by, and users request access to these shared GPUs with the special .shared extension shown here. Under the hood, simple time-slicing is used to swap one workload off and put another workload on the GPU after some fixed amount of time, and this can be layered on top of the fractional GPU support provided by MIG as well. We also provide an alternative to time-slicing through a technology called Multi-Process Service, or MPS, which provides a software solution for running multiple jobs on a GPU or MIG device in parallel. Unlike time-slicing, however, there are some caveats to getting it to run properly, as well as some extra burden on the cluster admin to configure it correctly.

With some changes to the system software, GPUs can also be made available to virtual machines through projects such as KubeVirt and Kata Containers. Instead of running the NVIDIA GPU driver directly on the host, it is instead run inside each virtual machine individually, with the standard Linux VFIO driver making the requested GPU hardware visible to each VM.

OK, so if that is what we can do today, what can't we do, or at least not do very well? Well, first and foremost, there is no support for having more than a single GPU type on a given node. As I mentioned before, node selectors can be used to direct the scheduler at a node that has a particular type of GPU, but that node itself must have a homogeneous set of GPUs to pick from.
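Before going further into the limitations, here is a rough sketch of what the request patterns just described look like with today's device plugin stack. The node label value and image are placeholders; the exact product label on your nodes is whatever GPU Feature Discovery reports, and the nvidia.com/gpu.shared resource name only appears if the cluster admin has configured time-slicing to advertise shared replicas under that name.

```yaml
# Sketch: requesting GPUs via the classic extended-resource interface.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  # Direct the scheduler at nodes with a specific GPU type (label value is
  # illustrative; it depends on what GPU Feature Discovery reports).
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
  containers:
  - name: cuda
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody   # any CUDA workload
    resources:
      limits:
        nvidia.com/gpu: 2              # two full GPUs, exclusive access
        # nvidia.com/mig-1g.5gb: 1     # or: a 1/8 MIG slice of an A100-40GB
        # nvidia.com/gpu.shared: 1     # or: one time-sliced replica of a GPU
```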
We've talked about enabling support for multiple GPU types per node by allowing admins to customize the name of the extended resource used to advertise each type of GPU on a node. However, the names start to get overly complex and long the more properties you encode in them, and we've never been able to agree on the right way to do this, so it just hasn't happened. One of the most requested features we get is the ability to ask for a GPU subject to some complex set of constraints, such as its minimum CUDA compute capability and/or the total amount of memory it has available. Unfortunately, node labels are not sufficient to encode all of the constraints that users find useful, and attempting to encode them in the name of the resource itself has the same problem mentioned previously with supporting more than one GPU type per node.

Providing users with the ability to oversubscribe GPUs and share them among containers was a huge win for many of our customers. However, there is no way to control exactly which container will get access to which replica of which GPU, limiting the overall usefulness of this feature for users with more stringent sharing requirements. Sharing via MPS has these same problems, with the additional caveats mentioned previously of requiring a host-level control daemon to be running, as well as requiring all pods that access GPUs to be run with host IPC enabled.

I didn't go into the details of this before, but there is currently no ability to dynamically provision MIG devices based on incoming resource requests. There are tricks you can do to add some level of dynamism to this process, but they are all shoehorned on top of the existing device plugin API and involve complex logic to trick Kubernetes into doing the right thing, meaning that the official k8s-device-plugin for GPUs simply doesn't support this feature. There is also no way to dynamically associate some GPUs with the native NVIDIA GPU driver and others with the VFIO driver such that a mix of containers and VMs can be run on the same underlying node with GPU support. It's technically possible to split the GPUs in this way; we just don't have a good way to do it dynamically based on incoming resource requests in Kubernetes. And with the advent of confidential computing and the hardware support we have for it in Hopper GPUs, this is becoming an increasingly common ask for many of our customers. And as you can imagine, the list of limitations goes on.

But this is where Dynamic Resource Allocation comes in. In addition to supporting all of the existing use cases covered by the GPU support we have in Kubernetes today, our DRA resource driver for GPUs addresses all of the limitations mentioned as well, and the rest of this talk is dedicated to showing exactly how this is done. We'll start off with a brief overview of DRA itself, followed by the specifics of the DRA resource driver for GPUs we have implemented at NVIDIA. I'll then jump right into a series of demos showcasing some of the more advanced features provided by this driver.

As its name implies, DRA is a new way of requesting access to resources in Kubernetes that has been available as an alpha feature since Kubernetes 1.26. It provides an alternative to the existing count-based interface of asking for nvidia.com/gpu: 2, for example.
And what makes it so powerful is that it puts full control of the API used to request resources directly in the hands of the vendors developing drivers against DRA. The key abstractions to keep in mind are the resource class and the resource claim, which are in-tree APIs referenced directly within a pod, as well as their corresponding class parameters and claim parameters, which provide a means for vendors to bring their own custom APIs to these abstractions. In this talk, I won't go much further into the details of DRA itself and how it works under the hood, but I encourage you to watch my talk from KubeCon EU last April if you'd like to learn more.

Okay, so assuming a cluster admin has enabled DRA and deployed the necessary DRA resource driver for GPUs onto a cluster, what does a user actually have to do to request access to a GPU? Well, starting from the example we used previously of requesting access to two GPUs, the equivalent request under DRA would look like this. Instead of requesting the two GPUs from an extended resource called nvidia.com/gpu, you instead create a resource claim template object pointing to the gpu.nvidia.com resource class. You then reference that template twice inside a new resourceClaims section of the pod spec, give each reference a local name, in this case gpu0 and gpu1, and then reference those claims from within a new resources.claims section of the container spec. Since this is a template, each reference will trigger a unique claim to be created under the hood, resulting in each reference pointing to a unique GPU, just like we had with the traditional extended resource interface shown on the left.

Now, this may feel overly verbose for the simple case of requesting exclusive access to a set of GPUs, but separating the declaration of the resource claim from its consumption means that we can naturally enable sharing of GPUs in a controlled way. In the example shown here, we reference a single resource claim from multiple containers within a pod, resulting in both containers getting access to the same underlying GPU. And by directly creating a global claim object, as opposed to a template, we can reference this claim from multiple pods, providing a natural mechanism for controlled GPU sharing across pods as well.

Now, this takes care of the limitation mentioned previously of not being able to control how oversubscribed GPUs get shared, but what about the rest of the limitations? Well, if you remember, I mentioned earlier that one of the most powerful features of DRA is the ability for vendors to define their own API for requesting resources, and this comes in the form of class parameters and claim parameters attached to resource classes and resource claims respectively. Resource classes are used to define the set of available resources themselves, as defined by a cluster admin, and resource claims are used by end users to ultimately request access to those resources. Since this talk is focused exclusively on the end-user experience, I won't go into the details of the class parameters provided by NVIDIA's DRA resource driver for GPUs, but in general, they work similarly to the claim parameters I'm about to describe.
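Before getting to claim parameters, here is roughly what the basic DRA request described above might look like in YAML. This is a sketch: the resource.k8s.io API version depends on your Kubernetes release (v1alpha2 in the 1.27/1.28 timeframe), and gpu.nvidia.com is the resource class installed by NVIDIA's driver.

```yaml
# Sketch: the DRA equivalent of asking for two exclusive GPUs.
apiVersion: resource.k8s.io/v1alpha2   # alpha API; exact version varies by release
kind: ResourceClaimTemplate
metadata:
  name: gpu-template
spec:
  spec:
    resourceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  resourceClaims:
  - name: gpu0
    source:
      resourceClaimTemplateName: gpu-template   # each reference gets its own claim
  - name: gpu1
    source:
      resourceClaimTemplateName: gpu-template
  containers:
  - name: cuda
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody   # placeholder workload
    resources:
      claims:                                      # consume both claims here
      - name: gpu0
      - name: gpu1
```

To share instead of isolate, you would reference the same claim name from multiple containers, or create a standalone ResourceClaim object and point multiple pods at it via source.resourceClaimName.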
So what do these claim parameters look like? Well, using the examples of the resource claim and resource claim template from before, you can basically just tack on an extra section to these objects called parametersRef, which points to an object that the DRA resource driver for your specific resource type knows how to interpret. The simplest claim parameters supported by NVIDIA's DRA driver for GPUs look like this: a CRD called GpuClaimParameters with a count of how many GPUs you want allocated to the claim. That means an alternative to our original example of requesting two GPUs being injected into a single container could have looked like this, where our claim parameters object declares that we want two GPUs to be allocated to the claim, and we plumb it through the resource claim template referenced by the pod.

Okay, so how can we expand on this claim parameters object to knock out the rest of the limitations listed previously? Well, let's start with this one: better support for MPS. A sharing section can be included that defines the strategy with which GPUs allocated to this claim should be shared, and one of the strategies that can be selected is MPS, where an optional set of parameters can be provided to customize how the MPS server does its partitioning. And that's basically it from an end user's perspective. As part of allocating the claim, an MPS server will be automatically started in the background with the parameters provided, and torn down once the claim has been deleted. Now, it's also worth mentioning that the default sharing strategy of time-slicing can also be configured here in order to customize the duration of the time slice used, where the values of default, short, medium, and long map directly to the inputs you would provide to nvidia-smi to control this from the command line.

Now, the next ones we'll take a look at are these two, and since the same set of fields in the GpuClaimParameters spec will allow us to overcome both of these limitations, we'll consider them together. The basic idea is to define a selector field, which can either contain a single property, such as the GPU model you would like allocated to the claim, or a list of properties ANDed or ORed together recursively, up to a nesting level of three. As a concrete example, consider the following selector, which, reading from top to bottom, says: give me either a T4 or a V100, and if you give me a V100, make sure it has less than or equal to 16 gigabytes of memory. Such a query would be useful, for example, to find an appropriate GPU to run an inference server on. Now, the set of constraints that can be provided here is basically anything you would be able to query about the GPU using nvidia-smi. And under the hood, the DRA driver for GPUs does not discriminate where these GPUs come from, meaning that you can now place different types of GPUs on the same node if desired.

Now, these last two limitations are interesting because we've chosen to overcome them a little differently than the others, specifically by introducing different claim parameters objects than the standard GpuClaimParameters object we've been working with so far. The MIG claim parameters object has a single field representing the MIG profile you would like to have allocated to the claim, and the VFIO GPU claim parameters object has a single selector field, which is the same as the selector field available to standard GpuClaimParameters objects.
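Putting those pieces together, a GpuClaimParameters object carrying a count, the T4-or-V100 selector just described, and an MPS sharing section might look roughly like this. The apiGroup, version, and the exact spelling of the selector and sharing fields are assumptions on my part; check the driver's CRDs for the authoritative schema.

```yaml
# Sketch only: field names below are illustrative, not the authoritative schema.
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  name: shared-inference-gpu
spec:
  count: 1
  # "Give me either a T4, or a V100 with at most 16 GB of memory."
  selector:
    orExpression:
    - productName: "Tesla-T4"
    - andExpression:
      - productName: "Tesla-V100"
      - memory:
          lessThanOrEqualTo: 16Gi
  # Share the allocated GPU among all consumers of the claim via MPS.
  sharing:
    strategy: MPS
    mpsConfig:
      defaultActiveThreadPercentage: 50   # optional MPS partitioning knob
  # Alternatively, keep the default time-slicing strategy and tune its interval:
  # sharing:
  #   strategy: TimeSlicing
  #   timeSlicingConfig:
  #     interval: Long    # Default | Short | Medium | Long
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: shared-inference-gpu-template
spec:
  spec:
    resourceClassName: gpu.nvidia.com
    parametersRef:                        # the extra section described above
      apiGroup: gpu.resource.nvidia.com
      kind: GpuClaimParameters
      name: shared-inference-gpu
```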
Whenever a claim that references a MIG claim parameters object is allocated, a GPU capable of dynamically creating a MIG device of that size is found, and the MIG device is created and then bound to the claim. When the claim is later deleted, so is the MIG device, freeing the GPU up to create MIG devices of different sizes, or to be used as a full GPU again, depending on future incoming requests. Whenever a claim that references a VFIO GPU claim parameters object is allocated, a GPU matching the specified selector is found, swapped off of the standard NVIDIA GPU driver onto the standard Linux VFIO driver, and then bound to the claim. The details of the runtime that ensures these GPUs eventually make their way into the VMs are out of scope for this talk, but all of the necessary annotations required to do so are added by NVIDIA's DRA driver for GPUs as part of the overall claim allocation process.

And with that, we've been able to demonstrate how DRA can help us overcome the primary limitations we have with supporting GPUs in Kubernetes today. Now, before I go on, it's worth pointing out that everything I've talked about so far has already been implemented and is available for you to play around with today, with one exception: this last part about the VFIO GPU claim parameters. We do plan to add support for this soon, but we are currently focused on getting a proper release of the rest of the features out, so it won't be available until some time next year.

And with that said, here's where you can go to download NVIDIA's DRA driver for GPUs and start playing around with all of this today. The driver itself is deployable as a Helm chart, and we have scripts to get you up and running quickly both in kind, assuming you have access to your own GPU hardware, as well as on a GKE Alpha cluster, which allows us to actually make use of DRA in a managed Kubernetes cluster, since GKE Alpha clusters enable all Kubernetes feature gates by default. In fact, we've used these scripts to set up the demos I'm about to show you now, where the first demo runs within a kind cluster on a DGX A100 box with eight A100 GPUs available, and the second two run on a GKE Alpha cluster with two different GPU node pools available, one containing a set of T4 GPUs and one with V100s.

So let's get started. This demo shows some of the advanced GPU sharing features enabled by NVIDIA's DRA resource driver for GPUs. In particular, it shows how one can request access to a hardware-partitioned GPU as opposed to a full GPU, and then layer a sharing mechanism on top of that that lets you further subdivide its resources among multiple clients. Each hardware partition is referred to as a MIG device, and the sharing mechanism, called MPS, relies on having a dedicated server process launched to service each client. When directed, the NVIDIA DRA resource driver for GPUs will dynamically create one of these MIG devices and launch an MPS control daemon to service its shared clients.

So jumping over to the code, we can see three windows. The first contains a job specification with its parallelism set to four. This means that four separate pods will be launched as part of this job. Each pod has a reference to a resource claim called mig-mps-sharing, which in turn is referenced by the single container embedded inside it. This container simply invokes a long-running n-body simulation designed to keep the GPU busy throughout the demo.
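The first window's job, together with the claim and claim parameters it points at, might look roughly like the following. Again a sketch: the MigDeviceClaimParameters group, kind, and sharing fields are assumptions, the image simply stands in for a long-running GPU workload, and the resource class name used for MIG-backed claims may differ in the actual driver.

```yaml
# Sketch: four pods from one Job all sharing a single dynamically created
# MIG device via MPS. CRD field names are illustrative.
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: MigDeviceClaimParameters
metadata:
  name: mig-mps-sharing-params
spec:
  profile: "1g.5gb"        # smallest MIG slice on an A100-40GB
  sharing:
    strategy: MPS          # launch an MPS control daemon for this device
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: mig-mps-sharing
spec:
  resourceClassName: gpu.nvidia.com
  parametersRef:
    apiGroup: gpu.resource.nvidia.com
    kind: MigDeviceClaimParameters
    name: mig-mps-sharing-params
---
apiVersion: batch/v1
kind: Job
metadata:
  name: mps-mig-shared
spec:
  parallelism: 4                               # four pods, one shared claim
  template:
    spec:
      restartPolicy: Never
      resourceClaims:
      - name: shared-gpu
        source:
          resourceClaimName: mig-mps-sharing   # same claim for every pod
      containers:
      - name: nbody
        image: nvcr.io/nvidia/k8s/cuda-sample:nbody   # placeholder long-running GPU workload
        resources:
          claims:
          - name: shared-gpu
```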
Now moving over to the second window, we see the actual specification of the resource claim itself, followed by the third window, which contains our custom claim parameters object, populated with the settings needed to request the smallest MIG device possible with MPS sharing enabled. Now, to make this demo a little more interesting, we're not going to launch this job with just this single container inside it. Instead, we're going to run it alongside three other containers, which provide a baseline for the feature it is demonstrating: the first with a full GPU and standard time-slicing enabled rather than MPS, the second with a full GPU and MPS sharing enabled, and the third with a similarly sized MIG device and standard time-slicing enabled. As before, we have the claims to represent these, as well as the claim parameters objects to define their settings.

Okay, so now for the interesting part. Let's start off by showing the current state of the cluster, which shows that we are running in a single-node setup with our DRA resource driver already deployed. Next, I'll show the current state of the machine in terms of how the GPUs are partitioned. At the moment, there are no partitions created, and we simply see eight full A100 GPUs available for allocation. Now let's go ahead and create a namespace to run our demo in and launch our job and its associated claims. Now that that's launched, we can run kubectl get a few times to wait for all of the pods to show up. As you can see, there are four pods associated with the job and two dynamically provisioned MPS control daemons, just as we would expect. With this next command, we can see that two MIG devices have been dynamically created inside GPU 4, as well as all of the processes from our n-body simulation running across all of our GPUs and MIG devices.

To get a closer look at the processes associated with each device, I'm going to run a script that parses the output shown here to grab a reference to each of the four devices that have been allocated and set an environment variable for each of them appropriately. I'll then run a series of Docker containers that have access to each of these devices individually, so we can see the processes running inside them. The first lets me see the processes for the full GPU with standard time-slicing, and the second lets me see the processes for the full GPU with MPS sharing enabled. As expected, the one on the top shows four compute processes; that's what the C here stands for. Whereas the bottom shows five processes: one MPS server and its four clients, denoted with the M+C seen here. If I do the same for the two MIG devices, I get similar results. Finally, I can clean up my environment by deleting the job and all of its associated resource claims, which will then trigger the MPS control daemons to shut down, as well as delete the dynamically created MIG devices on GPU 4.

For this next demo, I'll show how to use the selector field in the GpuClaimParameters object to direct the allocation of a resource claim to a particular type of GPU. To do this, I have pre-provisioned a GKE Alpha cluster with three node pools: one for running the control plane services, one containing T4 GPUs, and one containing V100s. The DRA resource driver for GPUs has already been pre-deployed on this cluster, as well as an extra helper daemon required for DRA to work in this environment, specifically to install the NVIDIA Container Toolkit and make sure that support for the Container Device Interface is enabled in containerd.
So jumping over to the code, we see two windows. The window on the left shows the set of commands I'm going to walk through to demonstrate this feature, and the window on the right shows where these commands will be run. So let's start off by first listing out all of the nodes in the cluster. As you can see, we have three control plane nodes at the top coming from the default pool, and three nodes at the bottom, the first two of which are from pool one and the last one from pool two. Next, I'll show the list of what are called node allocation state objects associated with our DRA driver for GPUs. There's one of these objects per GPU node, and they hold the state of which GPUs are available and which ones are currently in use on the node. As you can see, we have three of them, one for each of the three nodes from pool one and pool two. Now, dumping their contents, you can see there are two nodes from pool one with T4s on them, and one node from pool two with a V100 on it. Note the amount of memory each GPU has, as that will play an important role in how our GPU selectors decide which one to grab later on.

Now, jumping over to the specs I've defined to allocate claims and deploy pods in this setup, we see three windows. The first contains two GpuClaimParameters objects: one for requesting access to what I've called an inference GPU, and one for requesting access to what I've called a training GPU, where the inference GPU must have less than 16 gigabytes of memory and a CUDA compute capability greater than or equal to 7.5, and the training GPU must have more than 16 gigabytes of memory. As you can imagine, on the current cluster, this basically translates into a T4 for the inference GPU and a V100 for the training GPU. The second window defines the resource claim templates that refer to these GpuClaimParameters objects, and the third window defines the pod specs which reference these claims, one called inference-pod and one called training-pod. For the demo, all I'm doing is running nvidia-smi to verify which GPU I have been granted, and then sitting in a sleep loop so that I can pull this info from the logs later on.

So jumping back to the script, I first create a namespace called kubecon-demo and then create all of the objects referenced in my specs inside this namespace. Once that's done, I run kubectl get pod to show that my pods have started running, followed by another call to print the set of GPUs that have been allocated in each node allocation state object. As we can see, one GPU has been allocated to the inference pod from pool one, and one GPU has been allocated to the training pod from pool two, which is verified by printing the logs of each pod as well.

For the final demo, I will be showing how we have built a POC of integrating DRA support into one of NVIDIA's flagship AI products, the Triton Management Service, or TMS for short. For those of you not familiar with TMS, it provides an automated way of deploying multiple Triton Inference Servers onto a Kubernetes cluster, each of which may serve models with varying GPU requirements. At present, there is no good way for TMS to pick and choose which GPU a given server will be given access to when running in a mixed GPU environment. By integrating with our DRA resource driver for GPUs, TMS is now able to right-size the GPU given to a particular inference server.
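As an example of the end result, the objects TMS might generate on a user's behalf for a model that needs a Turing-class GPU with less than 16 gigabytes of memory could look roughly like this. This is a hypothetical rendering that reuses the field names assumed in the earlier sketches; it is not TMS's confirmed output format.

```yaml
# Hypothetical sketch of what TMS could generate for one Triton server.
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  name: triton-server-1-gpu
spec:
  count: 1
  selector:
    andExpression:
    - architecture: "Turing"          # field names are illustrative
    - memory:
        lessThan: 16Gi
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: triton-server-1-gpu
spec:
  resourceClassName: gpu.nvidia.com
  parametersRef:
    apiGroup: gpu.resource.nvidia.com
    kind: GpuClaimParameters
    name: triton-server-1-gpu
```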
Under the hood, TMS does this by translating a model's GPU requirements into a set of selectors, building a GpuClaimParameters object from them, and then referencing that object in a resource claim. The same GKE cluster used in the last demo is also being used here. So just like before, let's start off by showing the set of nodes in the cluster, as well as the node allocation state objects listing which GPUs are available on which nodes. Next, I'll show the TMS server running in the kubecon namespace on our cluster, and reference a command I've already run in the background to port-forward access to the TMS server onto my localhost. Next, I'll use a command called tms-ctl to launch two different Triton Inference Servers. The first specifies that it wants a Turing GPU with less than 16 gigabytes of memory, and the second specifies that it wants a Volta GPU with more than 16 gigabytes of memory. Once these are deployed, I'll run kubectl get pod in the kubecon namespace to verify that they're up and running. Once that's done, I'll dump the node allocation state objects to show which GPUs have been allocated on which nodes. As expected, we see one T4 GPU allocated from a node in pool one, and one Volta GPU allocated from a node in pool two. Executing into each server and running nvidia-smi then verifies which GPU was granted to each server. And that's it. Thanks for listening. If you have any questions, feel free to send me an email or find me on the Kubernetes Slack.