All right, I think it's time to start. Thanks for the jokes. So welcome to our session, and thank you for sticking around until the last session of the day, and also for walking to the last part of the building. Welcome to our session, the hidden heroes behind AI, making sense of GPUs, TPUs, and Kubernetes. My name is David Porter. I'm from Google. I work on the Google Kubernetes team, I work on Node, and I'm also a maintainer in SIG Node. This is Evan.

Hi, I'm Evan. I'm with NVIDIA, as the slide says. I'm on the cloud native team, and we build everything that's required to run GPUs in containers, right? So the Container Toolkit, if you've heard of that, the device plugin, the GPU Operator, all of that good stuff.

So these kinds of devices and accelerators have been quite a hot topic at this conference already, and they're becoming increasingly important for running complex workloads like machine learning training and inference, and the demand for them has been growing over time as well. It's even more important for us to be able to access these GPUs, TPUs, and even FPGAs in our Kubernetes clusters so that we can run these jobs. So in the talk today, we'll cover how Kubernetes actually integrates with these devices (a little bit of a spoiler: it's through device plugins) and how device plugins and device allocation actually work. We'll also look at some usage examples of GPUs and TPUs on Kubernetes, give some thoughts, details, hints, and tips on operating clusters with TPUs and GPUs, and close with a brief outlook on what we see as the future of devices in Kubernetes.

Now, first of all, what is a device? Well, essentially, it's this thing here, right? But that's not very useful. So a device with a capital D, or a resource, is something that a user wants access to for a specific purpose, such as training a machine learning model. If you go down through the various levels of abstraction (or up through them, I suppose), you end up with a collection of device nodes, libraries, and utilities that are required to actually access the device in an environment such as a container. In Kubernetes, these are exposed as countable extended resources, which can then be requested by a user. And in order to do this, one requires a per-node device plugin. These device plugins register with the kubelet under a specific name, such as nvidia.com/gpu, which I'm sure some of you have already seen. The device plugin then lists the devices available as a set of opaque IDs and may provide specific hints to the kubelet in order to allocate these devices. Once an allocation of a device takes place, the device plugin can provide information about modifications that need to be made to the container spec as it's being created. This includes device nodes, mounts, environment variables, and annotations. And something new: there's an alpha feature in 1.28, which should be promoted to beta in 1.29, for specifying CDI device names. CDI, the Container Device Interface, is a CNCF-sponsored project under TAG Runtime that provides a way for vendors to declaratively specify what a device means, a capital D device. This includes device nodes, mounts, environment variables, and hooks, and it tries to be for devices what the OCI runtime spec is for containers.
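To make that concrete, here is a rough sketch of what a CDI specification for a GPU-like device might contain. CDI specs are typically placed as JSON or YAML files under /etc/cdi or /var/run/cdi; the version, device names, and paths below are placeholders, not what any particular vendor's tooling actually generates.

```yaml
cdiVersion: "0.5.0"
kind: vendor.example.com/gpu        # <vendor>/<class>, the class part of a fully qualified name
devices:
  - name: gpu0                      # referenced as vendor.example.com/gpu=gpu0
    containerEdits:
      deviceNodes:
        - path: /dev/example-gpu0   # device node to create inside the container
      env:
        - EXAMPLE_VISIBLE_DEVICES=0
containerEdits:                     # edits applied for any device from this spec
  mounts:
    - hostPath: /usr/lib/libexample-gpu.so.1
      containerPath: /usr/lib/libexample-gpu.so.1
      options: ["ro", "nosuid", "nodev", "bind"]
  hooks:
    - hookName: createContainer
      path: /usr/bin/example-cdi-hook   # e.g. to update the loader cache
```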
These modifications associated with a device, a capital D device, map to OCI runtime spec modifications, and once these modifications are made, you should have access to your device. These devices can be referred to by a locally unique name or by a fully qualified CDI device name, which includes a vendor component and a class component in the name. Another thing to note is that the CRI was extended to include support for CDI device fields in the 1.27 release.

Now, in terms of how CDI works in practice, right? First of all, we as vendors would generate a CDI spec using some vendor tooling, maybe that should read vendor one, vendor two, vendor three. This tooling could be run once off, and that could be seen as a static config: if you know everything about your device a priori, then you can generate the specification, put it on your node somewhere, and it's available for things that need it. It could also be generated dynamically if you're trying to do something a bit more interesting. In terms of how the specs are consumed, a container runtime receives a request over the CRI to create a container. The CRI request can include the CDI devices in a field in the spec, or possibly as annotations, which were used before the field in the spec was added. Having selected a device, the container runtime, such as CRI-O or containerd, which both support CDI natively, reads the CDI specifications that were generated previously by the vendor tooling for the selected devices, applies the modifications defined by the CDI spec to the OCI runtime specification, and then invokes runc as normal. runc then reads this OCI spec, which now includes the modifications, and creates the container using that specification. Because it includes these modifications, the container that was created has access to the devices as required. Now we're going to zoom out a bit, and for that I'm going to hand back to David.

Cool. So now that we understand what's going on at the low level, let's zoom out to understand how this fits into the whole Kubernetes ecosystem. We have all these components running here. So where does it start? It starts with the actual device. In this example, we have two GPUs, and you need to set up a node, get that hardware on there, and install the drivers. We're talking about the actual kernel drivers that can interface with the device. Once we have that, the first thing we need is to deploy a device plugin. The device plugin is a component that the vendor builds that is basically the proxy between the kubelet and the actual device. So in this case, you deploy the NVIDIA device plugin that knows how to talk to these GPUs. The device plugin's job is basically to communicate with the kubelet and the actual GPUs to advertise them. The device plugin starts up, it talks to the GPUs, it says, hey, I have two GPUs, and then it talks to the kubelet. The kubelet then makes these calls to the device plugin and updates the node capacity on the API server. The node already has some capacity for CPU and memory, and now the device plugin will also tell the kubelet: I also have nvidia.com/gpu: 2, whatever that means. And the kubelet and the other components in the ecosystem usually don't understand what nvidia.com/gpu actually is.
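As a rough illustration of what that capacity update ends up looking like on the Node object (the node name and resource numbers here are made up):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
status:
  capacity:
    cpu: "16"
    memory: 64Gi
    nvidia.com/gpu: "2"    # the extended resource advertised by the device plugin
  allocatable:
    cpu: "15900m"
    memory: 60Gi
    nvidia.com/gpu: "2"
```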
To the kubelet and the rest of the system, it's just an extended resource, right? It's a name, and then it's some count of them. That's what we mean by an extended resource type. So the API server now knows that this node has this much of this resource, and now the other actors in the system can call out to the API server and be made aware of this.

So the next step: a user comes in and deploys a pod. The pod comes in, and in the pod spec you have some number of requests, right? You're already requesting CPU and memory, and then you also add a request with the same extended resource name and say how many you want: nvidia.com/gpu: 1, and you put that in your container spec. From there, you submit that pod, it goes up to the scheduler, and the scheduler, since it knows the node capacity of all the nodes, can figure out which node has that resource available and schedules the pod to a node that is suitable for it.

Once the scheduling takes place, the next actor is the kubelet. The kubelet sees the pod, and it does two things. The first thing it does is talk to the device plugin, and we'll go into a little more detail on that in a bit. Then it also talks to the container runtime to actually start that pod. So it talks to the device plugin, which will allocate a device for this pod. The device plugin comes back with some information, and then the kubelet talks to the container runtime and passes along some extra information on how to make this device actually accessible to the workload. From there, the container runtime talks to runc. runc is the low-level component that actually goes ahead and creates the container, and it'll have, in the specification there, the actual access to the device: the mounts, libraries, and other config to make sure that the workload can access that device. In some cases, there's some vendor magic here, at maybe the container runtime level or the runc level, that does a little bit of work to inject some stuff at some layer of the stack, but that's vendor-specific. And then at the last step, runc goes ahead and creates the workload: the process that needs access to that device. When the workload actually starts up, it can just talk to the device directly. It doesn't talk to the device plugin or anything else here; it just has direct access to the device. So that's the overall workflow of how a pod gets access to a device and how the device plugin advertises those devices.

One of the critical elements here is the device plugin, so we want to spend a little more time going into the device plugin: how does it actually work, and what are the steps in it? The first step is device registration. Device registration is the process where the device plugin starts up and actually discovers what devices are running there. To back up for a second, the device plugin is usually deployed as a DaemonSet in your cluster, but it's just an arbitrary process that runs on your node and communicates with the kubelet. How does it do the communication? There's a well-defined device plugin protocol, and it talks over gRPC. So it talks to the kubelet over a Unix domain socket and communicates over gRPC. So the device plugin starts up.
It basically connects to the kubelet and says, hey, I'm going to register, and I have this resource type, I have this resource. This is where the nvidia.com/gpu actually comes from: the device plugin tells the kubelet, I have this device available. From there, the kubelet goes and talks to the device plugin. It asks, what are the device plugin options? This is some extra information that the kubelet can obtain, for example whether the device plugin should be notified before a container starts. It can also provide some hints around allocation, for example for topology-aware scheduling. And then the last call is ListAndWatch. ListAndWatch is a gRPC streaming call, which means the kubelet makes this call and keeps that connection open. The whole point of this call is to return the actual devices. So in this call, you'll get back the two devices, GPU 0 and GPU 1, and a health status for each one: healthy equals true for both. The reason this is a long-lived streaming gRPC connection is that devices can go unhealthy, and if a device goes unhealthy, the kubelet wants to be made aware of that. If a device goes unhealthy, the kubelet will talk to the API server and update its capacity; it might update nvidia.com/gpu from 2 to 1. And if a pod is actually using that device, the kubelet will fail that pod to make sure that, if another controller is managing that pod, a new pod can get created that will continue the work needed. So that's the registration phase. That's how the kubelet and the device plugin orchestrate the initial phase.

So what's the next step? The next step is when an actual workload comes in and we need to give that workload access to a specific device. This is the device allocation phase. It's important to know here that the kubelet is actually the one responsible for figuring out which device to give to the workload. The kubelet knows that it has these devices available, and it can pick GPU 1 to give to that workload. The device plugin can influence that decision by giving some hints to the kubelet, but ultimately it's the kubelet's responsibility to figure out which device. So the kubelet comes in and says, I want you to allocate GPU 1 for this pod, for this workload. The device plugin responds with basically this list of fields, and these fields are the changes that need to be made to the container spec to ensure that the workload can actually access the device. These are things like modifying some environment variables for the device, adding some mounts, adding some device mounts, annotations, and lastly the CDI devices, which were mentioned earlier. The kubelet will later take all this information, apply these patches to the container, and give it to the container runtime to actually start. The last step of the process is actually starting the workload. The kubelet will start the container, and right before it starts, it'll talk to the device plugin and say, I'm going to start this workload with this device. And the device plugin can do something like initializing the device, whatever it needs to do to make that device ready to be consumed.
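Pulling the user-facing side of that flow together, a pod that asks for one GPU looks roughly like this; the image and command are just placeholders, and note that extended resources are requested in whole units, typically under limits:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
    - name: cuda-test
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # the extended resource advertised by the device plugin
```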
So after all of this, the workload has started, everything's great, and the pod can use that device.

Now we want to step back for a second and take a look at some of the other devices that exist and how they also make use of the device plugin and the Kubernetes ecosystem. So I want to introduce TPUs for a second. TPUs are a purpose-built accelerator for inference and training, built by Google. They're optimized for training and inference of large AI models, like the LLMs that are so popular these days, gen AI models, image models, and so forth. There are two flavors of TPUs: TPU devices and TPU slices. TPU devices are independent devices. The whole idea is that they're not interconnected to other TPUs; it's just one device that one workload uses, kind of like a GPU. TPU slices are an interesting, different flavor of TPU: they're basically groups of TPUs, multiple TPUs on different machines, all interconnected with a very high-speed interconnect, which allows you to get very good performance out of them. With these TPUs, you can use existing ML frameworks, PyTorch, JAX, TensorFlow, and so forth, to write workloads against them. But the interesting thing here is, how does Kubernetes play into this? How does the device plugin fit into this? So let's look at that.

First of all, why do we need Kubernetes to manage TPUs? I think the answer is the same for any of these workloads, but especially for TPUs: because you have so many of them, running across many different machines, you need something that will orchestrate all these workloads, scheduling them in the right place, monitoring their health, and making sure they run. That's where Kubernetes comes in helpful. Now, how do we actually integrate with TPUs? We built a TPU device plugin, just like any other device plugin, that exposes the TPUs as a resource type. We call it google.com/tpu, and you put there how many TPU chips you want to access in your workload.

Now, scheduling is an interesting thing to talk about when it comes to TPUs. For TPU devices, it's pretty simple; it's just a one-to-one mapping. You have one TPU device, you have a pod, you have a container, and it uses some amount of TPU there. TPU slices are a little more interesting because, since they're spread across multiple machines, you need to figure out how to schedule multiple replicas of your workload, how they all intercommunicate, and how you set all of that up. It's kind of like a gang scheduling type of problem, in the sense that you need to schedule multiple replicas, they all need to talk to each other, and they all need to be up to ensure that your job continues. So how do we use the Kubernetes primitives to make that actually work? Here's an example of how you would set that up and what we recommend. The first step is you set up a headless service. The reason we need to do this is that we want each of the replicas to have a DNS name, because all of those replicas are going to have to talk to all the other replicas to actually do work here. And since we have to set up multiple replicas, we're going to use a Kubernetes Job to represent this. So here, we're setting up a Pod slice job; we're going to use an Indexed Job. And here we set the node selector: we're going to use a specific version of TPU.
TPUs come in different topologies and so forth, and we pick this topology. Because of this topology, we have a certain number of chips per node, and if you do the math here, you end up with four different nodes. We want to schedule a pod on each of those nodes, and that's why we set completions and parallelism to four there. Then you set up some environment variables, and you set up your TPU worker hostnames. These are the DNS entries, how the TPU software will find all the other TPU workers that are out there, because you have multiple of them and they all need to intercommunicate. That's where we specify the DNS entries. Then you have your actual workload. This workload is installing the libtpu software and just doing a hello world, printing how many TPU cores there are. And then finally, you set your requests, and this is where we set that google.com/tpu: 4. This is how the device plugin and the scheduler will actually give access to that device. So this is how you would set up a workload for TPU Pod slices, using all these existing Kubernetes primitives (there's a rough sketch of these manifests below).

So I want to give you a quick demo of how this works. In this demo, we're doing inference. There's this GPT-J model, we're using an inference server called SaxML, and we're trying to run this. So I'll give you a quick look at it. We start by setting up a GKE cluster, just a standard cluster, no TPUs yet. The next step is we create a node pool; we're using v5e TPUs, so you just specify the machine type there, and now we have those TPUs. Then we're going to use Gateway for this, because we want an inference server, so we're setting up a subnet and some of the other networking things needed here. Now we're getting the Kubernetes credentials so we can use kubectl. We have one TPU machine running here, a GKE TPU node. And now we're going to deploy all the workloads, all the YAML manifests. The Sax model job is what I showed you earlier; it's that job configuration. And Sax is the inference server, so it's going to start up and create an endpoint that we can send inference requests to. So we wait for everything to start up. When the inference server starts, it downloads the model from a GCS bucket. And now we're going to make some queries against it. We get the IP for it, because it's behind a service, and we try to make some inference requests. First, we ask basically, what is the model that's loaded? You can see here it's the GPT-J model, and you can see we're running one replica. From here, we actually make a request. So here's your LLM query; it's a summarization task. It gives a news article and asks to summarize it. That's the prompt. And then we actually do the inference call, and we get all the responses and summaries, all done on TPUs. So we have our response there.

Now, what's the benefit of running this on Kubernetes, in GKE? One is autoscaling: in the first step, we actually set up an HPA.
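Just to recap the TPU slice setup described before the demo, here is a rough sketch of what those manifests can look like, following the pattern GKE documents for TPU slices. The accelerator and topology labels, image, command, environment variable values, and chips-per-node number are illustrative and depend on the TPU type and topology you actually pick.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: headless-svc          # gives each worker pod a stable DNS name
spec:
  clusterIP: None
  selector:
    job-name: tpu-slice-job
---
apiVersion: batch/v1
kind: Job
metadata:
  name: tpu-slice-job
spec:
  completionMode: Indexed
  completions: 4              # one worker pod per node in the slice
  parallelism: 4
  template:
    spec:
      subdomain: headless-svc
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice   # illustrative
        cloud.google.com/gke-tpu-topology: 4x4                       # illustrative
      containers:
        - name: tpu-worker
          image: python:3.10  # placeholder; install libtpu/JAX and run your workload
          command:
            - bash
            - -c
            - pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html && python -c 'import jax; print(jax.device_count())'
          env:
            - name: TPU_WORKER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
            - name: TPU_WORKER_HOSTNAMES
              value: tpu-slice-job-0.headless-svc,tpu-slice-job-1.headless-svc,tpu-slice-job-2.headless-svc,tpu-slice-job-3.headless-svc
          resources:
            limits:
              google.com/tpu: 4   # TPU chips per node for this topology
```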
And that HPA setup will work regardless of the device you have, right: GPUs, TPUs, whatever it might be. So now we set up a load test. We send a whole bunch of calls to that inference server, and the HPA is going to see it. It's going to spin up more pods; it has to scale up because there's not enough capacity. The cluster autoscaler is going to see that, and it's actually going to provision another node. You can see here on the UI, it started to update and provision a new node. A new node is going to start up here, and it's going to be able to schedule another pod to handle this increased load. You can see the second node started here, so you have two nodes now. And then you can look at the pods: a new pod was created, it's creating, it's already scheduled, and now it's going to be able to withstand that increased load. That's the power of just using Kubernetes and the existing primitives that exist in Kubernetes to run these types of workloads. Also, all these workloads and all these inference chips are pretty expensive, so cost optimization is really important. You want to scale down when you're not using them, and that's going to happen right now. It's going to see that there's no more load, it's going to scale down the deployment, and as a result it's going to delete those pods. Then the cluster autoscaler is going to see that the node is not being utilized, and it's going to delete that node, and we're back to where we started with a single TPU node. So we're making sure we don't keep extra resources when we don't need them. That's the demo, just to give you a taste of how this all fits together.

Cool. Oops. All right. So now that you've seen some of the ideas of how you can use this with a real device and how it works underneath, we wanted to give you some tips from an operator standpoint. As an operator, what do you need to be aware of when you're setting up these types of devices? The first step is you need to actually create a cluster and provision the nodes with the right resources. With a cloud provider, that might be as simple as creating a node pool with the devices you need, or maybe it's actually getting the devices yourself; that's up to you to figure out. Once you have that, you need to actually set up those devices, so install the drivers and the device plugins. On cloud providers, this is usually pretty easy and built in, and NVIDIA has the NVIDIA GPU Operator, which is a collection of components that does this for you on common OSs and so forth. It'll install the drivers, the right device plugin, and all the components needed to set this up.

So what do you do after that? The next step, once you have a cluster full of nodes with these devices, is to label them. The reason is that you might have a cluster with a lot of different nodes, with different capabilities, with GPUs, TPUs, different models, et cetera. And it's important to remember that with a device plugin, the resource type is nvidia.com/gpu, regardless of what type of GPU you have. So whether you have an A100 or a T4 or whatever GPU, they're all advertised the same way. So you need node labeling to ensure that the right workloads get to the right devices, and you have to label your nodes based on the resources they contain (there's a small sketch of how this looks on the scheduling side just below).
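Here's a rough sketch of how the labeling, the tainting described just below, and the resource request come together on a workload. The model label key and values are illustrative; cloud providers and tools like GPU Feature Discovery each use their own label names.

```yaml
# Illustrative node setup an operator or a discovery tool might apply:
#   kubectl label node gpu-node-1 gpu.example.com/model=a100
#   kubectl taint node gpu-node-1 nvidia.com/gpu=present:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  nodeSelector:
    gpu.example.com/model: a100    # only land on nodes with the right GPU model
  tolerations:
    - key: nvidia.com/gpu          # tolerate the taint that keeps non-GPU workloads off
      operator: Exists
      effect: NoSchedule
  containers:
    - name: trainer
      image: example-training-image:latest   # placeholder
      resources:
        limits:
          nvidia.com/gpu: 1
```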
Some cloud providers apply these labels automatically, and NVIDIA, for example, has a project called GPU Feature Discovery that does this automatically: it figures out which GPUs are on the node and labels the node with the device name. Now, another problem is that since these accelerators are really expensive, you don't want to run workloads that don't need those devices on GPU nodes, and that's where node tainting comes in. If you taint the nodes with accelerators, then by default a workload that comes in won't be scheduled to those nodes unless it explicitly has tolerations. And that's probably something you want, because you don't want random web servers or things that don't need GPUs or TPUs or whatever device you have landing on those nodes. Then the last step is to schedule your actual workloads, and you probably want to use a combination of node selectors based on the node labels you set earlier, tolerations for the taints, and requests for whatever device plugin resource name you want to use.

Lastly, for GPUs and other devices, utilization is super important because these are expensive resources. There are different resource sharing schemes for how you take a whole GPU and split it up into smaller pieces, smaller chunks, to allow multiple pods and multiple different workloads to make use of it. Technologies like MIG, time-slicing, and NVIDIA MPS provide that ability, and you can find more information about them. And when you're running these workloads, you want to monitor them to make sure they're performing well and working as expected. Some cloud providers have built-in metrics for this. On GKE, for example, we have a duty cycle metric that tells you how busy your GPU was during some period, and we have a TPU metric, tensor core utilization, that provides an analogous signal for TPUs. If you want more advanced metrics, NVIDIA has a project called the DCGM exporter. It's a Prometheus exporter that provides super detailed information about GPUs, and it also integrates with the kubelet Pod Resources API, so you can get per-pod metrics and understand how much this pod was utilizing the GPU versus that pod, and so forth. So that's accelerator monitoring. Now I want to hand it back to Evan to talk a little bit about the future, where we see devices going in Kubernetes.

Cool. Thanks, David. Yeah, so one of the things we're particularly excited about is dynamic resource allocation. This is a new way to request resources that's been available as an alpha feature since Kubernetes 1.26, and it's an alternative to the counting-based interface that the device plugin provides. Instead of only being able to select whole, integer units of devices, it puts full control of the API for requesting these resources in the hands of third-party developers. Third party, in this case, generally means device vendors, so third party in the context of Kubernetes. Here, we map in-tree API objects to vendor-specific APIs, which allows for that extensibility. I'm not going to go into the details; there are other talks about that. And DRA uses CDI behind the scenes: once a CDI spec is available for a particular device, probably dynamically generated, it is passed to the container runtime.
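As a sketch of the general shape, this is roughly what a DRA request looked like under the alpha API around Kubernetes 1.27/1.28: a ResourceClaim pointing at a vendor-provided resource class, referenced from the pod spec. The API group, class name, and fields have been evolving while DRA is alpha, so treat this as illustrative rather than exact.

```yaml
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  resourceClassName: gpu.example.com    # provided by the vendor's DRA driver (illustrative name)
  # A parametersRef can point at a vendor CRD describing constraints,
  # e.g. minimum memory or CUDA compute capability.
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-example
spec:
  resourceClaims:
    - name: gpu
      source:
        resourceClaimName: gpu-claim
  containers:
    - name: app
      image: ubuntu:22.04               # placeholder
      command: ["nvidia-smi"]
      resources:
        claims:
          - name: gpu                   # this container uses the claim above
```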
Now, some of the reasons we're very excited about DRA are that it enables a lot of functionality that isn't really possible with the device plugin at the moment. Because you're no longer tied to a single countable resource name, it supports multiple device types per node, so you can have a heterogeneous node and allow workloads to be targeted at specific devices. Remember that the node labels you can use are node-specific, not device-specific. It also allows for explicit sharing of devices across containers and pods. With the device plugin as it exists, any device request is for a specific container, and there's no way to explicitly share that device across various containers or pods. So if you're running a more exciting, distributed ML application, for example, you may want to share a device across those different pods or containers. You're also able to select resources based on constraints, such as available memory or resource capabilities like the CUDA compute capability or version, et cetera. Those could be done with labels to some extent, but once again, labels are node-specific, not device-specific. It also allows for dynamic provisioning of resources. David mentioned MIG, where you basically take a full GPU and slice it up into hardware slices, and DRA allows you to do that dynamically: a request comes in for a particular type of GPU or slice of a GPU, and the driver can dynamically create that MIG slice on a MIG-enabled GPU and provide it to the pod or container that's requesting it. It also provides better support for features like MPS. MPS is something that requires an additional process to be started to allow clients to access a device, and in the DRA driver that we have implemented, you're able to start this MPS daemon as part of that process, so you don't have to worry about it as a user. And because of all this extra control, you're also able to right-size your device request for the application you're trying to run. If you're just doing some lightweight iterative development in a notebook, you might be able to select a smaller device than if you're trying to run inference or training for a large ML model.

One thing to note, the caveat here, is that because of this flexibility, there are some implications regarding integration with the scheduler and autoscaler, and there are ongoing discussions in the community around that. So there are still some problems to solve before we're at the point where we can say this is going GA. And I think this is where there's a call to action: from the community, we need input on what problems you're currently having with using accelerators, and what's limiting your current use of the existing device plugin model. This will allow us to get more information as to whether the APIs we're exposing in DRA, as it is designed, are the right APIs, and which use cases are important to design for. So yeah, I think we provide some links in the slides, but at this point, we can open for questions. Thank you very much.

Hey, thank you for the nice talk. Listening to your excellent explanation of the process by which a GPU is discovered and made available in Kubernetes, I went back years in my memories, and I was thinking of extended and expanded RAM, if you were born in those times. At a certain point, computers had different types of additional RAM.
And we needed drivers or ways to discover it and make it available to the operating system. And I was thinking, today, when I request RAM, I just say how much, and when I request CPU, I just say what fraction of a CPU I want. Would it not be nice to have the same for the GPU: how much GPU memory I need for my container, and what fraction of the GPU I want to use? So my question to you is, are you aware of any movement in Kubernetes in this direction, to make access to the GPUs uniform, transparent, or easy, let's say? Which answers your last question about what kind of problems we see using our accelerators.

Do you want to answer first? I mean, in general, I think what you mentioned is that we see devices in the device plugin as static today. The big thing you're talking about is wanting to request a device that needs to be repartitioned on the fly, with maybe some more memory or something like that. Then there's a whole question around other devices which are network-attached, where you can attach some other device on the fly or add something else. So I think that's one of the limitations in the device plugin model today: it doesn't have very good flexibility for devices changing on the fly. DRA is one approach to solve that problem, and we're looking into that approach because we see more and more devices with these kinds of needs. Yeah, and even if we get to a point where that's an API we can expose, there's still vendor-specific logic behind that, because not all devices are fractionally addressable the way memory is, or CPU is to some extent. So that's part of what DRA also tries to address: allow that flexibility to, in the background, enable things like MPS and MIG, to allow fractional access to a certain degree. Sharing is what you end up having, yeah. Thanks.

Hello. Yeah, great presentation. Thank you. So I have a question about DRA; it's kind of a future thing, maybe. You said it could support different types of devices on the same node. That's fantastic. I'm wondering, even if it supports dynamic resource allocation, does it also support dynamic assignment to a different task or job, or does it still need to be pre-configured for them? Or do you have a discussion about that? Sorry, what do you mean by dynamic assignment? So when a job or task comes in, could the scheduler find the best device and send it there, that kind of thing, dynamically? So the definition of best is something that is probably vendor-specific as well, to a large extent, and that's where the selection of devices based on some criteria comes into play. Because we have this flexible API, the early implementation we have allows you to say, I want a device that has this CUDA compute capability and at least this much memory, and you can extend that API as well. So if there are A100s in your system and they meet that specification, they are selected; if T4s meet that specification, they are selected. I think Kevin gave a virtual talk yesterday where he demonstrates that flexibility in terms of device selection. So as a user, you can select that; you have to give the scheduler that information, and with that information, the scheduler can then select the node where that device is, and then that device is the one that is associated with that request.
OK, got it. So it's still the user that pre-configures them out there, to already define kind of a resource pool, right? Yes, the user has to specify that up front. But DRA is also an API that you can build tools on top of. So some queue or something, insert hand-wavy thing here, can analyze a job, understand what that mapping is, and convert it to a DRA request. That's actually what our Triton management server does, and we have a demo including that, where on the user side, the UX side, you're exposing some other concept, like, I want an inference GPU, and the Triton management server converts that into some equivalent DRA request. So it's not happening in Kubernetes, because it's maybe more domain-specific, but it provides that flexibility. OK, got it. Thanks.

So I have more of an understanding question. In this setup, when there is, let's say, one GPU resource on a node, can multiple pods use the resource at the same time, simultaneously? Or do they have to context-switch out and take turns right now? Yeah, so right now, the way the device plugin model works is that once that device is allocated to a workload, no other workload can use that device, right? Some folks have come up with workarounds where, for example, they advertise multiple devices in the device plugin model, and underneath it's actually the same device, just advertised with different names, and then multiple workloads can use it. But that's actually one of the problems we're trying to solve, and we see it as one of the gaps in the device plugin model today: it doesn't really have good sharing of devices, and DRA is trying to solve that problem as well. Yeah, so our device plugin implementation, and the one in GKE, does support time-slicing, but that's just oversubscription, and you, as a user, have no control over which device is actually selected in the end. So it could be that you end up on a device that already has other processes running. And there are also not the same memory isolation guarantees that you may have with something like MPS. There is isolation in that you can't read another process's memory, but if that process uses all the memory, then you could, for example, run out of memory at the device level. So there is some support for it, but it's more of a workaround at the moment.

Gotcha. So will there be dedicated hardware-based support for that? Like, CPUs have virtualization technology for that, right? Would there be future support for something like that, like splitting existing hardware resources? So, in terms of hardware support, you have something like MIG, which is hardware-level partitioning of a larger device, and what we have there is you expose each one of those hardware slices as a separate device. You can also layer time-slicing or MPS on top of that, so you're able to use the various sharing mechanisms in combination and select something that works for you. It's a multi-level sharing problem there. I don't know if that answers your question. One of the links I provided is about sharing in Kubernetes with our existing device plugin, so you can have a look there and maybe that answers some of your questions further. OK, sounds good. Thank you. Cool, thanks.

Thanks for the great talk. And I have a question about DRA. What's the largest blocker for making it GA? Sorry. What's the largest blocker for making it GA?
For DRA, the largest blocker? I think so. There's a little thing called the scheduler and the autoscaler. No, so in the current implementation, the DRA driver communicates with the scheduler through the API server, and that adds quite a bit of scheduling overhead. And because the scheduler is sequential for a large part of its scheduling process, things that take long there end up blocking all the other pods that should be scheduled, even if those are not using claims that are associated with devices. I think that's a rough summary; Kevin can add more context there at the front, but the rough summary is that it slows down scheduling because of the communication between the controller component of your DRA driver and the scheduler. The other issue is that on the autoscaling side, because we have this flexibility in the API, the autoscaler doesn't have all the information it needs to perform the simulations it needs to run to know which resources to scale up. So there needs to be some API defined between the autoscaler and the device driver controller to get that information, and that will also introduce latency. Because you're introducing latency there, and the autoscaler does, I think, an order of magnitude more computations and communications with that driver, the concern is that that's going to slow things down and autoscaling is not going to be responsive. So those are the two main blockers at the moment.

So one is pod scheduling, and the other is autoscaling, right? Yes. Do we have a benchmark for it? What's the current number, like how many pods can we schedule in one second? I don't have that information. I think Patrick has done benchmarks. Patrick Ohly from Intel is the one who's been driving the DRA KEP, and he's done benchmarking, but I don't have numbers on that or what the target is. So maybe ask him; there's a DRA dev channel on the Kubernetes Slack, so you can just ask there. Thank you. And yeah, part of the discussions that came out of this conference is that we're going to at least start an informal working group, to begin with, with interested parties from the scheduler side, the autoscaler side, and the DRA developers, to try to at least understand these problems a bit better, then start addressing them and see what's required to move forward. Yeah, that's great. Thank you.

So this is a question for David. I was looking at your slides, and you showed there are different types of TPUs that you can allocate. Have you considered TPUs as a possible use case for DRA, for dynamically allocating different types of TPUs based on workloads? Yeah, it's actually a really good question and something we looked into. When we look at DRA, DRA is really good when the devices can change, right? With GPUs, you can use MIG, you can split them up and so forth. With TPUs today, when you bring up a TPU machine with some certain topology, that topology doesn't change over the lifetime of that node. And since the device itself is pretty static, it actually fits pretty well in the device plugin model. But that's the current state with TPUs; if things change, maybe the DRA approach, or something more dynamic where devices can repartition, will be more useful for it. Yeah, sorry.
The question is, can your NVIDIA operator or the device plugin do that time-slicing, MIG, and MPS, or do I have to go for a special tool? No, so yeah. Oh, OK. Thank you. Oh, OK. All right, thank you.

Hey, apologies if this question has already been asked, but are there thoughts on enabling cgroups for DRA? So can I have a pod with guaranteed QoS and maybe a best-effort one, and have cgroup enforcement across the two with DRA? I know there's a discussion around QoS classes, which is somewhat related to DRA, but I don't think DRA is trying to solve that problem necessarily. So I don't have the right answer there, but reach out to someone like Alexander Kanevskiy from Intel; he might have a better response. Thanks. Sorry, maybe David has more input there. Just one thing to add: with cgroups today, the cgroups are built into the kernel, and resources like CPU and memory are all very uniform, so it's very easy to make one implementation that works for them. But when you have these devices that all have different properties and so forth, it's hard to make a cgroup controller for them, and that's why we have these vendor-specific things. But there may be certain things that all devices have in common that cgroups could help with, so it's something interesting to look at, for sure. Thank you. Cool, thank you so much. Thank you.