Hello, everyone. Welcome to our session, Sharing is Caring: GPU Sharing and CDI in Device Plugins. My name is David Porter. I'm from Google. I work on the Google Kubernetes Engine team, on the node side and on our integration with accelerators. And this is Chris. Hey, I'm Chris Desiniotis. I'm a software engineer at NVIDIA. I work on our cloud native engineering team, which works on enabling GPUs in the container runtime ecosystem and in Kubernetes.

Cool. So to get started, let me set the landscape. As everyone is well aware during this KubeCon, devices are becoming increasingly important in Kubernetes. We have all these new workloads, things like inference, training, fine-tuning, et cetera, and all of them require devices. So it's important that devices and accelerators are well integrated in Kubernetes, just like other resources are today.

Here's what we're going to cover today. We're going to start by talking a little bit about how devices are integrated in Kubernetes today. Then we're going to go one level lower, to the container runtime level, and take a look at how container runtimes integrate with devices and how CDI is a new technology that's going to help enable that. Then we're going to go up the stack a little bit and look at resource management with some of these devices, focusing on GPUs and how GPU sharing can be enabled. And lastly, we're going to end with a little bit about the future of devices and device sharing in Kubernetes.

From the Kubernetes perspective, I think one of the big reasons Kubernetes is so successful, and everybody really loves running it, is that it's really good at resource management. Kubernetes allows you to take your whole fleet of nodes, with all different types of resources like CPU, memory, and disk space, and manage them under one control plane with one API. Devices are now becoming increasingly important in that resource management space, and we want to enable you to manage them just as effectively as you manage CPU, memory, and disk today. So what we're trying to do from the community perspective is build open standards and common resource models here, both to consume and interact with these devices and to share these resources. We're going to be talking about what exists today, which is the device plugin framework in Kubernetes and CDI. Those are the open standards and frameworks that allow you to interact with devices today. In the future, and something that's been discussed during this KubeCon, there's a new feature called DRA (Dynamic Resource Allocation). DRA is going to be a new API and a new way to manage these resources and share them more natively.

So let's talk a little bit about devices in Kubernetes. What is a device when we talk about devices here? We like to think about a device as an abstract concept, an abstract resource that the user wants to use for some specific purpose. It may be a physical hardware device, but it may not be. And usually these devices don't just come in isolation; there's a whole ecosystem around them. If it's a hardware device, you'll need some kernel drivers, and you'll need to access some device nodes to actually talk to that device.
And you'll probably need some libraries and utilities as well to actually be able to use that device in your application. This whole bundle of things is usually what we call the device.

In Kubernetes today, we have an API called the extended resources API. Extended resources allow Kubernetes to advertise resources on the node, just like CPU and memory, and we can advertise these devices with a count. So the devices are countable and have an integer count associated with them. The way devices are integrated in Kubernetes is with something called the device plugin framework. A device plugin is a plugin that the vendor of the device usually writes, and the framework has three main calls: a register call, an allocate call, and health checking. Usually, when the device plugin starts up, it does the registration. What this means is that the plugin starts up, talks to the kubelet, and advertises a name. For example, for GPUs, the name is nvidia.com/gpu. It also advertises a count: how many of that resource we have. Then, when a workload comes in, there's the allocation step. The allocation step is when we figure out which device to allocate for that workload and how to modify the container so the workload can actually access the device: things like mounting the device nodes, the libraries, et cetera. And lastly, the device plugin is responsible for health checking. If the device goes unhealthy, the device plugin reacts to that and updates the node accordingly.

So let's walk through an example of how device plugins actually work in Kubernetes today. Let's start with a node. Imagine we provisioned a node, either on-prem or in a cloud provider, with the GPUs already attached to it. The first thing you would want to do is set up the kernel GPU drivers. On a cloud provider, this is managed for you; there's also the NVIDIA GPU Operator, which installs these drivers for you. So you install the drivers, and then you install the device plugin, which again can be managed by a cloud provider or the operator. The device plugin starts up, and it's usually just a pod running on the node. The first thing the device plugin does is communicate with the GPU drivers and figure out how many GPUs there are. From there, the kubelet is going to see that there's a new device plugin and handle its registration. During the registration, the device plugin is basically going to say: I have a resource, the resource name is nvidia.com/gpu, and I have two of them. That's communicated back to the kubelet. When the kubelet sees that, it goes up to the API server on the control plane; the kubelet is responsible for updating the capacity field of the node object. It updates the capacity field, and in addition to your CPU and memory, you'll have a new entry there: nvidia.com/gpu: 2. At this point, all the components in the control plane can see that this node has this resource attached.

At some later time, a workload comes in. This is on the left side of the diagram here: you have a pod come in, and under the container resource requests, it'll request some number of that device (sketched below).
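For reference, a minimal sketch of such a pod spec; the pod name and image here are illustrative, not from the talk:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload                               # illustrative name
spec:
  containers:
  - name: cuda-app
    image: nvidia/cuda:12.3.2-base-ubuntu22.04     # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1   # extended resource advertised by the device plugin
```

Extended resources like nvidia.com/gpu are requested under limits, and the scheduler treats them as opaque integer counts; requests default to the same value as limits.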
So in this example, it's requesting nvidia.com/gpu: 1. The pod comes in and goes to the scheduler, and the scheduler sees, hey, this node has that resource available, so it goes ahead and does the scheduling. After the scheduling, the kubelet sees that it needs to start a new pod. At this point, the kubelet communicates with the device plugin and does that allocation step I mentioned earlier. So it figures out: which GPU do I give it, and how do I make that GPU accessible in the container? What modifications need to be made to the container to access that GPU? Once that's done, the kubelet goes to the container runtime, so this is containerd or CRI-O, and sends the container spec there for it to start. Containerd or CRI-O passes it along to the low-level container runtime, like runc. Runc goes ahead and actually starts the container, does all the mounting and so forth. And at that point, the workload can start and can actually access the device, the GPU. At this point, the workload is talking directly to the GPU; it's not talking to any of these Kubernetes components, like the device plugin, while it's actually running.

However, with some devices, the story is not so straightforward and simple. So Chris is going to talk a little bit about some of the extra complexities and how CDI is going to help address them.

Thanks, David. So yeah, for more complex devices, there's actually more that comes into the picture than what was shown on the previous slide. For NVIDIA GPUs, you not only need to install a driver and device plugin, you also need something called the NVIDIA Container Toolkit, which is a set of tools that ensure that GPU containers are set up with all the right things they need to access the GPU. The diagram on this slide is a sequence diagram, from the point when the kubelet allocates one or more GPUs for your container, all the way down to when your actual container is run. This is sort of what it looks like today. Historically, there are some NVIDIA-specific components that, under the hood, make sure your container has access to all of the right device nodes, driver libraries, and so forth, so that your CUDA application can run transparently. The major component doing all the heavy lifting is step nine, this prestart hook. It's a library that's invoked as a prestart hook before your container application's main process starts. It does a lot of injecting of device nodes, mounting of the driver libraries, and everything you need. And it's doing this without the knowledge of a container runtime like runc.

So there are some downsides here. One is that this is not declarative. All of the edits to the container's environment happen under the hood, and they're not encapsulated in the container's spec. The OCI runtime spec is the standard that runc uses to set up a container's environment, and runc has no idea that some of these things are actually being included in the container. Historically, that's led to some hard-to-debug issues and some inconsistencies when interacting with GPU containers. So ideally, we want to provide a standard to overcome this and do things in a declarative way. And there's a project called CDI which aims to solve these problems and standardize how we access GPUs and other accelerators and hardware in containers. So what is CDI?
CDI is the Container Device Interface. It's a CNCF-sponsored project under TAG Runtime, and like I said previously, it aims to standardize how third-party devices are made available to containers. The way it does this is with a declarative specification. In that specification, you, as a vendor, define what your device actually means, what access to the device means. That can be a list of device nodes, mounts, environment variables, and even container lifecycle hooks. Each one of these maps to a set of modifications that need to be made to a container spec. So everything is encapsulated declaratively, and it's encoded in the container spec. What happens is you request a CDI device, and a container runtime that actually understands CDI can take that device, read the spec for it, and modify the container spec, the OCI runtime spec, so that your container has access to the device. It's all done declaratively.

For naming devices, there's a taxonomy that CDI uses: vendor/class=name. For example, for NVIDIA GPUs, it's nvidia.com/gpu= followed by some ID. You can use arbitrary naming.

So like I said before, this is a declarative approach, and some of the main benefits are that low-level runtimes like runc are doing the heavy lifting for us. Runc is specialized in setting up the container's runtime environment, so we're leveraging it to actually provide access to accelerators and devices. And having support in containerd and CRI-O means that we no longer need some of the vendor-specific tooling to make this all work under the surface.

Here's an example of what a CDI spec looks like for an NVIDIA GPU (a reconstruction follows below). This is a system with only a single GPU, so under the devices section, we just have one entry, called gpu0, and we have a series of container edits. These are the edits that need to be made to your container spec so that you can get access to gpu0. There are device nodes, like the NVIDIA character device. There are mounts, like driver libraries such as libcuda; the spec says where on the host that library is and at what path it should be mounted in the container. And there are container lifecycle hooks. A simple one is updating the ld cache; this is a standard hook we like to use so that your main process can automatically discover the driver libraries when it runs.

So that's a brief introduction to what CDI is, what the goals of the project are, and some of the benefits. Where can you use it today? It's actually supported in Kubernetes: a CDI devices field was added to the device plugin API, so that device plugins, when they return an allocate response, can provide CDI device names in that response struct instead of the other information. The CRI was also extended to contain a CDI devices field in the ContainerConfig message, starting in Kubernetes 1.27. And like I mentioned earlier, containerd and CRI-O already support CDI, so they can understand the specification and modify a container spec for us.

This has all been discussed in the context of Kubernetes device plugins, but CDI is actually a standard that's used and is meaningful outside of the device plugin use case. For DRA, CDI is sort of the basis for defining and requesting resources. It's also useful outside of Kubernetes.
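A hedged reconstruction of the single-GPU spec described above might look like this; the paths, driver version, spec version, and hook arguments are illustrative, but the overall shape follows the CDI specification:

```yaml
cdiVersion: "0.6.0"
kind: nvidia.com/gpu
devices:
- name: gpu0
  containerEdits:
    deviceNodes:
    - path: /dev/nvidia0          # character device for GPU 0
containerEdits:                   # edits applied for any device from this spec
  deviceNodes:
  - path: /dev/nvidiactl          # control device shared by all GPUs
  mounts:
  - hostPath: /usr/lib/x86_64-linux-gnu/libcuda.so.545.23.08       # illustrative driver version
    containerPath: /usr/lib/x86_64-linux-gnu/libcuda.so.545.23.08
    options: ["ro", "nosuid", "nodev", "bind"]
  hooks:
  - hookName: createContainer
    path: /usr/bin/nvidia-ctk
    args: ["nvidia-ctk", "hook", "update-ldcache"]   # refresh the ld cache so libcuda is discoverable
```

When a CDI-aware runtime sees a request for nvidia.com/gpu=gpu0, it merges these edits into the OCI runtime spec before handing it to runc.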
So if you're launching containers interactively with Docker or Podman, you can use CDI there. And for NVIDIA GPUs with Podman, CDI is actually the recommended way today to access GPUs. For HPC containers, there's been some support for CDI in Singularity, which is a popular HPC runtime. And we can think of even other use cases: for networking devices, for example, we can imagine a future of using CDI there as well.

So going back to my sequence diagram from a few slides ago, this is how it would look with CDI, if our device plugin were leveraging CDI. You'll notice that in the container runtime steps, there are no NVIDIA-specific components. The assumption here is that before running applications, we have generated a CDI spec for all of the GPUs on the system. So either some tooling is invoked, or the device plugin itself can generate these specs, which describe all the GPUs in the system and what it means to access them. Those are stored as files on the system, at a standard path that runtimes know to look at. So that's done beforehand. At runtime, what happens is containerd or CRI-O, in steps five through seven, reads the CDI spec file, and for all the GPUs requested for the container, it updates the container spec so that it has the right bits to access those GPUs. It forwards that container spec to runc; runc creates your container with that specification, and your container runs with access to the devices that were requested. So this is all declarative. There's nothing vendor-specific here, so this diagram is applicable not just to NVIDIA GPUs, but to any other accelerator or device. It's portable across container runtimes, and we're not using these vendor-specific hooks that have a lot of drawbacks.

Okay, so we went into detail about device plugins and the lower-level details about container runtimes and some standards that are being developed. I think we're going to switch gears. I'm going to hand it back to David, and we're going to talk about how we can leverage device plugins to more effectively share GPUs.

Thanks, Chris. So let's shift gears a little bit and go one level up. With these devices, it's becoming increasingly important to find ways to maximize their utilization. Especially with GPUs: they're very expensive, they're limited, and everybody right now is trying to figure out how to squeeze the most from them. To do that, what we're really trying to do is find ways to take the physical devices we already have and break them up, partition them, so we can run multiple workloads simultaneously on them. We're going to look specifically at GPUs and the ways that we can share GPUs today. There are three main ways to do that. The first one is called time slicing. The second one is called Multi-Instance GPU (MIG). And the third one, which is a newer one, is called CUDA Multi-Process Service (MPS). We'll go into each one, how it works, and the trade-offs of when you might want to use which approach.

So we'll start with the simplest one, which is time slicing. Normally, we just have a single workload running on the GPU, a one-to-one mapping, really simple. With time slicing, we can, first of all, define how many workloads we want to be able to run on the GPU at a time.
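The command line on the slide isn't visible in the transcript, but based on GKE's documented time-sharing flags, it would look roughly like this; the pool, cluster, zone, and GPU type are illustrative:

```
gcloud container node-pools create timeshare-pool \
  --cluster=demo-cluster --zone=us-central1-a \
  --machine-type=a2-highgpu-1g \
  --accelerator=type=nvidia-tesla-a100,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=10
```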
So you can see, at the bottom of that command line, there's a parameter, max-shared-clients-per-gpu, set to 10. In this case, we're basically saying we're allowing 10 workloads to run on the GPU at a time. With time slicing, how it works is we basically have context switching: a single workload runs, then there's a context switch, and the second workload runs, and then it switches back. That context switch is a little bit expensive, but it allows us to actually run multiple workloads at a time. The way it works from the device plugin perspective is that instead of advertising the physical number of devices, which is what nvidia.com/gpu represented before, we're now advertising the virtual number of devices, the number of clients that can be used.

Another thing about time slicing that's worth mentioning: each workload gets its own address space, but there are no memory limits enforced. So one workload can consume all the memory, and that can disrupt other workloads. The other thing worth mentioning here is that with time slicing, only one workload actually runs at any single moment. So there's concurrency here, but no parallelism: at any instant, there's only one workload actually running on the GPU.

So that's available today, and you can use time slicing. And with the NVIDIA GPU device plugin, the way to enable time slicing is pretty similar: you specify how many clients you want to share the GPU; here it's 10. And now, under the nvidia.com/gpu capacity that's reported, instead of reporting one GPU, it's reporting 10 GPUs. So the meaning here has changed, from the physical number of GPUs to the number of clients that can use the GPU.

So when is time slicing useful? One scenario where I think it can be very useful is interactive, more exploratory workloads. Workloads like Jupyter notebooks, if you're familiar, where you can do data science experiments and machine learning explorations, are great candidates because they're interactive. Sometimes they need a lot of resources, sometimes they don't, and since they're mostly for exploration, we don't have super strong guarantees we need to provide.

So let's take a quick look at using time slicing in action. What we're going to do here is first provision a cluster. So we provision a cluster here on GKE on 1.29. After that, I'm going to set up a node pool, which is a set of nodes, and I'm going to configure those nodes to be in time-sharing mode. So I'm going to specify the GPU time-sharing strategy, and I'm going to say that I want three clients doing time sharing per GPU. I go ahead and create that node pool, and it creates the capacity in the node. I'm just creating one node here for demonstration purposes. After that's there, I'm going to use kubectl to see how many nodes I have. So I'm running kubectl get nodes, and I have two nodes available here: one of them is just a CPU node, and one is the one we just provisioned with the GPUs. I'm going to set an environment variable, GPU_NODE, just so we can play around with it. And the first thing I'm going to do is look at the allocatable that's being reported from the API server. You can see here we have nvidia.com/gpu reported, and we have three GPUs reported.
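Roughly what that looks like; this is a sketch of a kubectl describe node excerpt, not the actual demo output:

```yaml
# kubectl describe node "$GPU_NODE"   (excerpt)
Capacity:
  nvidia.com/gpu:  3
Allocatable:
  nvidia.com/gpu:  3   # one physical GPU, advertised as 3 time-shared clients
```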
So this node only has one GPU; I provisioned a single A100 on it. It has only one physical GPU, but due to the time-sharing configuration, it's actually reporting that we have three GPUs here.

So let's go ahead and try running a workload. The first thing I'm going to do is label the node with notebook-node=true, and I'm going to use that as the node selector for the pods I'm going to deploy. For the purposes of the demo, imagine we're going to deploy two Jupyter notebooks, for two different data scientists or researchers who want to play around and use these notebooks. So I created two pods here. Notebook pod one: in the node selector, you can see I'm specifying that I want an A100 GPU and that I'm going to use time sharing; this is the one where I set up the three clients. From the resources perspective, I'm just specifying that I need one GPU, and for the actual workload, I'm just specifying the Jupyter notebook to start up. So that's my first pod, and then I have a second pod, which is basically the same thing, just a different pod we can play around with; the only difference is I'm running it on different ports so we can explore it separately.

So we're going to deploy both of those pods, notebook pod one and notebook pod two. Cool, and then we can see those pods started up and they're running, awesome. So now let's try to access those notebooks. I'm going to use kubectl port-forward, and I'm going to forward them on two different ports. I'll start with the first one, go to that port-forward, and look at the Jupyter notebook; you can see it started here on port 8888, and the other one I port-forwarded on 8889, and you can see it's also running here. So awesome, they're both running, they're using time sharing, but we're not actually using anything on the GPU yet.

So let's actually run some GPU workload to try out the time sharing. What I'm going to do here is install some libraries, and then, first of all, use PyTorch and access the GPUs from PyTorch. PyTorch is able to access the GPUs; I'm just printing the GPU model here. So now let's try to actually run a workload. I'm going to use Hugging Face, so I just log in with my Hugging Face credentials so I can download a model, and I'm going to try the Gemma LLM. So I'm using the open Gemma model, the small two-billion-parameter one, and I'm going to try to do inference on it. I type in "Kubernetes is the best" and run inference on that model. So I run it, it loads the model onto the GPU, and we can see the LLM output at the bottom: "Kubernetes is the best way to deploy and manage containerized applications." Cool, so it makes sense.

So now you can imagine that at some later time, some other user wants to use that other notebook, right? So we switch over to the other notebook, a different user, and let's say they want to run a different workload, maybe try a different model. So very similar: I'm going to try the same thing, but this user is going to try a different model. They're going to try the larger Gemma seven-billion-parameter model.
So, same type of thing, but a different model, and we're going to try a different input text: type in "Kubernetes is great" and see what the response is. It says Kubernetes is great, and it's hard, it's hard to get started. So hopefully this talk helps a little bit with that. So there we go, we can see we're running both of those things.

If we look at the metrics that are reported for those GPUs, we can see the VRAM usage here. You can see it spiked to around 20 gigs and then jumped to 40. So when the first notebook user started playing around with the first model, it allocated all that VRAM, and then the second one started allocating more memory for the second model, and it jumped up to 40. And just to verify that, we can log into the node. I SSH into the node and run nvidia-smi, and you can see from nvidia-smi that it shows both Python processes running there, each having allocated that amount of GPU memory, which corresponds to the metrics we saw earlier. So there you go, that's a demo of how you might use time slicing, and you can try that out today. So I'm going to hand it back to Chris to talk a little bit about some of the other GPU sharing strategies.

Cool. Yeah, thanks for the demo, David. That was really great. There we go. So there are two other sharing strategies we want to cover in this talk. The next one is CUDA MPS. This takes a step beyond time slicing, in the sense that it allows you to logically partition the GPU in terms of memory and compute. It's all done in software, just like time slicing. So you can have multiple clients using the GPU at the same time, concurrently, but it also allows you some level of parallelism. You can have clients A, B, and C running GPU kernels on the same GPU at the same time, and it's all facilitated by the MPS server process that's running. It sits in between your actual clients and the instructions running on the GPU. So, in software, it can enforce memory limits for each GPU client, and compute limits in terms of active thread percentage. So it can lead to better utilization of your GPU when compared with time slicing.

One implementation detail here is that we have one server that facilitates the GPU clients running instructions on the GPU, so there's one shared CUDA context. That's good in terms of context switching, because there isn't as much overhead between different clients running on the same GPU, compared with time slicing, where you're doing a full context switch between two different CUDA contexts. But at the same time, in terms of fault domains, this is a shared fault domain: if client A triggers some sort of fatal error on the GPU, it will affect clients B and C.

In the NVIDIA GPU device plugin, you can configure MPS like this (sketched below). It's very similar to the time slicing configuration: you just specify MPS instead of time slicing as the sharing strategy, and you configure how many replicas, that is, how many concurrent processes you want to support running at the same time on the GPU. And if we describe the node, we'll see, similar to time sharing, that your one physical GPU is now advertised as being 10. In terms of support, we have a new release of the device plugin coming out very soon, the 0.15.0 release, and that will have official support for MPS.
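Based on the configuration format in the NVIDIA device plugin's documentation, the config described here would look roughly like this; a sketch against the 0.15.0 release candidates:

```yaml
version: v1
sharing:
  mps:                      # use `timeSlicing` here instead for the time slicing strategy
    resources:
    - name: nvidia.com/gpu
      replicas: 10          # one physical GPU advertised as 10 nvidia.com/gpu
```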
We have some release candidates out already that people are trying MPS out with.

The other sharing strategy is MIG, Multi-Instance GPU. It's a hardware feature of some of the later-generation NVIDIA GPUs that allows you to take a full GPU and subdivide it into multiple of what we call GPU instances, and each of these has dedicated compute and memory resources. You can have up to seven of these slices on your GPU, and there are some standard profile names; if you look at the MIG documentation, you can partition your GPU into those fixed-size slices. On the bottom, we have an example of enabling MIG on GKE. On your node pool with A100s, you can specify the partition size of your MIG instances, in this case 1g.5gb, which is the smallest slice. So we can have seven of these, each with one dedicated compute slice and five gigabytes of GPU memory.

So how do we enable MIG in the device plugin? The assumption here is that an admin, or some automation, has already gone in, enabled MIG on the GPU on the node, and sliced the GPU into MIG instances. That has to be done beforehand, so there is some static configuration that has to happen: you have to know what size profiles you want and configure the GPU that way. The configuration of the device plugin itself is really simple. The device plugin will enumerate all of your MIG instances, so if you have seven of the small instance type, then your node will have seven nvidia.com/gpu resources allocatable. Instead of configuring MIG yourself, you can use the GPU Operator: we have a component called MIG Manager that helps automate enabling MIG and configuring MIG instances. So I'll hand it back to David, who's going to do a little comparison and summarize our talk.

Cool. So now that you've heard about those different resource sharing strategies, you might be asking yourself: how do I know which one to use, and what are the trade-offs between them? To start off, the simplest one is the one we started with, time slicing. The big benefit of time slicing is that it's very dynamic. You can add workloads on the fly, you can remove workloads on the fly; you can start out with just one workload, in which case it's identical to not using resource sharing, and as you add more workloads, you get more workloads sharing that GPU. The negative with time slicing is that you can start to introduce some jitter, et cetera, because of that context-switching overhead, and you don't have strong resource management guarantees, because there are no memory limits available.

That's what brings us to MPS. MPS can be thought of as an improvement over time slicing, because you actually do get stronger guarantees there. You get performance isolation: you can specify how much compute each application gets, and you can specify memory limits, so you can actually enforce memory limits per application.

And then MIG is the one that gives you the strongest guarantees, because it actually partitions things at the hardware level. That means you have separate memory at the hardware level, and you have separate performance guarantees, because it's actually split in hardware. The negative with MIG, the downside, is that you need to figure out how to partition your GPU ahead of time.
And it's not dynamic. So before any workloads are running, you have to go in and understand: okay, depending on my workload sizes, which MIG partition sizes make sense for my application? It works well, however, for multi-tenant scenarios, for example, where you really do need strong security and other isolation guarantees between workloads.

So, in summary, here's what we talked about today. We started with the Kubernetes device plugin framework; that's the existing framework in the ecosystem to advertise and expose devices like GPUs, TPUs, and FPGAs. We also dove one level lower, to the container runtime level, and gave you some information about CDI, which is the new standard for how a device gets integrated at the container runtime level. Then we looked at GPU sharing: GPU sharing is a great way to improve the utilization of your resources, and we have different strategies there with their different trade-offs. And I think the big message we're trying to send is that, as a community, what we're trying to do is extend this resource model, make it really vendor agnostic, make it standard, and make it natively supported by Kubernetes with these devices. So what we're doing in the future is trying to extend this model. This resource sharing is not natively supported today, and it's not very declarative, because we had to overload the meaning of what a device means depending on the resource sharing strategy. So we're trying to come up with new APIs to make it more declarative, and with DRA, we're hoping we'll be able to express some of these resource sharing mechanisms more natively. It'd be really great to get your feedback on what the challenges are today with the device plugin and CDI, and what you would like in the future in Kubernetes; your feedback is really appreciated. And a couple of things I want to shout out: there are some related talks, both at this KubeCon, some of which have already happened, and at past KubeCons as well, that go into more detail on devices in Kubernetes as well as GPU sharing; those are some great resources. So thank you so much. Any questions? Cool, I think there's a mic.

Can we expect at some point in the future that you provide memory limits or so for time slicing? It's technically possible, but it's not implemented in the driver. Yeah, I don't think that's currently planned. I think what some folks do is enforce those memory limits at the application level; a lot of applications let you enforce memory limits at the application level. But I don't think there are currently plans to support that natively with time slicing, and the recommendation would be to use MPS. Yeah, thank you.

Thank you for the talk. So my question would be: can you combine MPS and time slicing, either today or tomorrow? Saying that I want to use MPS until the GPU is full, basically, and then start time slicing for additional workloads. I guess theoretically you could, but in practice I don't think we've investigated or documented how to do that for our cloud native users. Okay, so it's theoretically possible today, or? I think so. Thank you. You can also combine, for example, MIG and time slicing, right? So you can do time slicing on MIG partitions as well, for example. Yeah, you would need multiple MPS server processes, right, and then they would time slice.
Yeah, I think it's theoretically possible, yeah.

So, the example you gave: I guess fractionalization is now becoming a standard. Can I use those features in the new operator without DRA? I mean, this is enabling DRA, but I can still do these things now, right? Yeah, so everything we showed is not using DRA; this is using the device plugin API. All the sharing strategies we showed you are available and supported in the operator and the components that we deploy. And the new operator is about to be released with support for... I think you mentioned earlier that it's a new version of that operator? Yeah, so we're releasing a new version of the device plugin, and then the operator will pick that up next month sometime, hopefully, to support MPS. Okay, so if we all start learning to use this way of allocating fractions, then in a year's time or so, DRA will be the new way instead? So DRA will potentially provide a new API to actually consume these in your workload, so your pod spec may change, but the underlying technology of those different resource sharing strategies is not specific to DRA. DRA is just the declarative API representation of how we want to consume those. Okay, yeah, thanks.

Thanks for the presentation. So my question is: is it possible to share something about how GKE supports time slicing? What the mechanism is, how it's implemented? How GKE supports time slicing? Yeah, so the way GKE supports time slicing is the regular way; it's the same as with the NVIDIA GPU device plugin. You just specify how many clients can use the GPU at a time, and then it'll basically just schedule those workloads there, and the device plugin will advertise the number of clients that you specify. So there's nothing special done on GKE versus the NVIDIA device plugin; the underlying mechanism is the same. Okay, thank you.

Hi, great talk. At the moment, we use time slicing for a combination of GL workloads and CUDA workloads. Is that something we can also expect from MPS to...? I think it should work transparently, yeah. It's just a different way of sharing the GPU; I don't think the type of application you run necessarily matters. Okay, yeah, because you mentioned the contexts, and that sounded very CUDA-specific, or...? I actually don't know if there are implications there from using the same CUDA context. I actually don't know. Okay, yeah. My understanding is the goal is for it to be compatible with most workloads out of the box, right? So the expectation is that you shouldn't need to change your workloads significantly to pick up MPS from time slicing, yeah. And the GL workloads, will they remain supported? Because we're noticing that the support for the container-based images, for example for GL, has been lacking for the last few years. So can we expect that to remain supported by the container runtimes? Are you asking about Docker? Like, what specifically? Yeah, exactly, yeah. I'm not fully aware, honestly, of the support there. Yeah, me either, unfortunately. We'll have to check. Thanks.

Hello, thank you very much for the great talk. One question on time slicing: is there, let's say, a soft limit, depending on the cluster size and how many compute nodes you have, where it tanks the performance in a way? Yeah, so for time slicing, it's all based on a single node. So the number of nodes in your cluster and so forth, I don't think those are actually important factors.
It's the number of workloads that are running concurrently on a single node. So I think the main factors are how many GPUs you actually physically have on that node and how many clients you have connected to them: the more clients you have, the less runtime each workload gets. So you can't use time slicing together with RDMA between more nodes, like RDMA-connected nodes, to have a bigger pool of time-sliced GPUs? I mean, I don't see why you couldn't use time slicing with other technology. When the workload is actually running on the GPU, it's the only thing running, so you could use RDMA or any other techniques you're using today. Thank you very much.

Yeah, thanks for the talk. I have one question about time slicing. It's currently the only solution that might have problems with workloads fighting for GPU consumption, but it's more agile than all the others. Is there a way currently to natively achieve process prioritization? So that you can say, okay, these processes here come first and then these are going to wait, or is that not currently possible? Yeah, I don't think that's possible today. I think you can customize the time quota that's given to each application when it's time-sliced; I think there's a configuration option in the NVIDIA driver where you can, say, shorten the time slice window. But I don't think there's prioritization built in there, because at any single moment there's only one application running, and they all get the same share. Okay, thank you.

All right, if there are no other questions, thank you, everyone. Yeah, thank you.