I'm Shiva Merla. I'm from the cloud-native engineering team at NVIDIA. I've been focusing on the GPU Operator for the last few years, mainly working on GPU orchestration in Kubernetes. We've had a long journey using the operator pattern to make it easy to consume GPUs in Kubernetes, so today we're going to talk about how we went through this approach and the learnings we had along the way.

My name is Kevin Klues. I'm on the same team as Shiva at NVIDIA. The way I always pitch our team is that we do everything necessary to enable GPU support in containers and in Kubernetes, and then we build all the tooling around that to make using GPUs in this environment easier. The operator is our way to package all of these things together and make it that much easier for you to take advantage of them on Kubernetes.

Thanks, Kevin. This is the outline for the talk today. We're going to talk about why GPUs have become so ubiquitous and why GPUs in Kubernetes. We'll walk through what the typical GPU software stack looks like and what the main operational pain points are when enabling that stack. Then we'll delve into the operator pattern, how we have implemented the GPU Operator, and some of its technical details. We'll end with a demo and some of the lessons we learned through this journey.

So why are GPUs so popular? It's no surprise, given their massive computational power and the giant leap in computational capacity over the last decade; they have become ubiquitous in Kubernetes for running AI/ML jobs, deep learning, and even scientific research. Recently we announced Blackwell, which takes another giant leap in computational capability. Kubernetes, on the other hand, has over the last few years become the de facto standard for running these workloads — AI/ML, deep learning, scientific computing, data processing — because of its scalability, its resiliency, and the way it lets you manage applications seamlessly.

So what do we need to enable GPUs in Kubernetes? Typically we start with a device driver: any kind of GPU needs a device driver installed on the host. We also need hooks into the container runtime, because the runtime doesn't always have native support for GPUs, so custom runtime handlers have to be installed on the host. We have the standard Kubernetes device plugin: Kubernetes provides a device plugin framework to discover GPUs and advertise them to the kubelet. And it's common to have GPU monitoring software, because we want to track GPU performance and get alerts when GPUs go faulty. That's the typical GPU software stack for any kind of GPU.

So what pain points do operational teams see today when enabling this stack? At a high level, the first is a heterogeneous node software stack.
I'll talk about the challenges of managing this stack as we add newer GPUs into the cluster, or as newer driver versions are released over time, and how that becomes a pain point. There's also the driver configuration itself: most people install drivers directly on the OS and then have to configure certain things on the driver, and I'll walk through why that is painful today. The most common thing we keep hearing at this conference is how to use GPUs efficiently — how to configure GPUs to be shared among multiple workloads. Again, that configuration is per node and per GPU, and it's something operational teams have to manage over time. As I mentioned, we also apply container runtime configuration: certain hooks have to be placed on the host depending on the runtime type, the configuration has to be applied, and the runtime daemon has to be reloaded. And finally, monitoring GPU health: there is no robust solution today to monitor GPU health and take action. Whenever a GPU becomes unhealthy, how do we detect that and make sure nothing gets scheduled on that node?

To go deeper into the heterogeneous node software stack: typically, on day zero we install the operating system, the GPU software stack I talked about, and the standard Kubernetes stack, and the initial setup works fine. Then we keep adding nodes, maybe with newer GPUs and newer driver versions. These versions are constantly changing, whether for performance improvements or because of CVEs, so you have to keep updating drivers over time, and eventually you run into interoperability issues — with the container runtime hooks, or between the driver and the CUDA stack on the node. There is also the CPU versus GPU stack problem: operational teams end up maintaining two different stacks depending on whether it's a GPU node or a CPU node, because they build the drivers into the OS image itself, so they end up with two different OS images. As the cluster grows in size, it becomes clear how challenging this is to manage and configure across nodes.

When it comes to the driver install, the driver has multiple components: kernel-mode components, user-space components, several user-space services to launch, and CLI utilities to configure parameters on the GPUs. On a node-by-node basis, depending on the driver version or the type of GPUs, the SRE teams may have configured things differently, and it becomes challenging to keep track of all of this across the cluster.

Another common issue is per-node GPU configuration: how we partition the GPUs on each node. You need some sort of declarative mechanism to partition GPUs. When this is done with standalone tools, you have to keep track of which config was applied on each node and which settings were applied on each GPU, as in the rough sketch below.
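As a rough illustration (not from the talk), this is the sort of imperative, per-node setup an admin ends up scripting by hand with standalone tools; the profile ID 19 corresponds to 1g.5gb on an A100-40GB, and other GPUs use different IDs:

```sh
# Enable MIG mode on GPU 0 (takes effect after a GPU reset or node reboot)
sudo nvidia-smi -i 0 -mig 1

# Carve GPU 0 into seven 1g.5gb GPU instances and create the default compute instances (-C)
sudo nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C

# Inspect what was created
sudo nvidia-smi mig -lgi
```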
And some of these settings are not persistent: every time the node reboots, some of them are lost, so teams end up needing init scripts or systemd services to reapply the same changes on every reboot.

When it comes to runtime configuration, as I said, we have custom runtime hooks in place. We are standardizing towards CDI, but we're not completely there yet, so we still have to install certain hooks on the host based on the runtime type. For containerd, for example, we have to modify the config file, apply the runtime class configuration, and reload the daemon. The same goes for the Docker daemon, and the same for CRI-O. We may also need to modify the default runtime itself: if users want to run mostly GPU jobs, they can set nvidia as the default runtime, in which case containers that don't request a GPU simply fall back to the underlying runtime behavior. Reloading the runtime daemon is also a challenge, because every time we apply these changes we have to reload it, and with every new version we might add new configuration to the runtime class. So teams have to keep track of how these changes were made across the cluster and keep monitoring them.

Another pain point is GPU health. We have the DCGM exporter, with a standalone Helm chart to install it, but there is no lifecycle management: operational teams have to apply changes themselves, set up the ServiceMonitor, and make sure it's configured correctly with Prometheus. It's all a manual process today. There's also no robust solution for handling GPU errors: when GPU errors occur, how do we propagate them to Kubernetes so we can avoid scheduling jobs onto those nodes? The Kubernetes device plugin has only very basic health checking that keeps faulty GPUs from being allocated; on the Kubernetes side there's no easy way to orchestrate this and say you can't schedule any jobs on this node.

With that, let's get into Kubernetes operators. Initially we had various ways to deploy the GPU stack: standalone Helm charts, OS-native packaging, it was everywhere. A few years back we started looking at creating a unified solution with a common API to configure all of this in Kubernetes, and operators were the obvious choice. They give you a common controller pattern where you declaratively define the config you need, and the controller keeps applying it until the actual state matches the desired state. The control-loop paradigm really worked well for us: we were able to bring all the software together and deploy it with the operator. In recent years we had seen many cases with interop issues between these components, and configuring everything through a single API makes it easy to keep them consistent, so they stay consistent and don't break on upgrades.

How are operators built? They're built using common tools like Kubebuilder and the Operator Framework. With just a handful of commands we can build an operator; these tools help with generating the initial scaffolding.
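As a sketch of that scaffolding flow (the module path and kind below are illustrative, not how the actual GPU Operator repository was bootstrapped):

```sh
# Scaffold a new operator project
kubebuilder init --domain nvidia.com --repo github.com/example/gpu-operator

# Generate an API type and a controller skeleton for a ClusterPolicy-style resource
kubebuilder create api --group nvidia --version v1 --kind ClusterPolicy --resource --controller

# Regenerate CRD manifests and install them into the cluster
make manifests install
```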
We define an API, and the tooling automatically generates all the manifests required to deploy it in Kubernetes. For OpenShift there is also the Operator Lifecycle Manager (OLM), and using the Operator Framework we can generate all the manifests needed to deploy there as well.

So why is this useful? Today we're going to go deeper into the GPU Operator and how we used this pattern to solve these issues. To give an overview: we wanted a unified API to configure everything in Kubernetes. It gives you a single pane of glass to configure and manage the lifecycle of all the components we talked about, starting with the NVIDIA driver, the container runtime, the device plugin, and the monitoring software. We also have more advanced components like the MIG manager deployed through the GPU Operator. It's a single API to configure all of these things easily.

How do we install it in Kubernetes? With the standard packaging tooling, Helm: with a single command you can deploy the GPU Operator. If you're using OLM, you can deploy it that way as well. That spins up the GPU Operator pods, and then you create a custom resource that drives the deployment of all the operands.

At a high level, this is the state machine. Soon after install, the GPU Operator pod comes up. We have a CRD called ClusterPolicy and another CRD called NVIDIADriver; using these APIs, users define the configuration they need for the GPU stack. The operator comes up and fetches the config from the ClusterPolicy and NVIDIADriver resources, but nothing is deployed yet, because we have a dependency on a bootstrap operator called NFD. NFD is the Node Feature Discovery operator, which is open source; its main job is to discover the hardware features of each node so other applications can detect them and run where suitable. NFD lets us seamlessly identify GPU nodes. It has a NodeFeature CRD, and based on it the NFD pods label the node: this node has a GPU, this is its architecture, this is its kernel version, and many other properties of the server that applications can use for scheduling.

Once the NFD labeling is done, we can start the bootstrap. The GPU Operator deploys the containerized driver first. We currently support installing drivers through a runfile installer, which dynamically compiles and installs the kernel modules, and we also have pre-compiled packages in the driver containers. Depending on the installation type, it takes about three to five minutes to bootstrap the driver on the node. Once the driver is loaded and all the libraries are installed, we bring up the container toolkit, which is our core service for injecting GPUs into application pods. Until these core services have come up, nothing else can really run, so all the other pods wait on them. If you install the GPU Operator and see most of the pods stuck in the Init state, it's because they're waiting for the driver install to complete.
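To make the single-API idea concrete, here is a heavily abbreviated ClusterPolicy sketch. Field names follow the nvidia.com/v1 ClusterPolicy CRD, but in practice the Helm chart renders a much fuller version of this resource for you, so treat it as indicative rather than something to apply verbatim:

```sh
cat <<'EOF' | kubectl apply -f -
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    enabled: true        # containerized driver managed by the operator
  toolkit:
    enabled: true        # NVIDIA Container Toolkit installer
  devicePlugin:
    enabled: true        # advertises GPUs to the kubelet
  dcgmExporter:
    enabled: true        # GPU metrics for Prometheus
  migManager:
    enabled: true        # applies MIG layouts selected via node labels
EOF
```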
Coming back to the bootstrap sequence: these components order themselves well. Once the driver installation is done and the container toolkit setup is complete, we bring up the rest of the stack: the device plugin, GPU Feature Discovery, the MIG manager, and the rest of the monitoring stack. We also have validation built into the operator itself: at every stage we run CUDA validation and device plugin validation to make sure the stack is fully functional, and finally we mark the node as ready.

A bit more on NFD itself. We took this dependency on node labeling because we didn't want to build yet another controller just to manage node labels. NFD has APIs like NodeFeature and NodeFeatureRule, where users can define custom rules: if the server has these properties, this is the label I want on the node. It also produces certain standard labels, for example PCI labels; in our case it detects an NVIDIA GPU and labels the node with the NVIDIA PCI vendor ID.

Now, on driver management. The first thing we did was containerize the driver. We had this for a long time — we were using Docker containers mostly just in testing environments. Later, with the GPU Operator, we built the whole set of lifecycle components around the driver itself. We install everything through the container and bind-mount it onto a host path so the rest of the components can use it, and we built out the full lifecycle of the driver. We recently built an upgrade controller, and we're making progress towards a common upgrade controller shared between multiple operators: we have the Network Operator, which does MOFED driver installs, and we do GPU driver installs, and there's synchronization between them to manage these things. Typically, everything is installed into the container; once the container is up you can exec into it and run all the commands you would normally run on a host with the driver installed. Please refer to the talk my colleague gave yesterday if you want to learn more about how driver installation and upgrades work.

To give some perspective on how we've been adding features to the driver container over the past two to three years: we added significant functionality to make sure we cover everything supported by NVIDIA drivers. Initially, in 2021, we had basic driver installs — just compile and load the modules and bind-mount them onto the host, nothing else. Then we added the lifecycle components and NVIDIA vGPU driver support. Later, in 2022, we added GPUDirect RDMA support, loading nvidia-peermem, the GDS (GPUDirect Storage) drivers, and GDRCopy, so we have complete support for GPUDirect technologies as well. Other notable ones are the upgrade controller and the advanced upgrade controller; again, you can listen to yesterday's talk for those. Recently the focus has been on making it easy to bootstrap the node, so we've been looking into pre-compiled drivers and how to use them across all the operating systems we support. Canonical builds pre-compiled packages for NVIDIA and publishes them, so at least for Ubuntu we started publishing pre-compiled driver containers to NGC, our NVIDIA container registry.
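A sketch of how a pre-compiled driver can be requested through the NVIDIADriver API mentioned earlier; the field names follow the v1alpha1 CRD, but the version string and the node-selector label are placeholders, not values from the talk:

```sh
cat <<'EOF' | kubectl apply -f -
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: precompiled-driver
spec:
  driverType: gpu
  usePrecompiled: true                      # pull a pre-built image instead of compiling on the node
  version: "535.161.08"                     # placeholder driver branch/version
  nodeSelector:
    example.com/driver-pool: precompiled    # hypothetical label selecting a pool of nodes
EOF
```

Because each resource carries its own node selector, several NVIDIADriver instances with different versions can coexist in one cluster, which is one way the heterogeneous-driver case described next can be expressed.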
For certain kernels and driver versions, we build and publish these container images to NGC on a daily basis. We're expanding pre-compiled drivers to other operating systems as well: we're looking at cloud-native, CSP-native operating systems, and at RHEL and CoreOS, for these pre-compiled packages. There's also heterogeneous drivers, which is a common use case: I want to run different driver versions in the same cluster, different kernels or operating system versions in the same cluster, or different driver types in the same cluster. We support that as well. Currently, as I mentioned, the focus has been on having pre-compiled drivers everywhere and removing the dynamic dependencies involved in installing drivers — no pulling packages or building on the node — and on supporting CSP-native operating systems. With this, I'll hand over to Kevin to talk about how GPU configuration is done on each node.

Yeah, so in addition to all the great features the GPU Operator gives you in terms of driver management and so on, one of the big things you're able to do with it is configure how you want the individual GPUs set up by the time your workloads run on them. In particular, you can set up the different sharing strategies I talked about in the keynote on Wednesday: setting up a set of MIG partitions on the GPU, setting up an MPS server to spatially partition your GPU in various ways, or setting up time slicing on these GPUs. One thing to note is that, at least with the current GPU Operator and the Kubernetes APIs available to it, only the admin has the ability to do this. You have to decide a priori how you want to divide your GPUs into MIG devices and how you want sharing set up via time slicing or MPS; you have to drain the nodes and apply these configuration settings. Once those are in place, a user workload can come along and consume them based on whatever the admin has decided is the right setup for that individual node. In the future, once we have dynamic resource allocation (DRA) support, it won't be admin-driven anymore: at the time you create a claim referencing one of your GPUs, you can decide how you want sharing set up and which configuration parameters you want on the GPU you're given access to. It moves the ability to define sharing settings from the admin, a priori, to just-in-time usage of the GPU by your workload.

But in terms of how you use this today, the diagrams on the right show how an admin can configure a set of GPUs on a node with MIG devices. The one on the top shows how you can divide all of the GPUs into what's called a 1g.5gb device; you can get seven of these on a single GPU and advertise them as nvidia.com/gpu. When you request that resource, you get one of these MIG devices rather than a full GPU. This is what we call the single strategy, because the resource name used to advertise them is the same as what you would have for a full GPU.
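A minimal sketch of how an admin wires this up today: the MIG layout is selected with a node label that the MIG manager acts on (all-1g.5gb is one of the profiles in the default mig-parted config), while time slicing and MPS go through a device-plugin config. The namespace, ConfigMap name, config names, and node names below are assumptions for illustration, and the ConfigMap is only picked up if ClusterPolicy points devicePlugin.config at it:

```sh
# Ask the MIG manager to carve every GPU on node-a into 1g.5gb devices
kubectl label node node-a nvidia.com/mig.config=all-1g.5gb --overwrite

# Named device-plugin configs for time slicing and MPS (device plugin v1 config format)
cat <<'EOF' | kubectl apply -n gpu-operator -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: plugin-config
data:
  time-sliced: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 3
  mps-shared: |
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 10
EOF

# Pick which named config applies to which node
kubectl label node node-b nvidia.com/device-plugin.config=time-sliced --overwrite
kubectl label node node-c nvidia.com/device-plugin.config=mps-shared --overwrite
```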
From an end user's perspective with the single strategy, they don't necessarily know or care whether they're getting a MIG device or a full GPU. You might want to advertise it this way so users don't have to change their pod specs when requesting these. On the flip side, you can set it up in what we call the mixed strategy, which changes the name of the resource you request, because you know you want exactly a 1g.5gb device versus a 2g.10gb device and so on. So there are different modes of operation that you, as the admin, can decide between for how you want to share these GPUs and set these things up. The APIs will be very similar once we get DRA, but, as I said, you'll be able to do it as an end user as you request access to the GPUs, rather than it having to be set up a priori and you just grabbing a reference to what's already there. Next slide.

To enable all of this, as Shiva mentioned, we have a component in the system called the NVIDIA Container Toolkit. Just like the driver, which runs in a containerized environment — you run the driver container on the host, it installs a kernel module for the NVIDIA driver, installs the user-space libraries inside the container itself, and mounts that back onto the host — a similar thing happens with our toolkit component. It's a containerized installer of the NVIDIA Container Toolkit. When you run this container (the three cylinders on the right of the box), it goes and installs, back on the host, the NVIDIA Container Toolkit binary that your container runtimes need to call out to in order to actually inject GPU support into a container. It exposes some of the socket files you see here, and depending on which runtime you have configured, it also updates the configuration files for those runtimes so they know how to call out to the toolkit at the appropriate time.

The GPU Operator automates this entire process. Going from top to bottom in the list on the right: the first thing it does is determine whether you have your driver installed directly on the host or whether you've used the GPU Operator-managed driver container to install it; it detects which mode of operation was used. It then optionally updates the default runtime you have in place. You don't necessarily have to go through the toolkit for every single container on your system, whether it uses GPUs or not; if you don't want it set up as the default runtime, you can say that you only want to use this runtime for containers that need access to GPUs. To enable that, it adds a RuntimeClass spec so you can specifically reference the runtime that's going to use GPUs. Once it has updated the configuration files for these runtimes, it hooks back into the system and, if you're using systemd, restarts the systemd unit that represents, say, containerd. And then it does a bunch of other stuff, basically, to get the toolkit up and running so GPU support works inside your containers.
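As a concrete sketch of that wiring: the RuntimeClass below is what a pod references to opt in to the NVIDIA runtime without making it the node default, and the commented TOML is roughly the shape of what ends up in the containerd config (the binary path depends on where the toolkit was installed, so treat both as indicative):

```sh
cat <<'EOF' | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

# Roughly what gets added to /etc/containerd/config.toml before the containerd
# systemd unit is restarted (indicative, not a verbatim dump):
#
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#     runtime_type = "io.containerd.runc.v2"
#     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#       BinaryName = "/usr/bin/nvidia-container-runtime"
```

A pod that wants the non-default runtime then just sets runtimeClassName: nvidia in its spec.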
Thanks, Kevin. Let's also look at the diverse workloads that we support. Typically it's container workloads — injecting GPUs into containers — but recently we have also seen multi-tenant use cases and the need to run certain AI/ML applications securely, so we've been looking at virtualization solutions like KubeVirt and Kata. KubeVirt support has been there for a few years now: we support virtual GPUs and passthrough GPUs, and we have a good solution for KubeVirt VMs. Kata Containers is currently in tech preview: you can launch a Kata container with a passthrough GPU, and this is documented and published as a tech preview. The GPU Operator makes it easy to pick the workload type for each node; each of these workloads needs a different software stack — a different runtime, different plugins — and just by labeling the node to say whether you want container workloads, VMs with vGPU, or VMs with passthrough, we automatically bring up the necessary software stack. Again, refer to my colleague's talk from yesterday on the work we're doing with Kata Containers and confidential computing.

GPU monitoring — I won't go too deep into this one. We deploy the DCGM exporter, either with the DCGM engine built in or with DCGM launched as a separate container, and we automate the lifecycle of the DCGM components. We also create the ServiceMonitor automatically, and you can dynamically change which metrics you want DCGM to collect. The operator also has metrics built in, both operator-level and operand-level, so SRE teams can easily check whether every component is functioning correctly and watch the progress of upgrades.

We support a broad ecosystem, whether it's on-prem clusters or cloud providers: various cloud providers, on-prem Kubernetes variants, and all the container runtimes. On operating systems we're adding more; currently we mostly support Ubuntu, CoreOS, and RHEL, and we're planning to add support for other operating systems soon. You can find the support matrix at this link. For troubleshooting, we have a script that gathers all the required logs, so if you see any issues, use that command to collect everything.

So let's go into the demo. We're going to show how to deploy the operator, how the driver gets deployed, and how the device plugin gets deployed. We're also currently adding support for MPS — we're planning to release an RC version next week — and I'm going to show MPS in this demo as well. We start by installing the GPU Operator with a simple Helm command. Once the install is complete, we can watch the different node labels get applied, the rest of the stack come up, and the validation that confirms the driver installed successfully. Soon after the GPU Operator install, the NFD pods come up and add node labels saying, OK, this node has a GPU. Based on that, the GPU Operator brings up the rest of the pods: we can see the driver daemonset coming up, the container toolkit coming up, and some of the pods sitting in the Init state.
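For reference, the install step in the demo is essentially the documented Helm flow — a minimal sketch, with the namespace name being the conventional one rather than anything specific to this demo:

```sh
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace --wait
```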
This is the ordering I was talking about earlier. While the driver initializes, we can see the different node labels the GPU Operator adds to control each of these components. Once the device plugin comes up, we can see the allocatable GPUs on each node; at first it's still zero, and once the driver installation is complete the allocatable GPU count increases. Each node has one GPU: one node has an A100 GPU and two nodes have L4 GPUs.

Now we'll go into applying the different GPU sharing techniques. The first is MIG: just by applying a MIG config on the node saying I want all 1g.10gb profiles, the MIG manager created seven MIG instances on that node, and from the driver container we can see all the MIG partitions on that GPU. We also apply custom configurations for time slicing and MPS: we define a custom device plugin configuration to create time-slicing replicas and MPS instances. So on the first node I'm applying the MIG configuration, on the second node the time-slicing configuration, and on the third node the MPS configuration, and we can see the different replicas created for each: the first node has seven MIG instances, the second node has three replicas, and the third node still shows one while MPS is being configured. On the third node we can see the MPS daemon come up, and then the device plugin and GFD restart, and soon the allocatable GPUs on that node increase to the ten replicas we configured for MPS. Remember, there is one GPU in each node, and now with time slicing and MPS we have multiple replicas per node. Then I run some workloads against each of these: MIG workloads, time-slicing workloads, and MPS workloads. With time slicing we launch a job with three replicas and see three pods come up, and with MPS we launch ten replicas and see ten pods come up. OK, with that, we'll move on.

Some of the lessons learned throughout this journey of developing operators: containerizing drivers is not easy. It's easy to get the driver install done, but managing the lifecycle is very challenging; we learned those challenges and built advanced controllers to manage the drivers. CRD management is another big issue: as versions and APIs change, you end up managing multiple versions of a CRD, and Helm has its own challenges around updating CRDs, so we're learning through that and have added features in the operator to handle it. Memory consumption is another issue: we're a cluster-scoped operator, so the client cache can get huge, and we're looking into watching and caching only selected resources to keep memory consumption down. And we have the dependency on node labels from NFD, so we're looking at how to make sure NFD doesn't bring down any of our components if it loses access to the API server.

And quickly, some of the upcoming features: we're adding better health monitoring and reporting in the GPU Operator.
We're adding support for Kata Containers — I mentioned it's in tech preview, and it's going to go GA — and confidential container support, which is also going GA this year, plus more on heterogeneous and pre-compiled drivers. And DRA, obviously, is a big topic at this conference, so DRA integration is another main feature we're looking to bring into the GPU Operator. These are the resources: if you have any feedback, please create an issue on the open source project or reach out to us, and this is the documentation link.

Questions?

I have a question about your plans for future operating systems, because in Germany SUSE Linux is used by a lot of companies, and it's also the preferred operating system for SAP. Thank you.

Thanks for the question; this comes up a lot. We are looking into SUSE Linux. The main challenge we had is supporting driver containers: we didn't have the right base images to build and publish them. What we did instead is support users building their own container images; we have the process and steps documented, so users can build their own driver container images and deploy them through the GPU Operator. So users can still run it today, but officially we're working through how we can publish and manage those images through NGC. It's something we're looking into, and we can keep you posted.

Is there an official process to apply for that — for SUSE, to get these images built or made generally available?

I think our product managers usually go through the requirements. Maybe you can create a GitHub issue and we can definitely prioritize it; if you have a good use case and a lot of customers are using SUSE, we can definitely prioritize it. OK, thank you.

Hi, thanks for the demo and the session, very useful. You probably covered it, but I wanted to understand a little bit more about GPU management and efficiency. You mentioned and demoed MPS, MIG, and time slicing, but what wasn't very clear to me was when to use what. Are there scenarios or best practices for which one to use when?

It depends a lot on your workload, and we have a blog post that tries to walk through why you might use one versus the other. We didn't link the blog post in these slides, but if you go to the keynote talk I gave on Wednesday, there's a link at the bottom of one of the slides that should help you make this decision. There's a giant matrix of the advantages and disadvantages of the different approaches, and the blog post also talks about the types of applications that could benefit from the different strategies. And if you still have questions, reach out to us afterwards and we can try to guide you based on your specific requirements. There's no one right answer, I guess, is the short answer. OK, we'll take a look. Thank you.

Hi there, over on this side. If you're using a managed cloud provider, something like AKS, and you're selecting NVIDIA node pools, doesn't the provider, like Azure, already install these drivers? And how does that work if I want to use the operator and some of the advantages it has? I just wonder if you could speak to that.

Yeah, it's a great question. So we do support that kind of configuration.
Typically, by default, if you're using Azure Linux — sorry, Amazon Linux — those images have the drivers pre-installed and the container toolkit pre-installed, and we support that kind of configuration. You can still take advantage of the MIG configuration, the MIG manager, and the toolkit container that we have; you can still take advantage of all those components. But we are working with AKS and EKS specifically to add support for their native operating systems themselves — in this case, we're looking to add support for Amazon Linux itself — so that you can use the driver container instead of relying on a pre-installed driver. That is in the works.

So it's actually possible, if Azure already has those drivers installed, that I can still put the operator on and use the features like labeling the nodes to set up MIG and so on?

Yes, we have documented the process for that. We automatically detect that the driver is pre-installed and disable the driver container on those nodes. So yes, we support that. Thank you very much.

Thank you. My first question was actually much the same — whether pre-installed images from the cloud providers are supported, and apparently, yes. The second one: you mentioned that monitoring GPU health is a problem. From your experience, what are the main issues that happen? Because from what I've seen, if you have issues with your GPUs, either the machine won't boot or everything freezes.

Let me answer the first question. In general, for all the operands the GPU Operator manages, you can decide whether you want them managed by the operator or pre-installed on the system. If you've installed your driver manually on the host and don't want the operator to manage the driver lifecycle, you can turn off the driver container. If you've manually installed the NVIDIA Container Toolkit on your host, you can tell the operator not to install it because you've already done that step. The same goes for any of the other components: there are options you can set when you deploy the Helm chart for what you want on versus off. And for the second question, maybe you can answer.

In terms of error monitoring, that is true — we can't recover most of the time. What is lacking today is properly propagating those errors at the Kubernetes level: there is no indication at the node level that some GPU is unhealthy or what the issue with it is. What happens today is that the device plugin detects these errors and makes sure that GPU is no longer allocatable — that's all. You'll see the number of allocatable GPUs go down, but there's no indication of which GPU has gone bad or what the error was. That's where we're improving things: propagating those as node conditions, saying this GPU has gone bad, and figuring out how to recover that node. It's also worth pointing out that we're internally trying to come up with a more comprehensive solution for not just GPU health but node health in general, and the plan is to eventually integrate that kind of node/GPU health solution as an operand that the operator can deploy and people can make use of.
We're still trying to figure out what that's going to look like, but that's the long-term plan. Thank you.

Hey, you mentioned you have a containerized driver for the GPU. Can you explain a little bit the difference between the native GPU driver and the containerized driver?

Yeah. First off, there was a talk yesterday — Shiva referenced it in his slides here — that goes into great detail on what our GPU Operator-managed driver looks like and how we manage its lifecycle. But the main idea is that "containerized driver" is kind of a misnomer; it's really the driver installer wrapped inside a container. When you run this container, it installs the driver — the kernel module — onto the host. It installs the user-space libraries into the container image, but then it makes the root of that running container's filesystem available back on the host, so that from the host's perspective there's a path to that driver for anything running directly on the host. So it really only differs from a host install in where the path to those user-space libraries lives.

OK, so does that mean NVIDIA will maintain two different GPU driver versions and users have the option to choose one of them?

You can't have both installed at the same time: you either do a host-installed driver or you use the driver container, not both.

But will both coexist as projects?

It's the same driver, whether you install it directly on the host, where you get it at the root filesystem, or you install it in the container. Think of the container as just the OS into which you're installing it now. It's the exact same driver; it's just the method by which it's made available to software on the host. OK, thank you.

And if by "the same project" you mean the same cluster, then yes, it's still possible to have some nodes with a pre-installed driver. I think what he meant was whether we maintain different drivers, containerized versus not — and it's the same driver at the end of the day. Yeah, same driver. OK, thank you.

Thanks for your good talk. You mentioned Kata Containers in your talk; I want to use it to strengthen isolation. But as far as I know, Kata Containers doesn't support multi-GPU or vGPU, right? Do you have a plan to improve Kata Containers to support that?

It's definitely in the plan — the roadmap is to add support for multiple GPUs and also vGPUs. The current focus for us is to take single-GPU passthrough to GA, and then confidential containers with a single GPU. Eventually, this year or next, we plan to add vGPU and multi-GPU support. OK, thanks. Thank you.

All right, I think that's all the questions we have time for. Thanks, everyone. Thanks, everyone.