I think I'm going to get started. So hi, everyone, thanks for joining me today. Welcome to my session, Enable GPU Acceleration Without Worrying About Managing Device Drivers. I know it's a mouthful, sorry about that. My name is Chris Destiniotis. I work as a software engineer at NVIDIA on our cloud-native team, where we work on enabling GPUs in containers and Kubernetes.

In this talk I'm going to be strictly discussing Linux device drivers. I'll give a quick definition of what I mean by a device driver and cover the typical methods for installing them in Kubernetes today. I'll focus the latter half of my talk on some of the day-two challenges of managing drivers, especially on a larger cluster. I'll present some solutions we've built, especially for NVIDIA GPUs, and then I'll end with a demo.

So what is a device driver? Well, simply put, a driver abstracts a piece of hardware. On a Linux system there is typically a kernel-space component, a loadable kernel module that you build against your kernel and have to recompile on kernel updates, and alongside it some user-space driver libraries that provide abstractions in user space. These are usually versioned together, and you need both of them to leverage your device or accelerator. So for my talk, when I say device driver, I mean both the user-space shared libraries and the kernel module.

Now, if we were to install drivers on your local system, you would either build from source or go to the package manager for your distribution and install a package, an NVIDIA driver package, say. But you can't do that in Kubernetes; we need to install drivers on lots of machines. So what are some strategies?

First, you can use SSH with familiar tools like Ansible or Puppet: have a common recipe and go node by node, installing drivers directly on the host. This is simple, and the tooling may be familiar to your team. But obviously we're not using Kubernetes to manage our drivers, so I'm not going to focus on this for the rest of my talk. You need separate tooling, you need to maintain it, and you need yet more tooling for monitoring and logging your drivers.

The second approach is to build your own operating system image for your GPU worker nodes and embed the driver in it: have your own image and maintain it. This is simple to understand and reliable. It's also secure, because you're not installing things at runtime. The major con is that it's not as flexible: if you want to deploy new drivers, you typically have to rebuild your OS image, and maybe that takes time in your organization. Also, if you have existing nodes, you need to re-provision them, and that may be a lengthy process. Additionally, if you have both CPU and GPU nodes in your cluster, you may need to maintain a different image for each, which adds overhead.

The third approach is to use containers. This is not a new approach: you package your driver and a set of install scripts into a container image, deploy it everywhere, load your kernel modules, and make your user-space driver libraries available somewhere on the host. This gives you reproducible builds across your cluster, and you can manage everything via Kubernetes, which is great, and is why I'll focus on this method for the remainder of my talk.
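To make the first strategy concrete, here is a minimal sketch of what an Ansible playbook for it could look like. It assumes Ubuntu nodes and an inventory group named gpu_nodes; the package name is illustrative, not a recommendation.

```yaml
# Strategy one, sketched: push a distro driver package to every GPU host over SSH.
- hosts: gpu_nodes
  become: true          # driver installation needs root
  tasks:
    - name: Install the NVIDIA driver from the distribution's package manager
      ansible.builtin.apt:
        name: nvidia-driver-535   # illustrative package name
        state: present
        update_cache: true
```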
And just as important, this also works on container-optimized operating systems, like CoreOS or Flatcar Linux, where the root filesystem is mostly read-only. You can't just SSH in and install a package; those are operating systems meant for running containers on. One con, obviously, is that you do need to run a pod with elevated privileges. We are loading a kernel module, so it's like you're running as root, and that's something to be cognizant of when you're using driver containers, as we call them. For the rest of my talk, because we can manage this through Kubernetes, I'm going to focus on this method, but the other ones are valid and you're free to use them.

So quickly, what is a driver container, or a driver pod if we deploy it in Kubernetes? It's just a container that does a few things. It compiles your driver against the kernel, it loads the kernel modules, and it also mounts part or all of the container's root filesystem onto the host somewhere using a hostPath volume. That's because the device nodes and driver libraries need to be available somewhere on the host so they can be reused, so that containers that want to use the driver, to use a GPU for example, can have the driver libraries injected into them. And once that's done, this pod is done; it can just sleep. We can deploy this pod via a DaemonSet so that one driver gets installed on every node with the hardware; there's a sketch of what such a DaemonSet could look like after this overview. In this case, I have some nice NVIDIA DGX servers with lots of GPUs, and I install one driver on each of them.

Now, we could deploy a static DaemonSet, and that's fine, but for complicated devices like NVIDIA GPUs, configuration can be a little complex, and managing the lifecycle can be complex too. So we would prefer to build an operator to automate things and give an easier user experience for enabling certain features. For NVIDIA GPUs specifically, we have an operator, the NVIDIA GPU Operator, which I think has been mentioned at this KubeCon already. It automates the deployment and lifecycle of most of the NVIDIA software you need to use GPUs in Kubernetes. It includes the driver container I just showed, but it has a lot of other components too, like runtime components, our device plugin, monitoring tools, et cetera. I highly encourage you to attend a talk tomorrow by a few of my colleagues, who will go over this operator in much more detail. I'm just going to focus on the driver aspect, what we offer, and some things I think you'll find useful for driver management.

So what do we support today with our operator and NVIDIA drivers? Some key features I want to highlight. First, for any distribution that our operator claims to support, we build and publish containerized driver images on our NVIDIA container registry. This means that if you install the operator on your cluster, we will automatically pull the container image matching your worker nodes' distribution and compile and load the driver on them. That's nice. We are also building and supporting pre-compiled driver images, that is, driver images with pre-compiled drivers built into them, so you don't have to compile at runtime. There are some nice benefits to that. You don't have to compile, so deployment is quicker. There are fewer runtime dependencies: you don't need to pull kernel headers to compile against the kernel, and you don't need certain parts of the toolchain.
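Here is the driver DaemonSet sketch I mentioned. It is a minimal, hypothetical manifest, not NVIDIA's published one: the image name, install script, and node label are all illustrative, and the /run/nvidia/driver path matches the convention described later in the talk.

```yaml
# Hypothetical driver DaemonSet: build and load the kernel modules, expose the
# container's driver installation to the host via hostPath, then sleep.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-driver
  namespace: gpu-drivers
spec:
  selector:
    matchLabels:
      app: gpu-driver
  template:
    metadata:
      labels:
        app: gpu-driver
    spec:
      nodeSelector:
        example.com/gpu: "true"            # illustrative label on GPU nodes
      containers:
      - name: driver
        image: registry.example.com/gpu-driver:535-ubuntu22.04   # illustrative
        securityContext:
          privileged: true                 # needed to load kernel modules
        command: ["/bin/sh", "-c"]
        args:
        - ./build-and-load-driver.sh && sleep infinity   # install once, then idle
        volumeMounts:
        - name: driver-root
          mountPath: /run/nvidia/driver    # driver files shared with the host
          mountPropagation: Bidirectional
      volumes:
      - name: driver-root
        hostPath:
          path: /run/nvidia/driver
          type: DirectoryOrCreate
```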
And because of that, it's easier to deploy in air-gapped environments, since we don't depend on the network anymore. It's also easier to support secure boot, because you can sign your drivers at build time, beforehand. We've started to publish such images for certain distributions, in particular Ubuntu 22.04 and some kernel flavors there. And when you enable this, the operator will detect which kernels you have in the cluster and deploy a DaemonSet per specific kernel version, so the right image gets pulled.

We also support other features, such as virtual GPU drivers. This is an NVIDIA feature where you can share a GPU with multiple virtual machines; there's a host and a guest driver component for that, and we support containerizing both and managing them through the operator. We also support the open GPU kernel modules, the open-sourced NVIDIA modules that were announced, I think, in 2022; those work in driver containers too. And we support other advanced features like GPUDirect RDMA and GPUDirect Storage. These are technologies that let you do direct data transfers between the GPU and other external devices, like InfiniBand devices or storage, bypassing the CPU. There are kernel modules that have to be loaded on your system to enable these, and if you opt in, our operator will actually build and load those modules for you. So it's quite nice.

So now that we have an overview of the operator and some of its features, how does it actually work at a fundamental level? What is a driver container, and how does it work for NVIDIA GPUs? You have an OS, you have a kernel, you have a distribution with a container runtime, and then you deploy the driver container. The first thing it does is build the kernel modules and load them into the kernel. If you have pre-compiled packages, it just links and loads them instead of compiling. We then run a daemon called nvidia-persistenced for NVIDIA GPUs. Essentially it keeps a handle open with the driver; it's mostly used for performance reasons, so you get quicker startup times for your GPU applications. Its lifetime is tied to the container, so it runs as long as the container is running on your host. Like I said before, we have to use a hostPath volume to make sure the driver installation done in the container is available on the host, so that applications that run later can access it. We typically mount this somewhere writable on Linux systems, somewhere under /run; in this case, /run/nvidia/driver.

Before we can actually run a GPU container, there's a project we maintain called the NVIDIA Container Toolkit that enables GPUs in containers. It's runtime-agnostic: it has a set of tools to make sure your runtime, whether it's CRI-O or containerd, can support running GPU containers. So we have to install that too. And then you can run your CUDA container. I'm skipping Kubernetes scheduling here; assume a pod got scheduled and you're running a CUDA container that requested one or more GPUs. The container runtime will then make sure that the files under /run that need to be injected, whether device nodes or driver libraries, are made accessible to your container, and your container can run just fine and leverage CUDA and our driver.
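For reference, this is roughly what such a workload looks like from the user's side: a plain pod that requests the extended resource advertised by NVIDIA's device plugin. The sample image tag is one commonly used in NVIDIA's documentation; treat it as illustrative.

```yaml
# A CUDA pod requesting one GPU. The scheduler places it on a node with a free
# GPU, and the runtime injects the device nodes and driver libraries for it.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1    # resource name advertised by the device plugin
```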
Okay, so we have an overview of how the driver container works, and I want to focus the next part of my presentation on some challenges with managing the lifecycle. There are a couple of challenges I want to highlight in my talk. The first is that clusters are not all the same. You may have GPU nodes with different hardware or different system configurations, so it may be necessary to manage different driver versions or different driver configurations, and doing that at scale can be a little difficult. The second is upgrades. I don't need to explain why upgrades are difficult, but driver upgrades in particular are more challenging because they're like performing node maintenance: you're unloading a kernel module and loading a new one, which is disruptive to applications on the node that are using the GPU. So some care needs to be taken.

I'm going to quickly cover heterogeneous clusters: the problem, and what we've built to solve it. As a cluster administrator, what are some things you would like to be able to do with your GPU nodes? The first is to deploy different driver versions in your cluster. An example: you have a cluster with older GPUs, you want to add more capacity with the latest and greatest, and maybe the drivers supported on the older generation are not the same as on the newer one. So you have to maintain different driver versions for those sets of nodes. Another scenario is different operating system versions. Maybe you're adding new capacity with Ubuntu 22.04 but still have some nodes running Ubuntu 20.04, or you're upgrading existing nodes and want to make sure GPU drivers keep working on both operating system versions during the upgrade. Maybe some nodes have RDMA devices and require those additional drivers to leverage GPUDirect, while others don't. So there's some heterogeneity there, and the list kind of goes on; I'm not going to go through all of it, but if your cluster is not built all the same, you may need to deploy different drivers in different configurations.

So, using an operator, what did we build? We have a CRD called NVIDIADriver that lets you simply create different instances. If you have different pools of nodes with different configurations, you can deploy one NVIDIADriver instance per pool. It allows you as an admin to logically partition your GPU nodes in terms of which drivers you want on them. It supports different driver versions, different operating system versions, and even different driver types, so you can deploy the virtual GPU drivers I mentioned earlier, or not. In this example we have four nodes: the first two get their driver from the red configuration, and nodes three and four get theirs from the purple instance of the NVIDIADriver CR.

A more concrete example: we have two simple NVIDIADriver instances. One I've named Kepler, because I have a group of nodes with K80 GPUs attached, which are very old and only supported up to the 470 driver branch. On the right, I have an instance called Ampere with some A100s, a much later generation of GPU, supported by the latest branch, which is 550. So this allows me to logically partition things.
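Here is a hedged sketch of what those two instances could look like. The kind and general shape follow the operator's NVIDIADriver CRD, but treat the exact field names, node labels, and driver versions as illustrative and check the shipped CRD schema before use.

```yaml
# Two NVIDIADriver instances partitioning the cluster by GPU generation.
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: kepler
spec:
  driverType: gpu
  version: "470.239.06"      # last branch supporting the K80 nodes (illustrative)
  nodeSelector:
    gpu.pool: kepler         # illustrative label on the K80 nodes
---
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: ampere
spec:
  driverType: gpu
  version: "550.54.14"       # current branch for the A100 nodes (illustrative)
  nodeSelector:
    gpu.pool: ampere
```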
I have some old GPUs and some new GPUs, I need to manage them differently, and this makes it quite simple to do that. Also, when you give the operator a node selector saying, okay, this is the label that identifies my pool of nodes, it will go through those nodes and check which operating system versions are running on them. If it detects that you have different versions, it's smart enough to deploy a different DaemonSet per operating system version. So you get that out of the box when you use this CRD. That was the simple part.

I'm going to focus now on driver upgrades, which I'll spend a little more time on. What are the basic requirements when you want to upgrade your driver? First, pretty obviously, you want to maintain availability of your cluster and of the applications that are using the GPU. You also want to be able to perform the upgrade in a controlled fashion: if you have critical workloads using the GPU that can't be disrupted, we want to be able to selectively not drain nodes or kill workloads during an upgrade, and instead wait for the critical work to finish. At the same time, you still want automation built around this. We don't want to manually go node by node and start maintenance on each one, and we want to be able to monitor progress and spot failures when they happen.

Over the next few slides I want to walk through what the DaemonSet controller can offer us. We're deploying drivers via a DaemonSet, and you can upgrade a DaemonSet. So what are the update strategies, and do they actually meet the requirements I just laid out? There are two update strategies: RollingUpdate and OnDelete. With OnDelete, the DaemonSet controller doesn't do anything on its own: you update the DaemonSet spec, and then you have to go node by node and delete the old pod for the new version to get rolled out. So there's no automation there. RollingUpdate is the traditional rolling update as you know it: the DaemonSet controller selects nodes to upgrade and rolls out new versions of your pod, or of your driver in this case.

Because RollingUpdate is automated, let's go through a quick example and see if it satisfies our requirements. We want to upgrade the driver, so we patch the DaemonSet, say to the latest version of the NVIDIA driver. The DaemonSet controller picks a node to upgrade, deletes the old version of the pod, and rolls out the new one. This might seem to work: it compiles the driver. But when it tries to load the kernel modules, it fails, because you have drivers already loaded. So we go back to our driver pod definition and add some logic to unload a previous install: if something is already installed, uninstall it. Quite simple. But what if there are applications using the GPU? Then you'll also fail to unload, because the modules are busy. So we add more logic to our driver pod: if we can't unload, we need to drain, to kick workloads off the node. If we do that, then we can install the new version of the driver successfully, and repeat this on all of our other nodes.
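For reference, this is where those two strategies live in the DaemonSet spec; only the relevant fields are shown.

```yaml
# The two DaemonSet update strategies discussed above (abridged manifest).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-driver
spec:
  updateStrategy:
    # RollingUpdate: the controller picks nodes and replaces pods automatically,
    # with no awareness of GPU jobs that may still be running there.
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  # With "type: OnDelete" instead, the new pod only rolls out on a node after
  # you delete the old pod there yourself: full control, no automation.
```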
So what I showed is that while we can use the DaemonSet controller for driver upgrades, you don't really have control: you have to drain the node, and that can be disruptive. If you have long-running work, training for example, you don't want to lose lots of progress. You want to either wait for a training epoch to finish or for something to be checkpointed before draining the node and migrating that workload. So this won't suffice for managing GPU driver upgrades.

What about the OnDelete strategy? Well, that's the other extreme: it gives you full control. As an admin you can go node by node, cordon them, drain or wait for jobs to finish, and then initiate the upgrade. Obviously this is entirely manual, so we're not going to go with this approach, but it's entirely valid. This is what people typically do with our drivers today if they deploy via a DaemonSet. They're very careful, they have their own processes in place, they know what's running and which nodes are ready to upgrade, and they go in and coordinate and trigger upgrades themselves. But they may also build their own automation, and we don't want teams duplicating that; we want a common solution for performing upgrades.

So how do we solve this problem? In our operator we built a controller to facilitate the upgrade. The core idea is that we configure our driver DaemonSets with the OnDelete update strategy, because we don't want the DaemonSet controller to actually carry out the upgrade. Instead, our controller goes in, cordons nodes, waits for jobs to finish according to a policy you can configure as an admin, and then proceeds with the upgrade. What's nice is that the controller emits metrics during the upgrade to give you some idea of the progress and of any failures that happen on nodes. And like I said, the upgrade behavior is configurable through a CR. Something else to note is that we worked with another team at NVIDIA, the network operator team. Their operator helps configure the networking stack for GPUDirect RDMA, so they also manage drivers and have the same problem. We collaborated on a shared package that we can both use in our controllers to manage upgrades.

Here's an example of some of the configuration you can do; there's a sketch of it right after this. You can set how many parallel upgrades you're okay with at one time, the maximum unavailability of the cluster during an upgrade, and some basic configuration for which pods you want to wait on before proceeding with an upgrade. An empty selector means wait for all GPU jobs to finish, and you can specify a timeout: if you're okay waiting forever, wait forever; otherwise specify something non-zero. And you can opt in to draining, or disable it if you don't want to drain at all during an upgrade.
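Here is a hedged sketch of those knobs as they might appear under the driver section of the operator's ClusterPolicy; the field names are paraphrased from memory, so verify them against the operator docs for your version.

```yaml
# Illustrative upgrade policy for the driver, matching the options above.
driver:
  upgradePolicy:
    autoUpgrade: true          # let the upgrade controller orchestrate rollouts
    maxParallelUpgrades: 1     # how many nodes may upgrade at the same time
    maxUnavailable: 25%        # cap on unavailable GPU nodes during the rollout
    waitForCompletion:
      podSelector: ""          # empty selector: wait for all GPU jobs
      timeoutSeconds: 0        # 0: wait forever; non-zero: give up after that
    drain:
      enable: false            # set true to drain remaining pods before upgrade
      timeoutSeconds: 300
```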
So here's a quick example of what our controller looks like when it's facilitating an upgrade. Like I said, we configure the DaemonSet with OnDelete, and our controller drives the process. The first thing the controller does during reconciliation is go through all of your nodes with drivers and check whether an upgrade is required. Right now the driver pods have the same revision as the parent DaemonSet, so they're in sync and there's nothing to do. We label each node with the state it's in; in this case the label's value says the upgrade is done. The label just represents what state the driver on the node is in, and right now everything is done.

As an admin, we go in and update the DaemonSet, and on the next reconciliation this label changes: suddenly the pods are out of sync with the DaemonSet and an upgrade is required, so the nodes get relabeled. The controller, based on your policy, initiates a rolling upgrade. It picks a node to upgrade, in this case the first one, and the next state is to cordon that node. After that, we wait for GPU jobs to complete, based on your configurable policy. Once those jobs complete, we move to another state, which drains any remaining GPU jobs if there are any. Then we restart the driver pod so the new one gets rolled out, do validation to make sure the new driver is healthy, uncordon the node, mark it as done, and then repeat. So it's pretty simple.

Okay, how are we on time? Pretty good. I'm going to do a quick demo showing both of the controllers to give you a better idea. I recorded this in advance. It might be a little small, but I'm going to highlight some things and walk you through it. I have a cluster on AKS for this demo, with two node pools. One is a T4 pool with two nodes, each with a T4 GPU. The other is a V100 pool, also with two nodes, each with a V100. That gives me a total of four GPU nodes in my cluster, and I have a little metric on my dashboard to show that. Beforehand, I installed the latest GPU Operator Helm chart and already brought up the entire stack, drivers included. So what I'm going to show you is a metric of how many nodes have the 525 driver installed, which is two, and how many have the 535 driver, which is also two. I've basically deployed different driver versions on my different pools, and I'll show that in the terminal.

I pre-created the NVIDIADriver instances: one named T4 for my T4 pool, where I'm deploying the 525 driver, using pre-compiled drivers because I have images available for these kernels. That's why we see two 525 drivers installed in my cluster. For my V100s, the spec is essentially the same, but I'm installing the 535 driver, also using pre-compiled images, and that's why we see two over here. If we inspect the actual DaemonSet spec for one of these, I can show you the image tag that gets resolved; it has both the driver version and the kernel version in it. Here are the DaemonSets in our namespace. We have a bunch of them, including two driver DaemonSets, one per pool. This is the DaemonSet for the T4 pool: I'm installing the 525 driver for this specific Azure kernel on Ubuntu 22.04. The operator automatically detects which kernel is running in the pool and constructs the tag appropriately.

For the rest of my demo, I'm going to trigger an upgrade for my T4 pool, moving it off the 525 driver and onto the 535 driver. Before I start the upgrade, I'm going to launch a very silly job to keep my T4 GPUs busy.
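The "silly job" could be as simple as the sketch below: a batch Job that holds each T4 for three minutes so the upgrade controller has something to wait on. The image tag is illustrative.

```yaml
# Keep both T4 GPUs busy for three minutes so the driver appears "in use".
apiVersion: batch/v1
kind: Job
metadata:
  name: keep-gpus-busy
spec:
  completions: 2             # one pod per T4 node
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: sleep
        image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative tag
        command: ["sleep", "180"]
        resources:
          limits:
            nvidia.com/gpu: 1   # claiming a GPU marks the driver as in use
```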
So I'm running a job that's going to sleep for three minutes on both of my T4s. I'll skip ahead a little. The pods are running, and they'll run for three minutes. I go to my T4 instance and manually change the version from 525 to 535 to trigger an upgrade. Immediately after I type it, my T4 driver is reported as not ready anymore: the spec is out of sync with what's actually installed on the nodes.

Let me explain this dashboard. These are metrics from our driver upgrade controller, showing how many nodes have finished upgrading. Initially all four of my nodes were in sync, the drivers matched the spec, so no upgrades were pending and none were in progress. But now that I've updated my T4 instance, it shows that only two nodes in my cluster have finished upgrading, and two have upgrades required. Shortly after, we see our upgrade controller has already cordoned one node: it picked one T4 node to start the upgrade. So one node is labeled as upgrade-required, and the other is already waiting for the GPU job to finish, stuck in that state for now.

I'll skip ahead a little; I'm just waiting for my job to complete, typing some commands on the screen. Eventually the job does complete, and the node moves to the next state, which is restarting the driver pod. I'm showing that, and the new driver being deployed. Shortly after, my gauges change, so now there's only one 525 driver installed in my cluster. It validates the new install, and it's done. My other gauge has changed too, so now we have three instances of the 535 driver, and my little charts are updating: three nodes are done, one is left to upgrade, and that one has already been cordoned. The controller has moved on to it, and it upgrades quickly because no jobs are running anymore. The driver pod gets restarted, the gauges change, and there we go. Everything is done, all my charts are back to what they were initially, and we're all upgraded from 525 to 535.

Okay, just to summarize. Managing device drivers at scale is difficult. Driver containers let us manage the lifecycle through Kubernetes, which lets us build some cool solutions on top to automate the pain points. What I demonstrated in this talk is how we can deploy and manage different driver versions and configurations through controllers, as well as manage upgrades in a controlled yet automated fashion. So that concludes my talk. Happy to answer any questions you have. I think we need to give you a mic first.

I have two questions. The first one is, do the GPUs require special firmware, and how are firmware updates handled? The second question is, if secure boot is switched on, you need to sign the drivers. How are the drivers signed?

So yes, you do need firmware, and as for how updates are managed: we don't manage firmware updates. That's something that's been asked for, but we haven't supported it with our operator yet. Maybe it's something we'll consider supporting in the future, or maybe there will be a separate operator dedicated to firmware updates on Kubernetes. And for secure boot, yes, you're right.
You have to have signed driver packages. When I was talking about pre-compiled driver containers: for Ubuntu, Canonical signs our drivers, so all of the pre-compiled driver image tags on our registry are actually signed. If you run them on a secure boot system with Ubuntu 22.04, they'll load just fine. If you're running non-pre-compiled images, you have to add your own trusted keys into the image, so the drivers get signed at build time, but not many people are doing that. Typically, pre-compiled is the way to go. As we add support for more distributions like RHEL, we're in discussions with them about how to get signed packages from them, so we can build container images with those included.

Hello. In the second part of the talk, when you started talking about the DaemonSets, the drivers you were referring to were the user-space drivers, right? Leaving the kernel modules aside, you still need to add those to the kernel on the node itself.

So, everything I showed covers both. When I was doing an upgrade, I was actually unloading the old version of the kernel module and loading the new version, as well as making the new version of the driver libraries available on the host.

And the DaemonSet does all that?

It takes care of it, yes. It's running as privileged, because it has to load and unload kernel modules, but it does that.

And that's why you said the path needs to be exposed to the operating system? The part where you said you need to show what's inside the container to the operating system?

The reason I mentioned a hostPath volume, mounting something from the container to the host, is so that you can make the driver installation, everything like device nodes and shared libraries, available somewhere on the host where it can be shared with other containers that want to use the device and the driver.

Okay, thanks.

Yep. Hi, do you think there could be any issues updating a network driver with the same approach you used to update the video driver? Maybe we could expand it and upgrade the network card driver as well. Do you see any challenges with that?

I mean, like I mentioned, our network operator team, which installs the MOFED stack, is using this approach just fine. But I'd rather have you reach out to them to see if there are any particular challenges there that we don't encounter with GPU drivers, because I'm not aware of any today.

Yeah, actually, on my side I also have two questions for you. We've already been using this approach for almost a year, and it has been working. But, for example, 24.10, the new Ubuntu version, will be available in almost a month, right? How fast do you provide drivers in your NVIDIA registry? And the second one is also slightly related to secure boot. Yes, you have some signed drivers in the registry, but not all the drivers are signed; really, a lot of them are not. For big companies that's actually difficult, because, for example, we have services using GPU drivers across, I don't know, 20 teams, and we always want to be in sync, but in some places we can't, because the drivers are not signed, and that's always a challenge for us. So do you have any plans to try to sign all the drivers? Because I don't see any downside on your side.
As long as we get signed packages from the operating system vendors, like Canonical or Red Hat, which are the main distributions we support, we are fully on board with providing images that contain those signed pre-compiled packages. For Ubuntu we do: that's why, when new kernels come out, we publish tags automatically; we have automation that does that relatively quickly after a new kernel releases. For Red Hat, we're still in the process of getting that sorted out, so it may take some time. I forgot your first question, I'm sorry about that. You can catch me afterwards if you want to ask it again.

Hi, my question is about whether you've run into any device-number coordination challenges, and also, if an update comes with a system crash, do you have any rollback suggestions?

Yeah. So what do you mean by device numbers? Device major and minor number issues?

Yeah, because when we update the firmware, we sometimes need to identify the device by serial number and device number, but in a container that can fail.

I'm not aware of any issues, but maybe you can give me more context afterwards about which exact numbers were different in the container than on the host. What was your second question?

After a firmware update that ends in a system crash, is there any rollback approach you can suggest from the container side?

For driver upgrades, we can do a rollback with the DaemonSet, by rolling back the version of the DaemonSet. If there's a system crash, we don't really have good failure recovery currently for GPU nodes. Our device plugin marks your node as unhealthy if there's some sort of driver or hardware issue, so you can't schedule jobs on it, and it requires you to manually go in, fix the problem, and then bring the device plugin back. So we don't really have good failure recovery, if that answers your question, I hope.

All right, I think I'm getting kicked out, so thank you all for attending. Appreciate it.