Hello everyone, my name is Kevin Klues and I'm a Principal Software Engineer at NVIDIA. Today I'm going to be talking about the support I built for multi-instance GPUs in containers and Kubernetes. Multi-instance GPUs, or MIG for short, is a new hardware feature of the latest NVIDIA Ampere GPUs, so everything I talk about today is specific to this type of GPU. Before I begin, I'd like to remind you that I am available for chat during this talk, so as questions come up, please feel free to drop them on Slack and I'll make sure to answer them as best I can. I imagine if you're at this talk, you probably already understand the benefits of supporting GPUs in Kubernetes, so I'm only going to spend a few minutes highlighting some of their more important use cases. First and foremost, supporting GPUs in Kubernetes lets us scale up AI training and inference jobs to a cluster of GPU machines using the same underlying infrastructure we use to deploy all of our CPU-based workloads, such as web servers, databases, streaming data servers, etc. Using the tools we've built as a community, users are able to take a Kubernetes pod spec, specify some number of GPUs, and direct their application to a particular class of GPUs that can meet their workload demands. A typical GPU cluster then looks something like this: you have a bunch of servers, each hosting some number of possibly heterogeneous GPUs, all being managed under a single Kubernetes instance, where training jobs tend to require multiple more powerful GPUs, and inference or development jobs tend to only require a single instance of a less powerful GPU. But what if you don't have the luxury of having such a diverse mix of GPUs in your cluster? Or, as is often the case, you have a small cluster with only the most powerful GPUs you can get your hands on and want to be able to efficiently use them for all types of workloads?
And even if you do have a bunch of nodes at your disposal, wouldn't it be better to have your cluster full of the most powerful GPUs possible so that you have them when you need them, but then have a way of sharing an individual GPU for jobs that only require a fraction of its full power? Well, this is where multi-instance GPUs come in. It's hardware support for taking a full GPU and dividing it into several smaller, what we call GPU instances, each of which has its own dedicated set of memory and compute resources. Additionally, each GPU instance gets a dedicated percentage of the overall memory bandwidth, and any faults that occur are isolated to a single instance without disrupting the others. From the perspective of the software consuming it, each of these GPU instances really looks like its own individual GPU and can be treated as such by the end user. All GPU instances run in parallel with predictable throughput and latency, providing improved quality of service over previous GPU sharing solutions, such as MPS or CUDA streams. It allows you to get a right-sized GPU for your job in cases where you don't need the full power of an entire GPU. And it's already supported in a diverse set of deployment scenarios, including bare metal, vGPU, standalone containers, and Kubernetes, which we're focusing on today. So with that brief introduction, this is the outline for the rest of the talk. I'm going to start by going into a bit more detail about what multi-instance GPUs are and how they can be used. I'll then give a brief overview of how full GPUs are supported in containers and Kubernetes today, followed by the details of how we have added MIG support to them. After that, I'll talk a bit about the system level interface for MIG, which has made all of this possible, followed by a discussion on best practices and tooling available for provisioning a set of MIG devices across a cluster of machines.
Finally, I'll summarize everything I've talked about here today and end with a demo. OK, so as I said before, MIG is a way of taking a full GPU and dividing it into a smaller set of what we call GPU instances, each of which has some amount of guaranteed memory and compute resources available to it. But what are some typical use cases for multi-instance GPUs? Well, in general, MIG is useful any time you have an application that doesn't require the full power of an entire GPU, but does want isolation guarantees similar to those provided by full GPUs. This might be a single user who wants to run multiple inference jobs for some sort of experimentation they are doing, or it might be several trusted users from the same organization, each running model exploration in a Jupyter Notebook on their own dedicated GPU instance. And given the hardware isolation that multi-instance GPUs provide, you really can treat them as separate GPUs, allowing untrusted users to run on their own dedicated instances, as would be the case when running on a managed cloud service, such as Google Cloud AI or Amazon SageMaker. But can we really think of a GPU instance as a replacement for a full GPU? Well, in the case of running inference on the BERT natural language processing model, a single GPU instance constituting one-seventh of a full A100 GPU was able to perform on par with a full V100 GPU running the same application. Likewise, running seven of these GPU instances in parallel gave us roughly seven times the overall throughput of a single GPU instance, showcasing the linear scalability you would expect from seven distinct GPUs. So why seven? This number keeps cropping up, and it's not obvious where it comes from. Similarly, what exactly goes into creating these various GPU instances, and how do you get access to them? Well, the best way to visualize this is with a diagram like the one you see here.
This shows how the 40 gigabyte A100 card can be broken down into a set of eight five-gigabyte memory slices and seven of what we call compute slices. It's important to note that there's also a variant of this for the 80 gigabyte A100 card, which has 10 gigabyte memory slices instead of five, as well as the newly announced A30, which has four six-gigabyte memory slices and four compute slices. For the purpose of this talk, I'm going to stick with the 40 gigabyte A100 variant, but everything I talk about is applicable to all of these cards. So coming back to this question of why seven: ideally we would have included eight compute slices to match the eight memory slices we have available. However, the yields in the actual silicon make it so we can only reliably get seven compute slices instead of eight, so we are stuck with a somewhat awkward combination. In any case, to create a GPU instance, all we need to do is combine some number of compute slices with some number of memory slices and merge them together. In the case seen here, we are merging a single memory slice with a single compute slice. And that name you see in the middle there, that 1g.5gb, that's an example of our canonical naming convention for this combination. Anyone familiar with MIG has likely seen names like this already in the output of nvidia-smi. Now, jumping to a slightly bigger GPU instance and digging inside of it, we see that we can actually perform a second level of partitioning, which subdivides a GPU instance into a set of what we call compute instances, all of which share access to the memory of the wrapping GPU instance but have their own dedicated compute resources. And when we do this, you'll notice that we tack on an extra parameter to the front of the canonical naming convention to denote this. Once we've done this, we now have all the pieces required to form what's called a MIG device, which is the actual entity recognized by CUDA on top of which workloads are able to run.
It is represented as a three-tuple of the top level GPU where the MIG device has been created, its GPU instance, and its compute instance. This definition will become important later on when I start to talk about how MIG support is made available to containers and Kubernetes. So anyway, jumping back to our compute instance example, the compute instance may be created to consume the first compute slice, or the second, or the third, or the fourth, or it might take up two compute slices, or three, or all four. And in the special case where a single compute instance consumes all of the compute slices of its wrapping GPU instance, we talk about the GPU instance equaling the compute instance, and drop the compute instance portion from the canonical naming scheme. Going back to the more simplified name shown here, dropping this prefix is unambiguous because these names are meant to refer to the actual MIG devices they represent, and not any individual GPU instance or compute instance. And in the context of Kubernetes, these are actually the only type of MIG devices we currently support: because all compute instances share access to the GPU memory of the wrapping GPU instance, subdividing into multiple compute instances does not fit well into the isolation guarantees one normally expects from the Kubernetes resource model. As such, for the remainder of this talk, I'll only be referring to MIG devices of this type. We may expand on this support in the future as things like pod-level resources start to take shape. For now, this is what we have, and I'll go into the details of how all of this works a little bit later on. Now, with all of this said, we can't just arbitrarily create MIG devices with any combination of memory slices and compute slices. For example, this one isn't allowed, and neither is this one. In fact, there is a distinct set of MIG devices which actually can be created.
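As an aside, on a real machine you don't have to memorize that set — nvidia-smi can enumerate it for you. A minimal sketch, assuming a MIG-enabled A100 and a reasonably recent driver (exact output varies by driver version):

```shell
# List the GPU instance profiles this GPU supports (1g.5gb, 2g.10gb, 3g.20gb, ...)
sudo nvidia-smi mig -lgip

# List the valid placements for those profiles, i.e. which slices each
# profile is allowed to occupy on the GPU
sudo nvidia-smi mig -lgipp

# Create a 1g.5gb GPU instance; -C also creates the default compute
# instance inside it, yielding a usable MIG device
sudo nvidia-smi mig -cgi 1g.5gb -C

# MIG devices now show up alongside their parent GPU
nvidia-smi -L
```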
You can have one of these, or one of these, or two of these, or three of these, or seven of these. You can also have some combination like this, or like this, or like this. But you can't have something like this, or like this. In general, this diagram represents the physical layout of how these MIG devices are created on the actual hardware. So finding a valid combination consists of walking from left to right and adding devices into the configuration such that no two devices overlap vertically. When applied to an actual node, configurations like the ones you see here are most common. That is, striping a single device type across all of the GPUs on a machine, turning an 8-GPU DGX A100, for example, into either 56 1g.5gb devices, 24 2g.10gb devices, or 16 3g.20gb devices. However, it's not uncommon to also see a configuration like this, where a single node has a good mix of different device types. Or even this, where some GPUs are not configured for MIG at all and some are. It really just depends on your particular cluster configuration and how many nodes you have at your disposal to decide which configuration is best for you. OK, now that we understand MIG a bit better, I'm going to switch gears for a minute and give a brief overview of how we support full GPUs in Kubernetes today. Since general support for GPUs in containers is a prerequisite for Kubernetes, I'll first start there. I'll then jump back to explain how we've added MIG support to containers and Kubernetes later on. Well, the first thing to note is that unlike most containerized applications, NVIDIA GPU-enabled containers require extra runtime support in order to guarantee that they can run on machines with different NVIDIA GPU driver versions installed. If the set of NVIDIA driver libraries in user space doesn't match the exact version of the NVIDIA kernel module running on the host, then applications linking to those libraries will fail to run.
For example, the container shown here will run just fine on a host with the V1 kernel module installed, but fail on a host with the V2 kernel module installed. To solve for this, we provide a package called the NVIDIA Container Toolkit, which takes care to ensure that compatible NVIDIA driver libraries are injected into a container at runtime, as well as giving applications access to any required GPU hardware. Many of you have likely interacted with the NVIDIA Container Toolkit through a Docker command like the one shown here. This basically says: launch a Docker container with GPUs 0 and 1 injected into it and run nvidia-smi over them. To make this possible, this command hooks into a component called nvidia-docker, which is just one small piece of the overall NVIDIA Container Toolkit that I've been talking about. The toolkit itself consists of a stack of components allowing GPUs to be used by many different container runtimes. Different runtimes hook into different layers of this stack, depending on their integration points. For example, containerd hooks in here, CRI-O hooks in here, and LXC hooks in here. And of all of these components, this bottom one is the most important because it does all of the heavy lifting for injecting GPU support into a container. As such, the majority of the code we added for MIG support in containers lives here. In the context of Kubernetes, one needs to ensure that the container runtime in use is configured to work with the NVIDIA Container Toolkit under the hood. For example, Docker can be configured like this, containerd can be configured like this, and CRI-O can be configured like this. And once you have that set up, a component called the NVIDIA K8s device plugin can be installed to allow GPU resources to be requested, as shown here. When invoked, this plugin sets things up so that the NVIDIA Container Toolkit will be triggered under the hood to inject any necessary libraries and GPU hardware into a container at runtime.
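As a sketch of what that Docker command and runtime configuration look like in practice (the image tag is illustrative, and paths may differ by distribution):

```shell
# Launch a container with GPUs 0 and 1 injected and run nvidia-smi over them
docker run --rm --gpus '"device=0,1"' nvidia/cuda:11.0-base nvidia-smi

# Wiring Docker to the NVIDIA runtime happens in /etc/docker/daemon.json,
# along these lines:
#   {
#     "default-runtime": "nvidia",
#     "runtimes": {
#       "nvidia": {
#         "path": "nvidia-container-runtime",
#         "runtimeArgs": []
#       }
#     }
#   }
```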
Additionally, a component called GPU Feature Discovery can be installed to apply labels to a node with the various properties of the GPUs installed on that node. The user can then define a node selector to direct a pod to a node with a specific type of GPU installed on it. In the example here, we are directing the pod to a node with A100 GPUs running CUDA 11 and the NVIDIA R450 driver. So what does this look like in the MIG world then? Well, if this is how we inject full GPUs into a container, then this is how it has been extended to support MIG. Specifically, there is a new colon syntax that allows you to specify both the index of the top level GPU where the MIG device exists, followed by the index of the specific MIG device within that GPU. Of course, you can always specify the full UUID of the MIG device you would like to inject as well. Likewise, in Kubernetes, if this is how you request GPUs in a pod spec, then requesting access to a MIG device can be done similarly. Instead of asking for a resource of type nvidia.com/gpu, you instead ask for a resource of type nvidia.com/mig-<whatever MIG device type you want>, in this case nvidia.com/mig-1g.5gb. In order for MIG devices to be advertised like this, the K8s device plugin must be configured with what's called the mixed MIG strategy. It's so named because a mixture of different MIG devices and full GPUs can all be advertised from the same underlying node with this strategy turned on. This is in contrast to what we call the single strategy, which allows one to continue using the existing nvidia.com/gpu resource type, but makes sure that labels are defined such that a user is able to get to the underlying MIG device type that they want. It's so named because in this setup, a node is only able to advertise a single MIG device type across all of its GPUs rather than allowing a mix and match of different types. And that's really all there is to it from the end user's perspective.
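To make the pod spec side of this concrete, here's a minimal sketch of a pod requesting a single MIG device under the mixed strategy (the pod name and image are illustrative):

```shell
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1
EOF
```

Under the single strategy, the limit would instead stay nvidia.com/gpu, paired with a node selector on the product label to land on the right device type.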
So long as you've installed these versions of the components seen here, you should have everything you need to enable MIG support in containers and Kubernetes. So how does all this work under the hood? What is the NVIDIA Container Toolkit actually doing to give a container access to a MIG device? Well, stepping back for a second and looking at what it does for full GPUs, it first injects all of the device nodes seen here at the top and then selectively injects device nodes from the bottom, depending on which GPUs the container should actually have access to. It then limits the view of GPU devices under /proc/driver/nvidia/gpus to something like what you see here on the right, where each entry in this folder corresponds to the PCIe bus ID of a full GPU whose device node has just been injected. When moving to the MIG world, you still need to inject the set of device nodes seen at the top. But instead of selectively injecting a single device node from the bottom, you now need to inject three device nodes, one for each component of the three-tuple representing the MIG device. But what are these funky names for the device nodes representing the GPU and compute instance? Why aren't they just named after the components they represent? Well, it boils down to a new abstraction provided by the NVIDIA kernel driver called NVIDIA capabilities. Whenever a GPU instance or compute instance is created, a set of capability files is created under /proc, which point to the set of device nodes under /dev that will give a user access to them. So take this example, which represents the access files for a MIG device on GPU 0. If we cat the access file for the GPU instance, we see that it contains a reference to the minor number of a device node. Likewise, if we cat the access file of the compute instance, we see something similar.
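In shell terms, poking at these access files looks roughly like this (the gi1/ci0 indices and the minor number are illustrative; they depend on which instances you've created):

```shell
# Access file for a GPU instance on GPU 0; it reports a device file
# minor number rather than a device path
cat /proc/driver/nvidia/capabilities/gpu0/mig/gi1/access
#   e.g. DeviceFileMinor: 33   (illustrative value)

# The compute instance nested inside it has its own access file
cat /proc/driver/nvidia/capabilities/gpu0/mig/gi1/ci0/access

# The device nodes those minor numbers refer to live here
ls -l /dev/nvidia-caps/
```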
And since these devices are all part of this new NVIDIA capabilities framework, we can find the devices they actually represent under the /dev/nvidia-caps folder with the admittedly non-obvious names shown here. Now, similar to how we limited the view of /proc/driver/nvidia/gpus for full GPUs, we also limit the view of this folder for MIG devices, specifically so a container can only see the access files under /proc/driver/nvidia/capabilities for the GPU and compute instances it has access to. OK, so now we know how to give a container access to a MIG device. But how do we actually create these MIG devices in the first place? What software is needed to take a node and configure a bunch of MIG devices on it so that the NVIDIA Container Toolkit, K8s device plugin, and GPU Feature Discovery components can make use of them? Well, unfortunately, it's not quite as straightforward as one might hope. There are actually two distinct workflows you need to consider when configuring a GPU for use with MIG: enabling MIG mode on a GPU in the first place, and then taking a MIG-enabled GPU and configuring it with a set of MIG devices. The commands seen here can be used to perform both of these steps in the case of creating seven 1g.5gb devices across all GPUs on a node. But what steps are required to enable MIG mode on the GPU? Well, unfortunately, it's not as simple as just running the command I showed on the previous slide. Under the hood, a MIG mode switch actually requires a full GPU reset, which not only requires that all GPU workloads are complete, but also that all GPU clients are disconnected from the underlying NVIDIA driver, including components like the K8s device plugin and GPU Feature Discovery. As anyone that has struggled through this process in the past knows, enumerating this set of clients and reconnecting them after a reset is a real pain.
Moreover, when running inside a VM like you would on EC2 or Google Compute Engine, a full node reboot is actually required in order to carry out this reset operation. As such, this operation is considered very heavyweight and should only be carried out very infrequently. In contrast, configuring the actual set of MIG devices on a MIG-enabled GPU is actually pretty dynamic. It does come with its challenges, though. First, the order in which MIG devices are created on a GPU matters. If you remember the diagram I showed you at the beginning of this talk, depending on the order in which you create your MIG devices, you may end up with fragmentation and not be able to create a device that technically should fit on the GPU, because there are not enough adjacent slices available to accommodate it. Additionally, you will need to restart any components that cache previous MIG device state after reconfiguration has taken place. Both the K8s device plugin and GPU Feature Discovery currently qualify for this. And finally, these last two challenges shown here are the ones most complained about when talking with users of MIG: MIG configurations do not persist across a node reboot, and managing MIG device state across a cluster of machines is not well supported. To address each of these challenges, we've created a tool called the NVIDIA MIG Partition Editor, or mig-parted for short. Using this tool, one can declaratively define all of the possible MIG configurations they may want to apply to different nodes around their cluster. They can then make this file available to all of their nodes, and then run the simple nvidia-mig-parted command shown here to apply one of the configs from this file to the node. In addition to this base partitioning tool, we also provide a systemd service wrapper around it which can do three things. First, it can persist MIG configurations across node reboots. It can apply MIG mode changes without the NVIDIA driver being loaded.
And it can automatically handle the start and stop of any GPU clients across the configuration. We are also planning on releasing a Kubernetes service wrapper soon that provides similar functionality to the systemd wrapper, but can be run directly on Kubernetes instead of systemd. It will then become one of the core components of the NVIDIA GPU Operator, which I didn't talk about today, but is one of the key components in NVIDIA's arsenal of GPU support on Kubernetes. And with that, I'm now going to show a demo of all of these pieces in action. I'll first show a Kubernetes setup capable of advertising eight full GPUs on a DGX A100 box and run a pod to consume one of these GPUs. I'll then reconfigure the box to advertise 56 1g.5gb devices using the mixed MIG strategy and run a pod to consume one of these MIG devices. Finally, I'll switch things over to the single MIG strategy and run things again. Okay, so this demo is going to be run on a DGX A100 server with eight 40 gigabyte A100 GPUs installed. To start, all of these GPUs currently have MIG mode disabled, which I can show you using the export subcommand of the nvidia-mig-parted tool as shown here. Next, I'll run a command on the host called nvidia-smi, which will print out a summary of all of the GPUs installed on the node. Here we see all eight A100 GPUs being listed. Recall during the talk, I mentioned that the NVIDIA Container Toolkit was a prerequisite to running with GPU support on Kubernetes. This next command just prints out the versions of the NVIDIA Container Toolkit components I have installed on this machine. The versions shown here are consistent with the minimum versions I showed in the talk for running with MIG support. I also mentioned that the runtime in use by Kubernetes needs to be configured for use with the NVIDIA Container Toolkit. Well, for this demo, our Kubernetes setup is running with Docker, and this command just verifies this configuration.
As you can see here, the default runtime for Docker is set to NVIDIA. This next command just shows that we aren't currently running any NVIDIA components in our Kubernetes deployment. It also shows that we are running Kubernetes in a single node setup with all of its components running on this single node. Next, I'll show the set of allocatable resources that are currently available in the cluster. As you can see, there are some CPUs, some memory, some disk space, et cetera. You may also notice, though, that there are in fact a few NVIDIA resources listed here that are all set to zero. This is just an artifact of the fact that I've run previous versions of the NVIDIA K8s device plugin on this node and the scheduler hasn't cleaned up any state for them yet. The fact that they're all zero, though, shows that none of these resources are currently being advertised. Next, I'll run a set of Helm commands to set up the appropriate repos to install the latest K8s device plugin and GPU Feature Discovery components. I'll then install these components themselves and print out what has been installed from Helm's perspective. Now that I've done that, I'll show these pods running on Kubernetes itself. I'll then show that the set of allocatable GPUs has been updated to eight and that there now exists a set of labels describing the properties of the A100 GPUs on this node. The most notable things here being the GPU product label and the number of GPUs available. And with that, I'll actually run a pod to consume a GPU and print it out via nvidia-smi. And here it is. Now, if you remember, the first command I ran in the demo was nvidia-mig-parted export. I'm just gonna run that again to remind you what its output looked like with MIG disabled on all the GPUs. Now I'm gonna show you the contents of a mig-parted configuration file, where the main thing to note is the all-1g.5gb configuration that I have highlighted.
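For readers following along, the configuration file on screen and the commands used to apply it look roughly like this — a sketch of mig-parted's declarative format, with the file path chosen for illustration:

```shell
cat <<'EOF' > config.yaml
version: v1
mig-configs:
  all-disabled:
    - devices: all
      mig-enabled: false
  all-1g.5gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7
EOF

# Apply the highlighted config, then export to verify what is actually set
sudo nvidia-mig-parted apply -f config.yaml -c all-1g.5gb
sudo nvidia-mig-parted export
```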
This configuration says to take all GPUs on the node and divide them into seven 1g.5gb devices each. And when I then run this next command, this new configuration will be applied. Note, however, that I am running it with the time command because this command has to disconnect all drivers as well as perform a GPU reset to switch each GPU into MIG mode. This command actually takes about five minutes to complete. I won't make you all wait, but I'm including the time command here so we can see how much time has actually passed once we come back. Okay, so as you can see, the command took about five minutes to complete. And if I then run nvidia-mig-parted export again, we can see that the configuration has indeed been applied successfully. And with that in place, I can rerun my commands to print out the set of allocatable resources and the set of node labels that are applied, where we now see 56 1g.5gb devices instead of eight full GPUs as well as a bunch of other labels showing the properties of these MIG devices. I can then run a pod to consume one of these MIG devices as shown here, except that in this case, I'm going to do more than just print nvidia-smi. I'm also going to list out the set of device nodes and the limited view of /proc/driver/nvidia/capabilities inside the container. As you can see, we have access to exactly one MIG device and all three device nodes representing that MIG device are present. We also only see access files for the GPU instance and compute instance of our particular device and no others. Now, the final thing I want to show you is this working with the single MIG strategy instead of the mixed strategy. If you recall, this strategy allows us to overload the nvidia.com/gpu resource type for MIG devices and assumes a user will use a label to direct a pod at the proper device type. So I first upgrade the plugin to this strategy, followed by upgrading GPU Feature Discovery to this strategy.
And then I print out the set of allocatable resources and node labels again, where we now see that there are 56 nvidia.com/gpu resources instead of eight. Likewise, our labels have all been updated to put the MIG properties directly on the gpu resource type rather than on individually named MIG devices, where the product label now encodes which MIG device this resource corresponds to so users can set their node selectors appropriately. If I then go back and run a pod against this setup, you see that I am once again requesting the nvidia.com/gpu resource type, but then inspecting the things you would expect to see when injecting a MIG device. As you can see, everything is present as expected: nvidia-smi shows the correct MIG device, and all of the device nodes and proc files are present. And with that, we've reached the end of my presentation. I'll leave this final slide up and I'll continue to answer any questions you have on Slack. Thank you.