Hey everyone. Welcome to KubeCon. My name is Kevin Klues, this is my colleague Sanjay Chatterjee, and today we're going to be talking about accelerating AI workloads with GPUs in Kubernetes. So let's dive right in.

We are in the midst of the next industrial revolution. From self-driving cars to real-time health monitoring and smart cities, AI is powering it all. And at the heart of this revolution are GPUs and the platform that provides applications access to them. For many, Kubernetes has already become this platform, but we still have a lot of work to do before we can unlock the full potential of GPUs to accelerate AI workloads on Kubernetes. This includes changes both to the low-level mechanisms used to request access to GPUs and to the higher-level processes needed to map a set of GPUs to a workload based on its requests. The biggest change in terms of resource management will be direct support for non-fungible and non-exclusive resources at the node level, while the changes at the higher layers will take the form of things like topology-aware placement strategies and advanced multi-dimensional scheduling.

So to kick things off, I'm going to start with a brief overview of what it takes to enable GPU support in Kubernetes today. I'll then jump into the details of one very specific use case that could benefit from non-fungible and non-exclusive resources, namely GPU sharing. I'll then introduce a new feature called dynamic resource allocation, or DRA for short, which we see as the enabler for taking GPU support in Kubernetes to the next level. Finally, I'll hand things over to Sanjay, who will walk us through some of the challenges with scaling out GPUs on Kubernetes.

So what does it take to actually enable GPU support in Kubernetes today? Well, it takes a mix of host-level components, such as the NVIDIA Container Toolkit and the underlying NVIDIA GPU driver, as well as a set of Kubernetes-specific components, such as the k8s device plugin, GPU Feature Discovery, and the NVIDIA MIG manager. With these in place, one can make requests like the one seen on the right to inject GPU support into their workloads, as well as direct those workloads to a particular type of GPU using a node selector if desired.

To ease the deployment of all these components, NVIDIA provides the GPU Operator, which I'm not going to go into the details of today, but I encourage you to watch the following talks later this week to learn more: one tomorrow at 3:25 p.m. and one on Friday at 11 a.m. For a history lesson on the evolution of GPU support in Kubernetes, I encourage you to also check out the following talks from various organizations at past KubeCons.

Okay, so now that we know how to bring a Kubernetes cluster up with GPU support, how do we actually make the most of the GPUs we have available to us? Well, there are five primary techniques that can be used to share GPUs amongst multiple workloads: time slicing, MPS, MIG, vGPUs, and CUDA streams. Most of these techniques have been available to Kubernetes users in one form or another for quite some time, the one exception being MPS, which we plan to release official support for in the next couple of weeks.

So what's the difference between all of these sharing techniques? Well, as you can imagine, time slicing provides the ability to run several workloads concurrently on the same GPU rather than spreading them across multiple GPUs. Each workload has access to the full capabilities of the GPU, but they alternate in time.
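To make the earlier point about GPU requests concrete: a request "like the one seen on the right" is, in essence, a pod spec along these lines. This is a minimal sketch; the container image tag and the node label value are illustrative assumptions rather than the exact contents of the slide.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-workload
    # Any CUDA-enabled image works; this particular tag is just an illustrative assumption.
    image: nvcr.io/nvidia/cuda:12.3.2-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1              # count-based GPU request served by the device plugin
  nodeSelector:
    # Optional: pin the pod to a specific GPU model via a GPU Feature Discovery label.
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
```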
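And to make time slicing concrete as well: with the NVIDIA device plugin, opting GPUs into time slicing is driven by a small sharing config, usually delivered as a ConfigMap that the plugin is pointed at. The snippet below is a sketch; the exact wiring depends on the device plugin version you deploy, and the replica count of 4 is just an example.

```yaml
# Sketch of an nvidia-device-plugin sharing config (check the docs for your plugin version).
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4   # each physical GPU is advertised as 4 time-sliced nvidia.com/gpu devices
```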
MPS, in contrast, provides a method of space partitioning, where instead of alternating workloads on a shared GPU in time, each workload remains resident on the GPU without being swapped off, but with access to only a fraction of its total memory and compute capabilities. MIG is similar to MPS in that the resources of the GPU are space-partitioned, but the partitioning is done at the hardware level rather than in software, meaning that MIG devices are suitable for multi-tenant environments where MPS is not. Now, vGPUs are interesting in that they can be configured for either time slicing or space partitioning using MIG, with the added property that each vGPU is wrapped in a virtual machine, making them suitable for multi-tenant environments under both configurations. And unlike the previous sharing techniques, which provide shared access to a GPU at the system level, the final one, CUDA streams, is a programming abstraction that allows you to run multiple kernels in parallel from within a single application.

And with all of these sharing strategies in place, you can actually layer them on top of one another in order to maximize GPU utilization across all of your workloads. Different setups call for different strategies in terms of how, or whether, GPU sharing should be used at all. I'm not going to go through this table right now, but I encourage you to read the blog post linked at the bottom of this slide to learn more about the advantages and disadvantages of each strategy in different scenarios. I also recommend checking out these talks to learn more.

Okay, so the last thing I'm going to talk about before handing things over to Sanjay is DRA. DRA is a new way of requesting resources in Kubernetes, available as an alpha feature since Kubernetes 1.26. It provides an alternative to the count-based API used to request, for example, nvidia.com/gpu: 2. What makes it so powerful is that it puts full control of the API to select and configure resources directly in the hands of third-party developers. It also gives you the ability to precisely control how resources are shared between containers and pods, which addresses one of the main limitations of the existing device plugin API as it stands today. In the interest of time, I'm not going to dive too deep into the details of DRA right now, but I encourage you to stick around for these two talks later today to learn more. I also encourage you to check out these talks from previous KubeCons, in particular my talk from last November, which focuses specifically on DRA with GPUs. And with that, I'll hand things over to Sanjay, who will take it from here.

Hello, KubeCon. It is such an honor to be here today, as we're living in the most incredible moment of our lives. The world is discovering and falling in love with generative AI. Today, I want to highlight NVIDIA Picasso, a generative AI foundry for building and deploying foundation models for computer vision. I feel so proud to share that Kubernetes is driving Picasso's lifecycle from training to inference. Today, I will focus on training and the challenges we have faced when scaling out with GPUs. For the inference part, I hope you got to attend my colleague Jan Chen's talk yesterday on the costs of inference. And because this journey on Kubernetes started back in 2020, you can check out our KubeCon talks from that year for more details.

Now, the training workloads are essentially batch jobs that require anywhere from 8 to 512 GPUs, and they all depend on all-reduce collective communication between all the GPUs; that's what synchronous stochastic gradient descent requires.
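To give a feel for what such a multi-node training job looks like from Kubernetes' point of view, here is a heavily simplified sketch using a plain indexed Job with one 8-GPU pod per DGX-class node. The real Picasso pipeline uses its own tooling (typically with an MPI or training operator driving the all-reduce launch), so the name, image, and worker count below are assumptions for illustration only.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-foundation-model             # hypothetical name, not the actual Picasso job
spec:
  completionMode: Indexed
  completions: 16                          # 16 workers x 8 GPUs each = 128 GPUs in this example
  parallelism: 16                          # all workers need to run at the same time
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: example.com/trainer:latest  # placeholder training image
        resources:
          limits:
            nvidia.com/gpu: 8              # one full DGX A100 node's worth of GPUs
```

Note that a plain Job like this does not gang-schedule its pods or place them with any topology awareness, which is exactly the gap the scheduling capabilities discussed next are meant to fill.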
However, to reliably run multi-node jobs of various scales and priorities from multiple users on a shared GPU cluster, we realized the need to augment Kubernetes with advanced scheduling capabilities: for example, gang scheduling, starvation handling, topology-aware placement, fairness algorithms, and much more. Many of these features exist in open-source projects today, but we still need comprehensive solutions in Kubernetes when scaling out with GPUs. Today, I will touch upon the top three challenges, namely topology-aware placement, fault tolerance, and multi-dimensional optimization.

First, let's look into the challenges with topology-aware placement. To satisfy the massive compute demands of generative AI, scale-out clusters need to interconnect thousands of GPUs. Inside a DGX node, for example the DGX A100, there are eight GPUs, and they can all directly communicate with each other using NVLink. Beyond the NVLink domain, however, multi-level rack and spine switching units with GPUDirect RDMA support can scale these clusters out to hundreds or thousands of DGX nodes. So to schedule multi-node jobs, we have to be aware of two topology-related constraints: first, an optimal placement that minimizes the hop distance between the nodes, and second, bin packing these multi-node jobs within switching hierarchies to improve cluster occupancy. And within a node, the application launcher also needs to be NUMA-aware so that it can bind the training processes to the appropriate CPU core complex. For example, in a DGX A100 node there are eight NUMA nodes, but the eight GPUs are connected to only four of them. Hence, topology-aware placement of a training application is key, both within and across nodes.

Next, let's look at fault tolerance and resiliency. Since GPU clusters constantly operate at peak performance, it is not uncommon for electrical and mechanical components to degrade or fail over time. These component failures can lead to issues like GPU throttling and even job failures. So we need to detect these signals and isolate the faulty nodes so that we can diagnose and repair the faulty components. To operate this at scale, however, it is imperative that we automate this fault-handling lifecycle. The first step is to build observability into the GPU infrastructure with in-band and out-of-band GPU monitoring. NVIDIA's infrastructure monitoring components, like DCGM and InfiniBand diagnostics, combined with the Kubernetes Node Problem Detector and other Linux utilities, can help surface these system faults to the control plane, typically as node conditions. Based on these detected node conditions, a scheduling control flow with fault-handling capabilities, combined with application-level checkpointing, can ensure that training jobs run to completion. For example, if pods are augmented with an init container that runs pre-flight host- and network-level tests, then faulty nodes can be detected even before the application starts running. And if faults are detected after the init phase, then the application may need to be checkpointed and the pods rescheduled to a healthy set of nodes. Hence, automated fault-tolerant scheduling, which is both proactive and reactive, is an essential requirement when scaling out GPU clusters.
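As an illustration of the pre-flight idea just described, here is a rough sketch of a worker pod with an init container that checks GPU health before the trainer starts. The images and the exact checks are assumptions; in practice you would run whatever diagnostics (DCGM, NCCL or network tests, and so on) fit your environment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer-worker-0                                 # hypothetical worker pod
spec:
  restartPolicy: Never
  initContainers:
  - name: gpu-preflight
    # Fail the pod early if the node's GPUs look unhealthy; richer checks (e.g. DCGM
    # diagnostics or NCCL bandwidth tests) would go here in a real deployment.
    image: nvcr.io/nvidia/cuda:12.3.2-base-ubuntu22.04   # assumed image/tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 8
  containers:
  - name: trainer
    image: example.com/trainer:latest                    # placeholder training image
    resources:
      limits:
        nvidia.com/gpu: 8
```

If the init container fails, the pod never reaches the training phase, and the node-condition and rescheduling machinery described above can take over from there.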
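And circling back to the NUMA point from the topology discussion: one Kubernetes-level mechanism that helps align a pod's CPUs and memory with the NUMA nodes its GPUs hang off of is the kubelet's CPU and topology managers. This complements, rather than replaces, a NUMA-aware launcher inside the container, and the snippet below is only a sketch of the relevant kubelet settings; whether a stricter policy such as single-numa-node is appropriate depends on whether your pods consume a subset of a node's GPUs or the whole node.

```yaml
# Sketch of kubelet settings for GPU nodes (merge into your existing KubeletConfiguration).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static            # give Guaranteed pods exclusive CPU cores
topologyManagerPolicy: best-effort  # prefer NUMA-aligned CPU/GPU allocations without rejecting pods
topologyManagerScope: pod
```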
And finally, putting it all together, I want to highlight the multi-dimensional optimization problem. Imagine we have a 40-node cluster with two switching units. There are 16 nodes currently occupied running a job from user one, and 24 nodes that just became available. And there are four multi-node jobs of various sizes and requirements waiting to be scheduled, three of them from user one and one from user two. Now, interestingly, depending on the KPI to optimize for, the scheduler can end up choosing a different job. To optimize for starvation and topology, it'll choose J1. For occupancy, it'll choose J3. For priority, it'll choose J2. And for fairness, it'll choose J4. So which one should the scheduler choose?

For most users, the most important KPIs are the priority and starvation of their own jobs. For the business, the objective is occupancy of the cluster. And cluster admins always try to allocate resources fairly amongst users. This is a classic multi-dimensional optimization problem. So we need to think about a configurable, multi-objective optimization framework that will make deterministic decisions by considering all the global constraints in a GPU cluster, and not just when scheduling, but also when finding the optimal victim set for preemption or the optimal node group when doing node reservations.

Today, I want to conclude with a call to action. This is a great time to solve challenging problems with generative AI and GPUs in Kubernetes. NVIDIA just announced NVIDIA Blackwell, and amongst many other features, there is built-in hardware support for resiliency. We want Kubernetes to take advantage of that. Kevin has been engaging with the community on the low-level mechanisms to enable GPU resource management, and we will keep engaging to solve the GPU scaling challenges as well. For many, this is the Linux moment for Kubernetes. Let's make it happen. Thank you all.