It is now my pleasure to introduce Dr. Gabriel Noaje from NVIDIA Corporation, and he'll be talking a little bit about NVIDIA cloud solutions. Over to you.

Thank you, Mathias. It's great to be here, and thanks everyone for joining this session. I'm Gabriel Noaje, a senior solution architect at NVIDIA. I'm based in Singapore but covering Australia and New Zealand, and I work with a lot of universities, supercomputing centers, and other entities in Australia and New Zealand. So I'm happy to be here, as I said, and I'll be talking a little bit about the work NVIDIA is doing to integrate cloud-native technologies: how do we leverage GPUs in containers and in Kubernetes? You've seen in the previous talk from Stefan that he mentioned some challenges. I hope to answer some of those, and if there are still things you think are not working properly when you're using GPUs, please reach out to me. You have my email address there, and I'll be putting it in the document that Mathias is sharing. So please feel free to reach out with any further questions after the session.

With that, let's get started. First of all, I wanted to introduce the NVIDIA GPU Cloud. In the previous talks, both presenters talked about different repositories that are available for containers. A couple of years ago NVIDIA started similar work to create a repository of containers, which we called NVIDIA GPU Cloud at the time, or NGC for short. It might not be the best name; again, I don't totally agree with our marketing department. It's not a proper cloud per se, so despite the name, don't think you can actually get access to a bunch of virtual machines with GPUs. What NGC is, is a repository of containers, or at least that's how it started. We now have more than 100 containers for high performance computing, deep learning, and machine learning, and we are continuously expanding and adding new applications and new containers to this repository.

But as I said, this was just the beginning. We have expanded since then to add a lot of new things. We are now providing pre-trained models. For those of you doing deep learning, we provide pre-trained models that you can simply download and use as they are, or go a step further and retrain using techniques like the Transfer Learning Toolkit to customize those pre-trained models for your specific workload. We're also providing industry application frameworks across a wide range of verticals, from healthcare to computer vision to automatic speech recognition. For all these verticals, we know it's pretty difficult to create this type of application or end-to-end solution, so what we're putting together are frameworks: sometimes a suite of containers, sometimes one big container with all the tools NVIDIA provides for GPU acceleration in that specific domain. And we're making these available, again, under NGC.

The other thing we provide under NGC is Helm charts. You might be familiar with Helm; it provides an easy way to do deployments. There are a couple of tools available as Helm charts, and you can see two mentioned there: Triton and the GPU operator. Triton provides inferencing services, which again relies heavily on containerization, and the GPU operator is an easy way to prepare your environment for running GPU containers.
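As a concrete sketch, assuming Helm 3 is installed, browsing those NGC Helm charts might look like this; the repository URL is NGC's public Helm endpoint:

```
# Add NGC's public Helm repository and refresh the local index
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Browse the published charts -- the Triton and GPU operator charts among them
helm search repo nvidia
```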
We're also providing collections, yet another feature of NGC, where we put together different containers and pre-trained models for specific types of workloads. The beauty of NGC is that it works both on-prem and in the cloud. So if you're using NGC containers, you can seamlessly transition your workloads or your scripts from on-prem to the cloud or from the cloud to on-prem, and even to the edge: you can run these containers on embedded devices with GPU capabilities. Think of the NVIDIA Jetson series, if you're familiar; those are small embedded modules that can run accelerated code. So if you haven't used NGC before and you're in the area of HPC, deep learning, or machine learning, I highly encourage you to take a look at the NGC catalog; I'm pretty sure you'll find something of interest there.

Why did we come up with the idea of creating this NGC catalog? First of all, as you might know and as has been pointed out earlier, it's not always easy for a sysadmin to install an application. This all started when people became interested in deep learning. A deep learning framework might have more than a couple of dozen dependencies when you want to install, for example, TensorFlow, and you need to get all of those right. And I'm not just talking about Python libraries; I'm talking about CUDA libraries, I'm talking about GPU drivers. There are many things that have to come together to get not only a working version of the framework, but a performant version, because sometimes there are compilation flags and other settings you need to get right when installing the application in order to get the best performance out of it.

So we came up with the idea of NGC, where we have an entire team of engineers building these containers. They are built for performance and they're scalable. We now have a monthly release cycle: every month we release a new version of the containers. That's true specifically for the deep learning ones; for the HPC ones it really depends on when the application developers release a new version. And the good thing is that, being managed by NVIDIA with an entire team of engineers working on this, we can ensure a high quality of service. You can be certain that from one version of a deep learning container to the next, the performance will only go up. Obviously, as I said, we do a lot of work and integration with our own libraries and our own compilers, which are updated regularly, and with every release this gets integrated into the containers published on the NGC catalog. So with every new version we're improving the performance of those containers. And it's the same format everywhere; I'm going to talk a little bit more about this, but we do support Docker, CRI-O, containerd, and Singularity.
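As a quick sketch of the Docker path, assuming you've registered for a free NGC account and generated an API key; the container tag below is illustrative, so check the catalog for the current monthly release:

```
# Log in to the NGC registry; the username is literally $oauthtoken
# and the password is your NGC API key
docker login nvcr.io --username '$oauthtoken'

# Pull one of the monthly deep learning releases (tag is illustrative)
docker pull nvcr.io/nvidia/tensorflow:21.10-tf2-py3
```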
It was mentioned previously, and we've been asked the same question: why don't we create a Singularity repository? It's actually easier to have a Docker container and then, pretty seamlessly, with a single command line, transform the Docker container into a Singularity one. As was mentioned previously, there are a couple of other reasons too, and it's not that difficult to just have a unified repository of Docker images and then use those with other container runtimes. We support running in multiple types of environments, bare metal, virtual machines, and Kubernetes, and I'm going to talk about a couple of these scenarios. As I already mentioned, we support multi-cloud, on-prem, hybrid, and edge, so basically any combination of these will allow you to use NGC. And once again, this is enterprise-ready software: we do testing for reliability and regressions, we scan the containers for CVEs and malware, and we publish the reports, so everything is pretty strictly controlled.

So there are multiple things available: a wide range of use cases, a wide range of pre-trained models, and a fairly wide range of resources in NGC. A new feature we introduced recently is called credentials, and it allows you to track, for example, how a model has been trained or how a particular container has been built. You can see what has gone into building that container or how a specific model has been trained. And as I already mentioned, we have end-to-end AI workflows for different types of applications, from speech recognition to recommender systems to intelligent video analytics. These are large domains, or verticals, for which we're targeting end-to-end frameworks, and only a couple of them are mentioned here. For example, Metropolis is our end-to-end workflow for smart cities, or more specifically intelligent video analytics. All this work can be deployed everywhere: we have very tight integration with almost all the major public cloud providers, and we have a range of OEM systems certified by NVIDIA, where we make sure the NGC containers work perfectly on those certified systems. So if you haven't had the chance, I highly encourage you to go to ngc.nvidia.com. We have multiple resources like technical blogs and webinars. Simply go there and register; registration is free, you don't need to pay anything to access the NGC repository. Take a look and see if there's something of interest for you.

So NGC is the foundation of our cloud-native approach to delivering consistent deployments. Multiple things have to come together to make this cloud-native initiative come true, and we are working on multiple fronts. I already mentioned Helm charts. We are working on integrating GPUs into Kubernetes, and I'm going to talk in the next couple of slides about how we do that. We are putting everything together in the form of a tool called the GPU operator, which automates the management of GPU resources in Kubernetes: the installation, the monitoring, and the deployment of containers with GPU support inside Kubernetes. And we have the NGC private registries available as well. I'm going to focus in the next couple of slides on these two topics, mostly Kubernetes and the GPU operator, and what we are bringing to the community to make this integration easier.
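To make that single-command Docker-to-Singularity transformation concrete before moving on, a minimal sketch, again with an illustrative image tag:

```
# Build a Singularity image directly from the NGC Docker image
singularity build tensorflow-21.10.sif docker://nvcr.io/nvidia/tensorflow:21.10-tf2-py3

# Run it with GPU support; --nv binds the host's NVIDIA driver libraries
singularity exec --nv tensorflow-21.10.sif python -c "import tensorflow"
```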
Now that I've introduced NGC and the rationale for creating it, I'm going to switch gears and talk a little bit about the support for GPUs in containers, and specifically in Kubernetes, and how we achieve that. Since general support for GPUs in containers is a prerequisite for Kubernetes, I'm going to start there. The first thing to note is that, unlike most containerized applications, an NVIDIA GPU-enabled container requires some extra runtime support to guarantee it will run on machines with different NVIDIA GPU driver versions installed. Here's a very simple example: if you have NVIDIA driver libraries in user space that don't match the exact version of the NVIDIA kernel module running on the host, then applications linking to those libraries will fail to run. So if you have a container with version one of the NVIDIA driver libraries and you run it on a system with a version one kernel module installed, it will work. But if you try to do the same thing on another system with version two of the kernel module installed, it's going to fail and the container won't work.

To solve this, we provide a package called the NVIDIA Container Toolkit, which takes care of ensuring that compatible NVIDIA driver libraries are injected into a container at runtime, and of giving the application access to the required GPU hardware. There are multiple layers inside the NVIDIA Container Toolkit, and I'm not going to go through all the details, but having the NVIDIA Container Toolkit is basically a prerequisite for having GPUs exposed to the container runtime. Many of you have likely interacted with the NVIDIA Container Toolkit through the Docker command line, but I want to give you a one-page overview of how that works. You've seen previously how to run a Docker command, or you've probably had experience with that. To run it using the NVIDIA Container Toolkit, if everything is set up correctly, you just add --gpus and then the devices you want to expose inside the container. In this case we're saying devices zero and one, and here you actually get the details of what those GPUs are. What this says is: launch a Docker container, but with GPU support, inject support for GPUs zero and one into the container, and run the nvidia-smi command from that Docker image.

To make this possible, the command hooks into a component called nvidia-docker, which is just one small piece of the overall Container Toolkit I'll be talking about. The toolkit itself consists of a stack of components allowing GPUs to be discovered and used by many different container runtimes. Depending on the container runtime, they hook into different layers of this stack, at different integration points. So this is what the integration looks like for Docker; this is how containerd hooks in; and this is how CRI-O hooks into the same environment. As you can see, they hook in at different levels, and this is how LXC also hooks into the same toolkit.
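Written out, the Docker command from that overview looks roughly like this; the CUDA image tag is illustrative:

```
# Launch a container with GPU support, injecting GPUs 0 and 1, and run
# nvidia-smi inside it; the inner quotes keep the comma intact for Docker
docker run --rm --gpus '"device=0,1"' nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi
```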
Of all these components, the bottom one, libnvidia-container, is the most important, because it does all of the heavy lifting of injecting GPU support into the container. Without going into the specifics, you probably know there are multiple things that have to happen from a container perspective: the device has to be passed into the container environment so that, from a user perspective, you can simply type --gpus and then the devices you want to expose inside the containers.

Now that we've established how we support GPUs inside containers, in the context of Kubernetes the first step is to ensure that the container runtime in use is configured properly to work with the NVIDIA Container Toolkit under the hood. Each runtime has its own way of doing that: for Docker, there's the daemon.json file that has to be edited, so this is how you configure Docker; this is how you configure containerd; and this is CRI-O. Again, I don't want to spend too much time on this, I'll provide the slides and there are a couple of other references you can take a look at, but this gives you an idea that we are working to integrate with as wide a range of solutions and runtimes as possible.

From a user perspective, which is what most of you are probably interested in, once everything is set up, a component called the NVIDIA Kubernetes device plugin can be installed to allow GPU resources to be requested as shown here. What you're seeing here is basically a config file where the resources are exposed as nvidia.com/gpu. It says there are four GPUs, and then the specifics for that node: the specific model of the GPU, the CUDA runtime, and the CUDA driver that is being exposed through the Kubernetes device plugin. So the device plugin is what makes Kubernetes aware of the presence of the GPUs. There's also a second plugin called GPU Feature Discovery, an additional component that can be installed to apply labels to a node with its various properties. If you have GPU Feature Discovery installed in Kubernetes, these labels are populated automatically: it gets the specific model of the GPU, the version of the runtime, and the version of the driver. Through a node selector, a user can then direct a pod to a node with a specific type of GPU installed. For example, if you have a cluster with multiple types of GPUs, as shown here with T4s, V100s, and A100s, you can specify the configuration of the node you want the container to run on.
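To make the runtime configuration and the device plugin request concrete, here is a minimal sketch; the daemon.json content follows the NVIDIA Container Toolkit documentation, while the pod name, image tag, and label value are illustrative:

```
# /etc/docker/daemon.json: make the NVIDIA runtime Docker's default runtime
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker

# A pod that requests one GPU from the device plugin and uses a
# GPU Feature Discovery label to land on a specific GPU model
# (label value illustrative)
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  nodeSelector:
    nvidia.com/gpu.product: Tesla-V100-SXM2-16GB
  containers:
  - name: cuda
    image: nvidia/cuda:11.4.2-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```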
I'm going to move through the next slides a little quickly, no worries. One other thing we introduced recently for Kubernetes: what I presented so far is how to expose an entire GPU inside a container and Kubernetes. In the latest generation of GPUs, and I'm talking here about the Ampere architecture GPUs, namely the A100 and the A30, we introduced a new feature called Multi-Instance GPU, or MIG.

Multi-Instance GPU allows you to slice a full GPU into smaller components, so you can create multiple GPU instances on a single GPU. Each of them has a dedicated number of streaming multiprocessors, the execution cores, a dedicated portion of the memory, and a dedicated portion of the L2 cache. This allows simultaneous workload execution with guaranteed quality of service, and that is the biggest advantage: if you're running something on one GPU instance and a second program crashes on another GPU instance, that's not going to affect whatever is running on the first one. There are multiple ways to do the partitioning; Multi-Instance GPU is pretty flexible. Depending on the GPU you're using, and I'm going to refer mostly to the A100, which is our flagship GPU, you can partition the GPU into up to seven different partitions. The work running in any of those partitions is, as I said, completely isolated. And you can deploy this in different types of environments: on bare metal, in containers, or through Kubernetes. Because the main purpose of this talk is the integration with Kubernetes, I'm mostly going to refer to that.

But first, let me show you why you would want to do this. One of the main use cases for Multi-Instance GPU, and we actually use this a lot internally, is that a specific process might not be able to fill the GPU, and this is especially true for inference. When you want to run multiple inference jobs, each of them might not use the entire capacity of the GPU, so you want to run multiple inference jobs inside the same GPU. You might have, for example, small processes taking queries for an automatic speech recognition application. You might have a use case with a single tenant but multiple users: think of a small development team, or a university with a couple of GPUs and a classroom where you're teaching a deep learning or HPC course. You don't want to give each student in the class a full GPU just to run a small application; you can divide that GPU into seven MIG instances, and each student gets their own portion of the GPU. Or, and this is what we mostly see at cloud service providers, you can look at multi-tenant, multi-user scenarios: they want to make the most out of the GPUs and the infrastructure, and based on the different requirements their users have, they want to make sure the hardware is highly utilized.

So why would you want to do this? This is just a very simple performance graph: one seventh of an A100 GPU is more or less the equivalent of one previous-generation V100. So if you're running with seven MIG instances on the A100, you're getting roughly seven times the throughput of a V100. We've done a lot of benchmarking on this; if you're interested, I can provide you more details.
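To show what that partitioning looks like in practice, a sketch using the standard nvidia-smi MIG commands on an A100; the seven-way 1g.5gb split is just one possible layout:

```
# Enable MIG mode on GPU 0 (takes effect after a GPU reset or reboot)
sudo nvidia-smi -i 0 -mig 1

# Create seven 1g.5gb GPU instances and their compute instances (-C)
sudo nvidia-smi mig -i 0 -cgi 1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb,1g.5gb -C

# List the GPU instances that were created
sudo nvidia-smi mig -lgi
```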
If you're familiar with MLPerf, the de facto benchmark suite for the deep learning world, one thing we showed in the latest MLPerf releases is how you can put multiple containers and workloads on each of those different MIG instances, all running at full performance. Now, getting closer to Kubernetes: in a DGX A100, which is our system with eight GPUs, there are multiple ways you can split the GPUs. The easiest, and maybe the most traditional, is seven MIG instances on each A100. Times eight, that gives you 56 MIG profiles that users could potentially access; hypothetically, 56 different users or 56 different jobs can run on the system. But the MIG capability comes with its own challenges from the operating system side, because it is exposed at the kernel level through a separate mechanism. Normally the NVIDIA devices are available under /dev, but the MIG instances are exposed under /proc. That brings a new set of challenges, because you not only have to pass /dev into the containers, you now also have to pass the /proc entries that provide the MIG capabilities.

So how do we integrate MIG into containers and Kubernetes? From a user perspective it is, again, fairly simple. When running, you only have to switch from using the GPU identifier to using a sub-identifier within each of the GPUs. Here you have the same two GPUs you've seen previously. When we were not using MIG mode, you just passed device=0,1 to pass both GPUs. In MIG mode here, we actually created two MIG instances on each of the GPUs, and we're specifying that we want to pass all four MIG instances into the Docker runtime. As you can see, the way to do that is 0:0, 0:1, and so on: the identifier of the GPU, a colon, and then the MIG instance you want inside that GPU, which is what I have highlighted here. Obviously there's no need to pass all of them; you can create any combination you want, following the index of the GPU and the index of the MIG partition. Alternatively, you could use the unique identifier, the UUID, shown here to uniquely identify the MIG instance.

Now, as you've seen previously, GPU support inside Kubernetes is exposed through the Kubernetes device plugin. What we did, and this is already available, is modify it to expose a MIG instance as a resource. Instead of simply having the resource advertised as a full GPU, you have the resource at the MIG level, so you can have a subsection, or sub-partition, of the GPU as a resource inside Kubernetes. In this case, we're exposing a MIG partition with five gigabytes of memory; you can run the container on this MIG partition or on whatever MIG device type you want, and all of this is customizable. There's another tool that allows you to do that. In terms of resources, you can have a mixture of different MIG devices, or the full GPU, advertised under the same underlying node with this mixed strategy, so it provides quite a fair bit of flexibility.
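Putting the two halves together as a sketch, with an illustrative image tag; the Kubernetes resource name assumes the device plugin is running with the mixed strategy just described:

```
# Run a container on two MIG slices of GPU 0, using <gpu>:<instance> indices
docker run --rm --gpus '"device=0:0,0:1"' nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi -L

# With the device plugin in mixed strategy, request a MIG slice by profile
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: mig-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.4.2-base-ubuntu20.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1
EOF
```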
This is in contrast to what we call the single strategy, where we provide the option to expose the entire GPU; under the node selector we provide the labels, but you need to make sure the labels are defined such that the user can still get to the underlying MIG device they want. So basically you have two strategies for exposing MIG capabilities inside Kubernetes.

Now, you've seen that there are a lot of things you need to install and deploy to get GPU support inside the Docker runtime and inside Kubernetes. To streamline all that, we came up with a tool called the GPU operator. This is one of multiple products that our cloud-native team works on. Again, it's a tool that's available for free; I'll provide the links in the slides. What is the problem the GPU operator is trying to solve? It enables infrastructure teams to manage the lifecycle of GPUs when using Kubernetes, so there's no need to manage each node individually. Previously, the infrastructure team had to configure all those different pieces of software I mentioned. With the GPU operator, you can use the same CPU golden image on both the CPU nodes and the GPU nodes, and on top of that image the GPU operator adds everything needed for the GPU nodes. This allows customers to run GPU-accelerated applications on immutable operating systems. We're integrating this with Ubuntu, Red Hat, and CentOS, and we're working on a couple of others in the near future.

So what is the GPU operator? It's an open source package, delivered as a Helm chart; as I mentioned when talking about NGC, it's available under NGC, if you're interested, or you can download it separately from the NVIDIA GPU operator website. It includes a couple of components that sit on top of Kubernetes and expose services to Kubernetes. The first is GPU Feature Discovery, which labels the worker node based on the GPU specs so the customer can select, at a more granular level, the GPU resources the application requires. Then there's the NVIDIA driver, and as I said earlier, you need to make sure you have the proper driver installed. The Kubernetes device plugin advertises the GPUs to the Kubernetes scheduler. The container runtime, which here is not just the runtime but the entire NVIDIA Container Toolkit, of which the NVIDIA container runtime is a subcomponent. And finally the DCGM monitoring: DCGM stands for Data Center GPU Manager, a tool that lets you do GPU monitoring, and the DCGM monitoring component is basically yet another plugin that lets you monitor the GPUs inside Kubernetes and expose that to tools that aggregate what's happening inside your Kubernetes cluster, Prometheus or Grafana for example.

As for the GPU operator ecosystem, it works on basically any container platform: OpenShift, upstream Kubernetes, Anthos, Azure Stack Edge, Amazon EKS, VMware Tanzu. It supports multiple container runtimes and multiple operating systems; for a full list of supported software and hardware, I highly encourage you to go to the link here. If we dive a little deeper inside the GPU operator, it performs basically all the steps I mentioned previously.
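Kicking all of that off is a single Helm command. A minimal sketch, assuming Helm 3 and kubectl access to the cluster; the namespace name is just a common choice:

```
# Register NGC's Helm repository (if not already added) and install
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

# --wait blocks until the operator's components are deployed
helm install --wait gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace

# The operator then rolls out the driver, container toolkit, device plugin,
# GPU Feature Discovery, DCGM exporter, and MIG manager pods on GPU nodes
kubectl get pods -n gpu-operator
```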
In the interest of time, because it looks like I'm running a little behind: when you do the GPU operator installation, it goes through multiple steps, from installing and starting the NFD (Node Feature Discovery) service as a DaemonSet and configuring the NFD plugin, which enables the labeling; then it installs the NVIDIA container runtime and the driver; then it goes through a validation process, where we have a container that validates the installation; and then it installs the device plugin, the DCGM exporter, which provides the DCGM monitoring capabilities, and the MIG manager, which exposes the MIG capabilities. All of this is installed very seamlessly through a Helm chart, and everything gets set up on the GPU node. It lets you start quickly from day zero, with everything you need and all the capabilities installed automatically by the GPU operator. In case there's a need to rebuild any of the images, it can reinstall an updated driver, and it can either install pre-compiled drivers or compile the GPU drivers if the kernel changes, so there's a fair bit of capability when you go with the GPU operator.

The MIG manager, as I said, provides all the capabilities for MIG discovery, and you really need to make sure you have it if you want to use MIG capabilities on an infrastructure with A100 nodes. The MIG manager actually relies on another tool: it's basically a wrapper for the MIG partition editor, mig-parted, which is available on GitHub. So even if you're not using the GPU operator, but you want to run, for example, a bare metal solution without Kubernetes or anything, and you want an easy way to do the MIG configuration and partitioning, I highly encourage you to take a look at the MIG partition editor to get these capabilities even outside Kubernetes; I'll show a quick sketch of it in a moment.

And now the last few slides I wanted to mention. Everything I presented previously is open source and available for free, and it's mostly a bare metal solution. Obviously, we support all of this in virtualized environments too, and specifically NVIDIA has a partnership with VMware: we're working together to transform the data center and address the challenges of bringing AI support to enterprises. I know this is mostly a research community, but if you need enterprise capabilities, we came up with this partnership with VMware, called NVIDIA AI Enterprise, where we integrate three main components: the accelerated servers, which are NVIDIA-Certified servers; VMware vSphere with Tanzu, which is VMware's own implementation of Kubernetes; and the NVIDIA AI Enterprise software suite. The NVIDIA AI Enterprise suite is basically a subset of NGC that is curated and tested to run on top of VMware vSphere with Tanzu and to be compatible with the OEM servers. So if you're interested in something that provides full support and is enterprise-grade, I highly encourage you to take a look at the NVIDIA AI Enterprise suite. Without going into the specifics, we are working with VMware to have all the containers available inside the NVIDIA AI Enterprise suite run seamlessly on top of Tanzu, which, as I said, is the Kubernetes implementation from VMware.
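Here is that quick sketch of standalone mig-parted usage, following the configuration format from the tool's GitHub README, with an illustrative profile layout:

```
# A minimal mig-parted configuration file (profile name and counts illustrative)
cat <<'EOF' > mig-config.yaml
version: v1
mig-configs:
  all-1g.5gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7
EOF

# Apply the named configuration to every GPU on the node
sudo nvidia-mig-parted apply -f mig-config.yaml -c all-1g.5gb
```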
One thing to note about NVIDIA AI Enterprise: due to the tight integration we're doing both in hardware and in software, you see close to bare metal performance when using the NVIDIA AI Enterprise suite. So again, if you have this kind of environment where you're running VMware and you want integration with VMware, I highly encourage you to take a look at it.

With that, these are a bunch of links for the GPU operator; from here you can go to the subsections in the documentation and check the details of all the other components that are part of the GPU operator. And Mathias actually asked me to add a slide on the different products on which you can run containers with GPU support. Here is a list of everything available right now from NVIDIA from a data center perspective. All the applications I mentioned previously are only supported on data-center-grade GPUs. I mentioned multiple times the A100 and the A30, which are mainly targeted at compute. We just launched the A2, a small-footprint GPU targeting mostly edge AI and inferencing, and then there's a range of GPUs for professional graphics and virtualized desktops. They have different capabilities; again, I'm not going to go into all the details of the different GPUs. If you're interested in how to choose a specific GPU, just reach out to me and we can have a further discussion. But from the Kubernetes perspective and the support we have in the GPU operator, you can take a look at the support matrix, and you will see that all these GPUs are currently supported by the GPU operator. Especially if you have one of the latest GPUs, from the latest series based on the Ampere architecture, all of these are fully supported inside the GPU operator, and the A100 and A30 are the only ones that support Multi-Instance GPU. Again, these are fully supported under Kubernetes. With that, that brings me to the end of my presentation.