My name is Kevin Jones. I am an NVIDIA product manager. Today I want to walk through with you the GPU and network operators: a quick, 15-minute news flash on the current state of affairs, where they are today, how they're constructed, and how you can use them on top of your Kubernetes platforms to take advantage of the hardware acceleration of GPUs and SmartNICs.

First, before we get to the GPU and network operators, I want to talk about the platform I specifically work on, which is EGX. EGX is a cloud-native platform for scale-out acceleration. You can see in the visualization here that it goes from the certified hardware systems at the bottom layer up through the EGX stack itself, which is a common application platform built on Linux and Kubernetes. You'll also notice the network operator and GPU operator at this layer; those are the ones we're going to walk through today. Above that, you have the CUDA libraries, drivers, and SDKs. And you can also see that NVIDIA is working on different frameworks to make smart systems more available and easier to develop applications for: Metropolis for smart cities and retail, Clara for healthcare, Isaac for robotics, and Aerial for the telco work being done on 5G today.

You may also have noticed that there are different product lines at NVIDIA with the GX moniker at the end, so I'll explain those quickly. The EGX line is about small-scale, system-on-chip, embedded, manufacturing- and robotics-type systems. The DGX line is about scale-up systems (remember, EGX is about scale-out), so a lot of our recent supercomputer work with the DGX SuperPODs is built on those systems. The HGX line is about cloud service providers, allowing them to offer instances with NVIDIA GPUs in them to their end users.
And then EGX is our scale-out platform, which I've been discussing here today.

Now I'd like to switch gears and start talking about the operators. But before we dig into each of the NVIDIA operators, I really want to touch on the Operator Framework. A while back, Red Hat acquired a company named CoreOS. CoreOS made many contributions to the Kubernetes ecosystem, and one of the major ones was the Operator Framework. Red Hat has continued to work on this really capable framework for deploying Kubernetes-native applications. It's a pattern for how you accomplish this, and there are different ways to write your operators: they can be Helm-based, they can be Ansible-based, and they can even be Go-based if you get really complex. Those approaches range from simplicity in how you build them all the way up to very complex application capabilities. The Operator Framework lets you handle deployment, updates and upgrades of your software stack, and complex tying-together of different components, and it also lets you tie monitoring into your applications.

NVIDIA chose this as a great mechanism to distribute and deploy the software that enables both its GPU accelerators and SmartNICs on Kubernetes platforms. When the two operators are deployed together, they automatically enable capabilities like GPUDirect RDMA. If you have ever configured these things by hand with GPUs and SmartNICs, you know there is quite a bit of work, and almost black magic, involved in getting everything set up the way you need it. By doing this inside the Operator Framework, we've also made it possible to deploy these capabilities onto many of our partners' Kubernetes offerings.

With all that foundational knowledge out of the way, let's dig into the GPU operator. We approached this in three parts.
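The reconcile pattern at the heart of the Operator Framework can be sketched in a few lines of plain Python. This is a generic illustration of the control-loop idea, not NVIDIA's or Red Hat's actual code; all names here are made up:

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Compute the actions needed to move `actual` toward `desired`.

    An operator runs this loop continuously: observe cluster state,
    diff it against the spec in a custom resource, then act.
    """
    actions = []
    # Create or update anything the spec wants that is missing or stale.
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name, spec))
        elif actual[name] != spec:
            actions.append(("update", name, spec))
    # Garbage-collect anything running that the spec no longer wants.
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Example: the custom resource asks for a driver and a device plugin,
# but only an outdated driver is currently running.
desired = {"driver": {"version": "450"}, "device-plugin": {"version": "1"}}
actual = {"driver": {"version": "440"}}
print(reconcile(desired, actual))
```

Everything the framework adds on top (CRD watches, retry with backoff, status reporting) is machinery around this same diff-and-act loop.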
The first was to have our container runtime and driver installed on the host and then plugged into Kubernetes. We then containerized the device plugin and the Data Center GPU Manager monitoring component, the DCGM exporter. With that set, we had the core functionality we needed. The next piece was to containerize the runtime and the NVIDIA drivers themselves, so that the entire component tree runs as pods on top of Kubernetes; this makes a GPU node look just like a CPU node. The last step was to wrap all of this with an operator so we can manage the lifecycle of these components, which means you don't have to worry about all the different variations of configuration and versioning. This is the NVIDIA GPU operator.

Let's talk in depth about each of these components. First is our container toolkit, which enables GPU support for various container runtimes, including Docker, CRI-O, Podman, LXC, Singularity, and others. We integrated it with Linux container internals rather than wrapping specific runtimes, and the toolkit exposes the GPUs and drivers to containers.

Next are the drivers themselves. This is the component most administrators are aware of, whether they have containerized NVIDIA GPU drivers or run them on standard hosts. The driver is the most critical component for exposing all the capabilities of your underlying GPU to your application layer, and the goal here is to simplify provisioning of the NVIDIA driver with our operator.

Next is the device plugin. This is a really important part because it allows applications to request GPUs as resources via their pod specs. In our example here, you can see this GPU example application requesting one NVIDIA GPU, and that's what Kubernetes will expose to the application.
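A request like the one just described looks roughly like this in a pod spec. The `nvidia.com/gpu` resource name is what the device plugin advertises; the image name here is just a placeholder:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
  - name: cuda-app
    image: nvcr.io/nvidia/cuda:11.0-base   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1   # request one GPU from the device plugin
```

The Kubernetes scheduler treats `nvidia.com/gpu` like any other countable resource, so the pod only lands on a node where the device plugin has GPUs to hand out.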
And I mentioned our Data Center GPU Manager exporter, the DCGM exporter, which takes the telemetry being fed back from the GPUs and exposes it to Prometheus. This gives cluster administrators visibility into their GPU telemetry in much the same way they get visibility into the CPUs running in the cluster, so it's a really great way to take advantage of the native tooling Kubernetes has settled on and feed it with our GPU telemetry. With all of these core components put together and simplified in the operator, we have a great way to expose our GPUs and get the best out of them in our Kubernetes clusters.

So next, let's talk about the roadmap for the GPU operator. We're working on a number of new features, which are listed on the side here. There's upgrade management, where we handle driver and kernel updates and handle node reboots if we need to; those areas improve with each release of the GPU operator. We're also working on disconnected and air-gapped installations: whether you're restricted by a proxy from reaching certain resources, or you have no internet connection at all and have to pull everything from custom registries, those environments are very difficult to work in, and we're trying to make the GPU operator easier to use there. We're also working on security capabilities, like more granular RBAC controls via roles and bindings for the GPU operator. NVIDIA is very conscious of security across our hardware and software stack, and we want to make sure customers are getting the best and most secure capabilities they can.

If you haven't seen it yet, the A100 is a beautiful piece of hardware and the most impressive leap in GPU capabilities thus far.
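As a quick aside on the telemetry path from a moment ago: the DCGM exporter serves a standard Prometheus scrape endpoint. A sample of what the scraped output might look like — the metric names come from DCGM, but the exact label set shown is illustrative:

```text
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",modelName="A100-SXM4-40GB"} 87
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",modelName="A100-SXM4-40GB"} 16384
```

Because this is ordinary Prometheus exposition format, the same dashboards and alerting rules you use for CPU and memory apply directly to GPU telemetry.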
It uses the Ampere microarchitecture, and there's a capability in the A100 that doesn't exist on any other GPU called Multi-Instance GPU, or MIG. The A100 is physically capable of being sliced into seven unique slices of the hardware, which means you can actually have seven parallel processes running on one A100 card. We want to expose those MIG capabilities up via the GPU operator and take advantage of them to share this outstanding piece of hardware. And the last area we continue to work on is further integration with all the Kubernetes distributions on the market. We have great integration with distributions like Red Hat OpenShift, but we want to continue expanding our capabilities to integrate with others.

I've put up some useful links for you to follow up with on the GPU operator. The code is open source and available on GitHub at NVIDIA/gpu-operator, there is a getting-started document that I've linked to here, and you can also reach out to us with any questions you have.

Now I want to switch gears and start talking about our network operator. If you were not aware, NVIDIA acquired a company named Mellanox. In supercomputing, Mellanox has obviously made a very good name for themselves, and we want to continue that trend with Mellanox as NVIDIA's networking business unit. With the network operator, we've taken a very similar approach to the one we took with the GPU operator. The idea is to simplify things for system administrators running Kubernetes environments with Mellanox SmartNICs underneath them: simplify the configuration, and expose all the capabilities that make those pieces of hardware everything they are up to the applications in your Kubernetes clusters. So we chose the Operator Framework to leverage again, and we use custom resources to define what we need from a componentry standpoint.
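As a rough sketch of what such a custom resource looks like: the network operator defines a cluster-wide policy CRD. The kind and top-level fields below follow the operator's public repository at the time of this talk, but treat the exact group/version, field names, and image versions as illustrative rather than authoritative:

```yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:                       # containerized Mellanox OFED driver
    image: mofed
    repository: mellanox            # illustrative registry path
    version: 5.0-2.1.8.0            # illustrative version
  rdmaSharedDevicePlugin:           # shared RDMA device plugin
    image: k8s-rdma-shared-dev-plugin
    repository: mellanox
    version: latest
```

The operator watches this one cluster-scoped object and reconciles the per-node components from it, which is what makes the configuration feel cluster-level rather than per-machine.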
And the Operator Framework takes care of reconciling the system itself against the hardware and software that's configured, automating things like fast networking configuration, RDMA, and GPUDirect. We did all of this for the benefit of our end customers. We wanted to simplify the deployment experience so that complex network deployment tasks are taken care of for you by the network operator, and so it's portable and consistent across different Kubernetes platforms. We also wanted to give you operational efficiency gains, because the network is now managed at the cluster level rather than per individual system; the operator treats this as a cluster capability rather than individual system units that have to be configured one by one. We want to put network administration on autopilot. And last, we want to take advantage of the architecture itself by making the operator architecture-aware.

Here's how we got there. When we started, you had the legacy setup where the Mellanox OFED driver and the NVIDIA peer memory driver were configured on a Linux system by hand or by automation scripting. Then we containerized in the same way we did the device plugin for GPUs: we have the containerized Kubernetes RDMA shared device plugin. Multus is what really gives us the capability here, because Multus allows for multiple network interfaces on a given pod. The OFED driver and NVIDIA peer memory driver were exposed as plugins into Kubernetes. The last piece was to containerize both the Mellanox OFED driver and the NVIDIA peer memory driver, and we wrapped all of that with an operator for lifecycle management. What you get is the NVIDIA network operator.

So let's talk about the individual pieces themselves. The OFED driver container is what loads the Mellanox OFED driver into the kernel.
It's prebuilt for the distribution and kernel running on the host and deployed onto the nodes based on node labels; in both the GPU operator and the network operator, there are node feature discovery capabilities that go out and label each host with the hardware it has. We expose the container's root filesystem to the host to allow kernel module compilation against updated headers, and we load the kernel RDMA stack and the Mellanox driver stack on container start and unload them on container stop.

The RDMA shared device plugin is how you run RDMA workloads in Kubernetes. It's the way we let pods perform RDMA, exposing RDMA device files to containers in a shared manner. You can see in our example here that the pod is requesting a single RDMA device and limiting itself to that one device through its pod spec.

Next is the NVIDIA peer memory driver container, which compiles and loads our peer memory client driver into the kernel. It loads the nv_peer_mem module into the kernel and unloads it when the container exits. It's a really great capability for us to do all of this for administrators, fully automated via the operator's lifecycle management.

So where are we today? In December, we're looking at Helm deployment. We're using node feature discovery — that's the NFD you see there — to label RDMA nodes. And we're also working on secondary network deployment, so you actually have a secondary, RDMA-configured network that we can take advantage of in the cluster itself.

That's our 15-minute quick news flash on both the GPU and network operators from NVIDIA. I really appreciate your time today. I hope you find this information useful and that you explore the GPU operator and the network operator. Feel free to let us know if you have any issues or any feature requests you want to add.
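One last footnote on the RDMA shared device plugin example from a moment ago: such a request looks roughly like this in a pod spec. The resource name under `rdma/` is defined by the shared device plugin's own configuration, so the one shown here (and the image name) is illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-example
spec:
  containers:
  - name: rdma-app
    image: example/rdma-test        # placeholder image
    resources:
      limits:
        rdma/rdma_shared_device_a: 1   # one shared RDMA device
```

Structurally it mirrors the GPU request earlier: the device plugin advertises a named extended resource, and the pod claims it through `resources.limits`.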
You can find us in the upstream community working on these operator code bases. We really look forward to hearing from you.