All right. So welcome everyone. Our next talk is about GPU workloads on OpenShift. Our speaker is Eduardo Arango, an engineer on the OpenShift engineering team. So take it away.

Thank you. So yeah, my name is Eduardo Arango. I live in Colombia, so the name is hard to pronounce in a different language. I've been working on the OpenShift PSAP team for almost 10 or 11 months now. The main responsibility of our team is hardware accelerators and special resources, hardware with specific needs: RDMA, GPUs, specialized CPUs, NUMA-aware scheduling, things like that. We are the team that enables OpenShift to perform for the most difficult and performance-sensitive applications and workloads.

So first, I want to introduce runtime hooks. Before we get into running GPUs on OpenShift and Kubernetes and how we do it, I want to cover the basics and the questions that I get from forums and issues. I want to be very broad here and cover everything from the runtime up to how we update the operator itself. First, I want everyone to understand that it's not really magic, what we are doing here. The basics of running GPUs, or pretty much any special hardware, on Kubernetes and OpenShift really go back to containers and to using OCI hooks. We are teaching the OCI runtime, through hooks, how to mount files, how to configure cgroups, how to do specialized things for us. In the case of hardware, we are teaching the container how to mount the drivers, libraries, and other things that we need inside the container, so my application can interact with the driver that lives outside the container.

OCI hooks can be separated into three different kinds: the pre-start hook, the post-start hook, and the post-stop hook. From the names you can pretty much guess what they do. The pre-start hook is a small script that you set up to run before the container starts, the post-start runs right after it starts, and the post-stop is mostly a cleanup script that you can set up for your container.

For the GPUs that are the topic of this talk, we are using these OCI hooks to bind mount the devices, binaries, and libraries that are needed to run any GPU application for NVIDIA. This talk is going to be very focused on NVIDIA, not AMD. I know the title says GPUs on OpenShift, but if you have run GPUs in the past, you know that for NVIDIA you always need to be aware of the CUDA libraries and things like that, whereas AMD is a different story. So for this talk I'm going to focus on NVIDIA GPUs only. And because it's NVIDIA GPUs only, this is why we need to mount devices, binaries, and libraries like the CUDA libraries inside the container: so if I'm running any AI/ML application that needs those libraries at runtime, it can find them inside the container and not crash.
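To make the hook part concrete, here is a rough sketch of what a prestart hook registration can look like. On a real host this is a JSON file dropped into the runtime's hooks directory (for CRI-O, something like /usr/share/containers/oci/hooks.d); it is rendered as YAML here so it can carry comments, and the binary path and arguments are assumptions on my part rather than the exact file the operator ships.

```yaml
# Hypothetical prestart hook registration (normally a JSON file under
# /usr/share/containers/oci/hooks.d). Paths and args are assumptions.
version: "1.0.0"
hook:
  # Binary that bind mounts the device nodes, driver, and CUDA user-space
  # libraries into the container before its entrypoint runs.
  path: /usr/bin/nvidia-container-toolkit
  args: ["nvidia-container-toolkit", "prestart"]
when:
  always: true        # apply to every container started on this GPU node
stages: ["prestart"]  # run before the container starts
```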
Another question that we get a lot, because of the slide before, is: you are bind mounting files that have special permissions on the host into a container that is probably going to be running as root. If you're leveraging UBI images you are adding a little bit of security there, but if not: we, in collaboration with NVIDIA, wrote an SELinux policy for this specific case of GPUs on OpenShift, so you can be sure that when you are running the GPU operator you are safe. How we do that is by applying a specialized SELinux policy that allows you to share the host files, the ones shown here in blue, into the container. The container can use those files while the files still live on the host. It's a bit like when you run a container and bind mount with the :z suffix at the end; this is a very specialized use case of that. The applications inside the container can use these libraries that have special permissions on the host, that are owned by root and so on, without SELinux yelling at you, saying that it's not secure and you cannot do it.

So, GPUs on OpenShift: how do we start? This is how a very vanilla OpenShift deployment looks. You have some master nodes and some worker nodes, and by using machine configs and the autoscaler you can add a GPU node very easily. There are several blog posts on how to do that. What really happens is you grab one of your worker nodes' machine configs and you change the cloud provider instance type. Let's say you're running on AWS or Google Cloud: you change the CPU-only instance type to a machine type that you know comes with a GPU.

But what happens after this? OpenShift will just attach an extra node that you know has a GPU, but OpenShift doesn't. And that's where this whole presentation comes in: OpenShift doesn't know that that node has a GPU. It will only use the CPUs on that node, and you need to do a lot of things on top of that to make OpenShift realize that this is a GPU-enabled machine.

But first I want to talk about Kubernetes operators. There are a lot of talks about Kubernetes operators here at DevConf, so this is just a five-second slide; you can go and see other, more specialized talks about them. What an operator really is: we're going to be performing a lot of actions, one by one, step by step. I'm saying that OpenShift doesn't know your node has a GPU and you need to teach it, and that sounds like a lot of manual steps. Even doing it with YAML, applying one YAML after the other, feels very manual. An operator is a way of packaging all that operating knowledge that you have for a certain application into a binary or an Ansible playbook; you can have operators as a Go binary or as an Ansible playbook. You inject all this knowledge of: run this YAML first; once this YAML is OK, run that YAML; and if that YAML fails, do this instead. You are putting all your knowledge of how to run your application into something we call a Kubernetes operator. The nice thing, which we are going to see in this presentation, is that you can even teach your operator to update, or keep updated, your application. It can watch some hooks, some labels, or even the kernel version, as we are going to see here, to trigger actions inside your operator. So an operator is something really cool that lets you attach all your knowledge on top of Kubernetes and let it run on its own.

Here we are going to be using something called the special resource operator, and at the end we're going to see how NVIDIA uses that to build the NVIDIA GPU operator. First things first: let Kubernetes, let OpenShift, know that you have something there, that you have some features on a node. For that we deploy the Node Feature Discovery operator. What the Node Feature Discovery operator really does is deploy worker pods on every node.
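Going back to the node-addition step for a moment, here is a minimal sketch of that instance-type change, assuming AWS; the MachineSet name and the p3.2xlarge instance type are placeholders, and in practice you would copy a full existing worker MachineSet rather than this trimmed excerpt.

```yaml
# Hypothetical MachineSet excerpt: copy an existing CPU-only worker
# MachineSet and swap its instance type for one that ships with a GPU.
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: mycluster-gpu-worker-us-east-1a   # placeholder name
spec:
  replicas: 1
  template:
    spec:
      providerSpec:
        value:
          # AWS example: p3.2xlarge carries an NVIDIA GPU; a CPU-only
          # worker would use something like m5.xlarge here instead.
          instanceType: p3.2xlarge
```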
So those NFD worker pods are going to read all the features that are on the node and reach back to the Kubernetes master and say: please give this node these labels. It is going to create a plethora of labels, starting with network labels, so it will tell you if you have specialized networking cards, specialized CPUs, specialized GPUs, the kernel version, the OS version. It creates a lot of labels that really help you, and help Kubernetes and OpenShift, to better schedule and to better know the state of the cluster at the hardware level.

Then comes the SRO, the special resource operator. As I said, this is like one YAML after another, so think of every slide as a YAML that you are applying: you wait for that YAML to become ready, running, or complete, and then you run the next YAML, the next slide. And all of this is baked into the operator; that's how cool it is. OpenShift is going to realize: we have a node that carries the feature.node.kubernetes.io/pci-10de label. That PCI vendor ID, 10de, tells us there is a card on the PCI bus that we know is NVIDIA, so we know we have a GPU on that node. So it is going to deploy the GPU driver DaemonSet. Once we are good with that, it is going to do what I was describing at the beginning, the runtime hook: it configures the runtime hooks on this node, so every time you run any container on this node, the OCI runtime hook bind mounts the NVIDIA libraries into the container. Once that is ready, it is going to report how many GPUs you have, whether the GPU is healthy, whether the GPU has a good temperature, everything you need to know at the hardware level, and it reaches back to the operator.

Once all of this is ready, the GPU Feature Discovery DaemonSet is going to create more labels on top of that node, labels that look like nvidia.com/gpu.family and gpu.memory. It creates labels that allow you, as a user or as a developer of AI/ML-type applications, to say: I have this type of GPU card, and I know this type of card is better for inference, this card is better for training my model XYZ, and this card is better just for development. These labels let you, as a developer, decide where you need to deploy your applications, knowing there is a GPU there.

Once all of that is ready, the special resource operator then deploys a number of pods that give us better usage and maintenance of the node that has a GPU. It deploys a lot of pods, so don't worry: the GPU driver DaemonSet, the CRI-O plugin DaemonSet, the device plugin DaemonSet, the node exporter DaemonSet, and the feature discovery. Alongside these pods, it deploys two extra pods that we call the validation pods. What they really do is a very small mathematical operation, we call it the vector add: some vector math, but running on the GPU. What that is really testing is that everything is in place, working, and running before proceeding to inform OpenShift and the user that the operator has been deployed successfully and you are good to go to run the applications that need a GPU. If these test pods don't pass, the operator is going to fail and tell you: I'm failing, I cannot run even a very small vector math operation, something might be wrong here. Then, once all this is done, the last step is that it creates some Prometheus metrics and some Grafana dashboards for us.
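Just to illustrate the kind of labels that end up on the node after NFD and GPU Feature Discovery have run, here is a sketch; the node name and the label values are made-up examples, not taken from a real cluster.

```yaml
# Illustrative labels on a GPU node (values are examples only).
apiVersion: v1
kind: Node
metadata:
  name: gpu-worker-0                                     # placeholder
  labels:
    feature.node.kubernetes.io/pci-10de.present: "true"  # 10de = NVIDIA PCI vendor ID
    nvidia.com/gpu.count: "1"
    nvidia.com/gpu.family: "turing"                      # card family, e.g. a Tesla T4
    nvidia.com/gpu.memory: "15360"                       # GPU memory in MiB
    nvidia.com/gpu.product: "Tesla-T4"
```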
With those dashboards you can go and check your GPU usage, your GPU temperature, your GPU memory allocation. It creates, as I remember, something like four to six graphs, so you can monitor that GPU and know who is using it, whether it is being used fully, whether the GPU is thermal throttling, things like that. You can do all of that with the Prometheus and Grafana pieces that we export as the last step. And this step will not happen if the step on the slide before did not complete successfully, if it returned one instead of zero.

Then, if I want to deploy my application, or I want to build a container (you can run or build, depending on what you want to do with OpenShift), I just request that it comes with a GPU. You can use node placement or you can request resources; you can do it both ways. You can use a node selector with the labels that were created in the previous steps, or you can tell OpenShift: I know I have a GPU resource, please schedule my pod on a GPU resource. And if you have a very specialized node with more than one GPU, you can also ask for that in your resources, just as we always do when requesting memory and CPU; you add this special resource saying: please also give me a GPU.

The NVIDIA GPU operator: how does this work? Everything I've been showing is the upstream project that we manage in my team, called the special resource operator. But this is really a template operator that we maintain, so other vendors like NVIDIA can come, fork that repo on GitHub, and bake their own special knowledge into their operator. The NVIDIA GPU operator that you can find on OperatorHub, which is available by default in your OpenShift or OKD install, is really a fork of everything I just described, but with more NVIDIA knowledge behind it. The special resource operator is something designed by OpenShift engineers thinking mostly about supporting and maintaining OpenShift; NVIDIA forks it, leverages that OpenShift knowledge, and bakes in more NVIDIA knowledge. So you're going to get the same feeling as every step I described in these slides, but at the end it runs some NVIDIA software, some NVIDIA lines, that are different from what I said before. This is a high-level overview of how it works, and it's really the same: it adds a runtime hook, it runs a DaemonSet to verify your runtime hook is set, it runs a DaemonSet to check whether you have other GPU nodes in your cluster, and then it rolls out all the other parts. It's pretty much the same.

Then let's talk about operators a little bit more. As I was saying, the nice thing about operators is that they allow us to bake all this knowledge into a binary that runs as a pod, and they allow us to sleep in peace, trusting that everything keeps working for us.
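As an aside, installing the NVIDIA GPU operator from OperatorHub ultimately boils down to an OLM Subscription like the sketch below; the channel, namespace, and catalog names here are assumptions on my part, and in practice the OpenShift console generates this object for you when you click install.

```yaml
# Hypothetical OLM Subscription for the GPU operator from OperatorHub.
# Channel, namespace, and catalog source are assumptions; the web console
# normally creates this for you when you install from the UI.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: openshift-operators          # placeholder namespace
spec:
  name: gpu-operator-certified            # package name in the catalog
  channel: stable                         # placeholder channel
  source: certified-operators             # placeholder catalog source
  sourceNamespace: openshift-marketplace
```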
Now, let's say someone goes and updates your Kubernetes, your OpenShift version, from 4.5 to 4.6. That is going to change the kernel. And if you have run GPUs before, even on your own machine at home, you know that when you update your kernel, sometimes you also need to update your CUDA libraries or something on the NVIDIA side. The special resource operator, and also the NVIDIA GPU operator, is going to realize that something has changed. First, the NFD worker pod that runs as part of the NFD worker DaemonSet notices that the node now has a different kernel from the one recorded in its labels. It is going to say: okay, this node is labeled with kernel 4.something, but I now see 4.something-plus-one. So it informs the Kubernetes master to update the labels on that node. When we get this update on the labels, the special resource operator triggers a build using OpenShift's built-in BuildConfigs, so you can build your containers on the fly inside your Kubernetes cluster without having to have any CI/CD installed, like Tekton pipelines or anything like that. We are using native OpenShift resources, in this case the BuildConfig. So once the special resource operator notices that there has been an update to the kernel version, it triggers a BuildConfig to build a new container for the new kernel, it stores that in the internal image registry, and once it knows the container is in the image registry, it triggers an update and restarts the containers so everything moves to the new kernel version.

So here are some links. First, I was talking about the NFD operator. For the special resource operator to work, it needs to know the hardware on each and every node, and it doesn't do that itself, because we don't want one operator doing too many things. The NFD operator is an operator specialized in exactly that: going onto the nodes, discovering what hardware you have and what type it is, and labeling all the nodes for you. There's also a blog post on how to leverage that, together with OpenShift BuildConfigs, to build the same container for the different types of hardware in your cluster. And here is the special resource operator, which is the main point of the talk, and the GPU operator, so you can see they are a little bit different.

And thank you. I hope I was slow enough; I tend to speak very fast. But I really wanted to give this high-level overview of what the GPU operator is and how we are running it.

That's really cool stuff, Eduardo. Could you also drop the links for those in chat, just so people who want to access them can take a look? This is actually something that's very close to me, because this is some work I do as well, building a sort of platform for our data scientists to use GPUs, and a question that comes up often is multi-GPU support. I saw you mentioned that specialized hardware with multiple GPUs on the same node is possible. Have you folks done any research or work into GPUs across multiple nodes, being able to utilize GPUs split across different machines, if that makes sense?

So what you can have is a very fat node, like the NVIDIA DGX, right? The very expensive ones that come with three, five, ten GPUs in the same node, depending on the cost. Let me go back some slides. You're not sharing your screen. Yeah, I'm working on that. Here. On this slide I'm showing, in my pod spec, you can request nvidia.com/gpu. If you have more than one GPU card, talking about the physical cards in that node, you can request two or three. But remember, as with everything, if you request two and you only have one GPU in that node, the pod is going to stay pending forever, never deployed.
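Here is a minimal sketch of the kind of pod spec being described, requesting one GPU through the device plugin resource and optionally steering placement with the labels from earlier; the pod name, the sample image, and the nodeSelector value are placeholders.

```yaml
# Minimal pod sketch requesting a GPU. Names, image, and label values are
# placeholders; requesting more GPUs than a node has leaves the pod Pending.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd                           # placeholder name
spec:
  nodeSelector:
    nvidia.com/gpu.family: "turing"              # optional: pick a card family
  containers:
  - name: cuda-vectoradd
    image: nvidia/samples:vectoradd-cuda11.2.1   # assumed sample image
    resources:
      limits:
        nvidia.com/gpu: 1                        # ask the scheduler for one GPU
  restartPolicy: Never
```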
Yeah, I wonder if there's some scope for libraries or tools that can sort of sit in between, that allow you to interface with multiple GPUs even when you don't have nodes with enough GPUs on them, right? Say I have six different nodes, each with one GPU available, for whatever reason that's my topology, and I have a workload coming in that needs multiple GPUs.

You would be talking about the MPI operator; again, operators. You can look into the MPI operator, which is part of Open Data Hub, also a Red Hat project, and it will allow you to do training and inference on top of distributed GPUs.

Got it, very cool, very cool. I'm not seeing too many other questions coming in, so I think we can continue the conversation in the breakout rooms if people have more questions. Thank you again so much for your time. This was, I keep saying the same thing again, really cool stuff, and I'm really excited by a lot of this.

Yeah, I really like talking about operators, so people can understand that everything I said in this whole presentation is baked into a binary, and that binary knows how to do everything for us. It's a very powerful feature on OpenShift.

Thank you, Eduardo. Have a great day.