But thank you so much for coming and doing this today. Okay, first introduce yourself and tell us a little bit about who you are and what you do at Red Hat and all the other good things.

All the good things. Okay, my name is Zvonko Kaiser. I'm team lead for the OpenShift PSAP team. The PSAP team is mainly responsible for enabling accelerators on OpenShift, and one prime example is the GPU. We have also enabled other accelerator cards, like Solarflare, and we are doing the same right now with Mellanox. In the process of enabling those accelerator cards, we developed an operator called the Special Resource Operator (SRO), which is the base for the NVIDIA GPU operator, for the Mellanox operator, and for some other hardware accelerators that use SRO to enable their hardware. Last week I spoke at the OpenShift Commons briefing about some details of how SRO works and how it fits with the NVIDIA GPU operator. Today's session is more hands-on: a demo of how to enable GPUs in OKD using the NVIDIA GPU operator and NFD.

Okay, I have here an OKD 4.5 cluster, pre-installed on AWS. We've seen a lot of installations today, so I don't think we need to see another one. We have three worker nodes and three master nodes, three CPU worker nodes, and now we need to add a GPU worker node. The easiest thing to do is to go to the openshift-machine-api project and look at the machine sets we have. We have machine sets for workers, and we can take one of them and just dump it to a YAML file. The only things we need to change in that machine set are the name, maybe the machine role and type labels, and, most importantly, the instance type. I have a pre-populated machine set here which I use for this demo, and we are using a G4 instance, which has a T4 GPU. On AWS we currently get either V100s or T4s. T4s are mainly used for inferencing and V100s for training, but because of the low cost and ease of deployment on AWS, we are using a G4 instance. We could change more settings, but the most important one is the instance type. So we just create our new... oh, I forgot to mention something. Another thing we should look at is replicas. I'm setting it to two, so when I instantiate this machine set we will get two GPU nodes, because later on I'm going to run a multi-node TensorFlow benchmark and I want two nodes already running. So we just create the machine set, and we see we have two desired, two current; they are not yet ready or available. While those GPU instances are being created on AWS, we are going to do another step.

For GPU enablement we have always targeted Red Hat CoreOS or Fedora CoreOS, an immutable, container-only system, to make the NVIDIA GPU operator work, and that's where we developed the notion of a driver container. A driver container is a delivery mechanism for out-of-tree drivers via a container. We are currently working with NVIDIA to add Fedora CoreOS support, but for now there are no pre-built containers available, so we are going to build our own driver container. I have a document. Can I use the chat function here to post a link to a document? Absolutely. A link so people can follow along and maybe add some comments to the document. This was also covered in the OpenShift Commons briefing last week. I already checked out the needed repositories, so I'm just going to change directories here. We have a folder for Fedora, and we can run a podman build with the right tag.
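(For reference, the machine set edit described above might look roughly like this. The name, zone and instance type are illustrative; the demo only says it is a G4 instance with a T4 GPU, and the full providerSpec is carried over unchanged from the existing worker machine set.)

    oc -n openshift-machine-api get machinesets
    oc -n openshift-machine-api get machineset <existing-worker> -o yaml > gpu-machineset.yaml

    # Fields changed in the copied file (fragment, not a complete MachineSet):
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    metadata:
      name: <cluster-id>-gpu-worker-<zone>   # new, unique name
      namespace: openshift-machine-api
    spec:
      replicas: 2                            # two GPU nodes for the multi-node benchmark
      template:
        spec:
          providerSpec:
            value:
              instanceType: g4dn.xlarge      # a G4 instance type with an NVIDIA T4

    oc create -f gpu-machineset.yaml
    oc -n openshift-machine-api get machinesets    # should show 2 desired / 2 current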
It's all cached, so it doesn't actually rebuild, and we can push it to our private repository. I explain in detail in the document how to tag the driver container, how to use the repository plus the name, and how to instantiate the NVIDIA GPU operator later on with some settings to make it run. Let's take a look at our nodes. Okay, they are not yet up. When those two GPU nodes appear, we have a heterogeneous cluster. In the past, people were manually labeling nodes as GPU nodes or CPU nodes to steer the right containers to the right node. Since OpenShift 4.2, and we are currently working on adding it to OKD as well, we introduced NFD. NFD (Node Feature Discovery) is a project which exposes node features to the cluster, for example CPU flags, PCI devices and other hardware. So the first step is to bootstrap heterogeneity, so that the labels are applied to our cluster automatically, without manual intervention. We are currently working on adding it to the OperatorHub for OKD so that customers can install it with one click, but for now we have to check out the NFD operator and install it via make. We change the branch to release-4.5 and run make deploy with the pull policy set to always. What this will do is deploy NFD masters, which are responsible for labeling, and NFD workers that run on the worker nodes to detect the features. We just need to wait until the connections are made between the NFD master and the workers.

All things are running. We can have a look at our nodes. Okay, one of the GPU nodes is coming up, but let's first look at one of our CPU nodes: oc describe node. What we see is that all the labels with the prefix feature.node.kubernetes.io are coming from NFD. We have here the CPU flags feature. We use those flags for optimized workloads. For example, if you have math libraries that are optimized with AVX-512, you can steer your workloads to these nodes; otherwise you may get an illegal instruction error if you run on a CPU that doesn't have that flag. You can extract whether SELinux is enabled, and for the driver container to work we need the kernel version that is running and which operating system the node is on, for example the operating system release; it's Fedora 32. So by tagging the driver container in a specific way, we can steer drivers that are pre-built for exactly this kernel version and operating system to the right node. And PCI 1d0f, I don't know what this PCI vendor ID is, but let's have a look whether one of the GPU nodes is up: oc describe node. So this is one of my GPU nodes, and this is the other one, judging by the age. If we look at the GPU node, we are looking for something like pci-10de. 10DE is the hex PCI vendor ID of NVIDIA. So now we have all the pieces and the nodes are labeled correctly, so that we can steer our NVIDIA stack onto the right nodes.

One thing we need to do: we currently have a wrong CRI-O config in this version. There's an upstream fix for that, so we need to reload CRI-O with a corrected config. That's what we're going to do now, to enable the hooks directory. The way GPUs are enabled in a container is that NVIDIA has written a prestart hook. A prestart hook is called during the lifecycle of a container and is executed just before the container's command is run. It mounts all the needed libraries, binaries and devices into the container.
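(For reference, a rough sketch of the driver container build and the NFD deployment and label check described above. The registry, image tag, checkout paths and the make variable are illustrative, not the exact ones used in the demo.)

    # Build and push the Fedora CoreOS driver container (paths and tags are placeholders)
    cd driver-container/fedora
    podman build -t <registry>/<org>/nvidia-driver:fedora32 .
    podman push <registry>/<org>/nvidia-driver:fedora32

    # Deploy NFD from the cluster-nfd-operator checkout, release-4.5 branch
    git checkout release-4.5
    IMAGE_PULL_POLICY=Always make deploy     # variable name is illustrative

    # Check the labels NFD applied; 0x10de is NVIDIA's PCI vendor ID
    oc describe node <node> | grep feature.node.kubernetes.io
    oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true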
And since we are on a non-writable system here, we have no chance to write to /usr, so we need to change the hooks directory to /etc, which is where the prestart hook gets installed. We need to look for the hooks dir setting. This is the wrong hooks dir, because we cannot write to /usr; we change it to /etc, write and quit, and reload CRI-O with systemctl. And on our other node we change the same setting for the prestart hook. So we prepared the hosts so that prestart hooks are placed in the right directory, and CRI-O is now able to pick up the prestart hook. Otherwise CRI-O wouldn't find the hook in /etc, because it would only look in /usr.

The hooks are prepared, we've built the driver container, we labeled the nodes, and NFD is ready. The next step is to install the NVIDIA GPU operator. We are currently doing this via Helm. We are also working on adding it to OperatorHub so it's in OKD; later on it would just be a click in OperatorHub and instantiating a ClusterPolicy with the right settings, without too much fiddling around. But for now we have to do it via Helm. We create a project for the GPU operator where we can keep our stuff, and then helm install. There are a lot of settings we are setting here, which usually are encoded in the CR when it's instantiated from OperatorHub. But since we are overriding the driver container, we need to add things like platform openshift true, the default runtime being CRI-O (the GPU operator also works on Docker and containerd), the driver repository we created before, a driver version and a toolkit version. And we are saying NFD enabled is false, because the NVIDIA GPU operator is also able to install NFD, but that would be the upstream NFD and not the tested, certified NFD that ships downstream with OpenShift. So let's run this. Okay, some leftovers from before; I was doing a dry run half an hour ago and forgot to clean up some of the pieces. This looks better. One, two, three. Okay. With oc get pods we see the GPU operator running here, and with oc get pods in gpu-operator-resources we see the complete stack starting to come up. The container toolkit will install the prestart hook. We have the driver daemon set, which is the driver container we just created. Let us switch the project. So this is the one we created before, instantiated with the right driver container, so we're running a Fedora 32 driver container. Then the device plugin is used for exposing the hardware to the cluster. After each of those steps we run a validation step. The NVIDIA driver validation just runs a small CUDA application, a CUDA vector add, allocating memory and doing some computations to verify that the driver is working. After the device plugin is deployed, we run a device plugin validation, allocating an extended resource and running the same CUDA vector add to verify that the device plugin and CUDA are still working. There is a custom node exporter for NVIDIA metrics; I will show the Prometheus integration for NVIDIA later on. And as said before, the toolkit is for the prestart hook. We have a bug in the NVIDIA device plugin daemon set: it's not restarting on error. This is from the early days when people were not using NFD and had the device plugin running as a daemon set on every node, so it would also run on CPU nodes, and NVIDIA decided to just sleep on an error. This should be fixed in the next release.
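(For reference, the CRI-O hooks directory fix and the Helm install described above might look roughly like this. The config path, chart reference and value keys are assumptions based on the description, not verified against that GPU operator release.)

    # On each GPU node: point CRI-O at a writable hooks directory and reload it
    sudo sed -i 's|/usr/share/containers/oci/hooks.d|/etc/containers/oci/hooks.d|' /etc/crio/crio.conf
    sudo systemctl reload crio

    # Install the GPU operator via Helm, overriding the driver image built earlier
    oc new-project gpu-operator
    helm install gpu-operator <nvidia-gpu-operator-chart> \
      --set platform.openshift=true \
      --set operator.defaultRuntime=crio \
      --set nfd.enabled=false \
      --set driver.repository=<registry>/<org> \
      --set driver.version=<driver-version> \
      --set toolkit.version=<toolkit-version>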
We can also just delete those pods to force a restart. The daemon sets are not running to completion; they are just sleeping here. First we should have checked that the NVIDIA driver daemon set is... oh, it's not yet ready. What the driver container is doing right now is building the kernel modules on the fly on the cluster. The earlier step was just to prepare the container with the source code and the toolset we need. The driver container will then figure out, on the cluster, which kernel version is running, install kernel-devel headers and the other tools it needs for building, and then build the drivers and the kernel modules on the fly on the cluster. Okay, one minute. Okay... 48, 99. So this is good. So let's look for the other driver daemon set, then oc get pods and oc logs -f. The device plugin validation is going to be restarted; it's completed. In the logs you can see that the test passed. We can wait for a restart of the NVIDIA driver validation, but we can also delete it to make it faster. The logs of the NVIDIA driver validation should also say the test passed. And the complete stack is done: the pods that should run to completion have completed, and all the other pods are running.

We can now look again at the nodes, oc describe node on one of the GPU nodes, just to have a look at the exposed extended resource. You see here we have a capacity of one GPU and one allocatable. So we are now ready to run a GPU workload with a pod allocating this extended resource. For this I have prepared an MPI workload which runs TensorFlow, distributed with Horovod. Pretty simple: we just need to deploy the MPI operator, which creates an MPIJob CRD, and then we can instantiate a CR with an MPI job. It's all documented and linked in the document I shared. Let's have a look. It is deployed. Okay, it's running. Next time I should switch the namespace. But as we see, we have here a launcher, which is the pod that launches the MPI job, and for each node we get an MPI worker pod running. Let's just wait for those images to be pulled; they're pretty big. Then we can take a look at the logs to see that they're actually using a GPU and doing some training.

What also works with GPUs is autoscaling. You can create an autoscaler, with min and max replicas, that references this very machine set we created here. If a pod comes in and the scheduler sees it is pending, the cluster autoscaler will kick in, create a GPU node on AWS, the NVIDIA GPU operator will take care of installing the NVIDIA stack, NFD will label it so the operator knows where to deploy all the pieces, and the workload can run. All the steps needed for this are linked in the document I just shared. Let's just wait. But maybe this would be a good pause to answer some questions, if any are coming up.

Actually, I think you're doing a pretty wonderful job of this, and not a lot of questions are coming in. However, the document that you shared with the notes, I think a few people are having some technical difficulties getting into it. I'm wondering if, after this demo, you can turn it into something that's more public than the Google Doc so that we can have access to it, maybe somewhere in the OKD repo or wherever Christian and the team think is best.
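(For reference, a pod requesting the extended resource and a machine autoscaler referencing the GPU machine set, as described above, might look roughly like this. The image, names and replica bounds are illustrative, and a ClusterAutoscaler resource is also needed; see the linked document for the full steps.)

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test
    spec:
      restartPolicy: Never
      containers:
      - name: cuda-sample
        image: <cuda-sample-image>        # e.g. a CUDA vector-add sample
        resources:
          limits:
            nvidia.com/gpu: 1             # the extended resource exposed by the device plugin
    ---
    apiVersion: autoscaling.openshift.io/v1beta1
    kind: MachineAutoscaler
    metadata:
      name: gpu-worker
      namespace: openshift-machine-api
    spec:
      minReplicas: 1
      maxReplicas: 4
      scaleTargetRef:
        apiVersion: machine.openshift.io/v1beta1
        kind: MachineSet
        name: <gpu-machineset-name>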
If there's somewhere more publicly available, I'm happy to share it and update it as we proceed and add all the parts I mentioned here that are still work in progress. So there's this great hackmd.io tool, which is like an Etherpad but uses Markdown the same way GitHub does, and you can actually store it in GitHub as well. You can put permissions on it similar to Google Docs, but it's more collaborative and more open. So maybe that's a good choice. Otherwise, a PR to the OKD repository would be super. Yeah, then I'll put it in HackMD; we can all hack on it, add to it and give feedback on it. But getting it into the docs eventually, how to configure GPUs and things like that, is I think a medium-term rather than long-term goal. As well. So we tend to do a lot of documentation by blogging, so we can clean some of that up. Yeah, on the Red Hat blog. Yeah, maybe I should do an OKD GPU blog, but since I'm covering a lot of the... yeah, maybe I should do it for OKD. There are also some documents on how SRO and the NVIDIA GPU operator work internally. If you look on the OpenShift blog for how to enable hardware accelerators, there are several posts describing what a driver container is and how to use entitled builds, because SRO was always a template for hardware vendors to enable their hardware, since most of the steps always repeat. You enable the hardware with the driver, which is a driver container, and you can deploy a driver container on Red Hat CoreOS, Fedora CoreOS, RHEL 7, RHEL 8, even on Ubuntu. So you don't have to fiddle around with the host, because we said from the beginning: don't touch the host at all, keep it in a container, because then it's far easier for an operator to pull those images or update the drivers, and in conjunction with NFD we have all the runtime information to easily tag or pull the right container into the cluster. After the driver container is deployed, we have the device plugin, which exposes the hardware. Then we have the node exporter, which exposes metrics. We also had a custom Grafana in SRO, which exposed a GPU dashboard, and NVIDIA is currently thinking of adding this as well, so that they have their own Grafana dashboard just for GPUs and the metrics they have, because there are a lot of metrics they can expose: NVLink metrics, the bandwidth from node to node if you're doing multi-node, and then all the GPU metrics and alerts, if a GPU is overheating, or if you're not using the GPU, because they cost a lot of money. So there is a lot that can still be added to the GPU operator, and we are working with NVIDIA to add the features we already had with SRO. Besides that, as I said, there's always this template repeating, and we've developed SRO in such a way that you can just templatize it. It's all templatized; you don't have to write any Go code to enable another accelerator. And that's what we've done with Mellanox, for example. To enable RDMA, the only thing we supplied was a custom CR and some manifests, and the operator takes care of the rest: ordering, state transitions, RBAC rules, and so on. So we are at the end: the TensorFlow run is here, a GPU workload run multi-node with MPI and Horovod distributed. And that's the end of my demo. Awesome. I'm happy to answer any other questions if they come up. Oh, I think... is it... I'm not sure whether... can you go into the OKD dashboard from here? Just a...
...and show on the screen that it is actually deployed, what is deployed exactly, what that means. And it's a tiny, tiny font that you have on that screen, so... Yeah, yeah, just a second. Yeah, that's okay. And you did make it look easy. So hopefully we can get this documentation up and accessible somewhere so that other people can test it out. A couple of people have tested it out. So Joseph Meyer, who is, I think... he's done that; he's gone through the document and made it run on his cluster. Our next speaker, I think, is Justin Pittman, coming on board next, and then Joseph will be on after that. Yeah. Yeah. Justin's going to attempt to do OKD with IPI, living dangerously on the edge. Yeah, I just wanted to say this has been a feature that has been requested a lot of times in the working group meetings, not only by Joseph but I think by others as well. So it's really great to see this works, because we weren't able to definitively say it works as long as nobody had tested it out. But that's super great to see. If we're supposed to be seeing your other terminal, we're not at the moment, Marco. Yeah, I'm just trying to log in to the console, having trouble. Let me just get this thing. We'll go to Monitoring, Metrics and query DCGM. DCGM is NVIDIA's data center monitoring stack, and we can run queries like GPU temperature and GPU utilization, for example. And we've seen here that we've done some work, so the metrics are working. Workloads, Pods, all projects; we were in gpu-operator-resources. There should be the TensorFlow benchmark launcher, and if we look at the logs, we see the same logs we saw before in my console: TensorFlow running. What else? I don't know what else to show to prove that it works. I think you've actually done something pretty awesome, and we're really grateful that you took the time today to come and join us. I know we'll probably hook you up with a lot more questions afterwards. I'll take your email and your contact information and put that out there on the working group, so if people have questions they can reach out to you directly or post them. Actually, that's probably a better question: if we have questions about NVIDIA GPUs and working with OpenShift and OKD, what's the best route for asking those questions? I'm on the Kubernetes Slack, and email is always a good way. I would say Slack on Kubernetes or email would be the fastest way to ask questions. And if there are other problems that are not related, I can steer people to the right contacts if something is missing or we need to do more work. But I'm tracking all the upstream work with NVIDIA; I'm the technical lead for GPUs on OpenShift, so for any requirement I'm happy to help get those things upstream or to have more features included. Perfect. And we're in talks right now for October to do an ML/AI OpenShift Commons gathering co-located with the GTC event, which is going to be virtual then. So we'll probably see a lot more from you and other folks doing interesting workloads here. So, you know, I look forward to all of that. And October should be a really interesting month, because we'll have both that GTC event and I think we're going to try to do something with Open Infrastructure around OpenStack and an OpenShift Commons gathering. So we'll be busy in October and probably in the ramp-up to it.
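(For reference, the kind of queries being run in the console's metrics page above. The metric names below are the ones exposed by NVIDIA's DCGM exporter and are an assumption here; the custom exporter in this stack may expose different names.)

    # GPU utilization (percent) per GPU
    DCGM_FI_DEV_GPU_UTIL

    # GPU temperature (degrees C) per GPU
    DCGM_FI_DEV_GPU_TEMP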
So I'll also share in the chat the link to the video from last week that you did, the deeper dive into this, and we'll get all of this up on our YouTube channel again in a playlist for today. Great work. I'm totally appreciative, as we all are, of all the efforts that go into the accelerator program. So thanks. I'm not seeing any more questions anywhere, so what I'm going to do is put Charro on the spot. We have about 20 spare minutes. Which demo, Charro, would you like to try? Rook-Ceph, or there was one other one, the MariaDB? Well, they need to go in order, because we have to have storage to provision in order to deploy a MariaDB Galera cluster. So we'll start with the Ceph operator and go from there. Awesome. We're never at a loss for things to demo on OKD. Right, this is probably going to be a little less polished, but we're going to do it anyway. And this, by the way, is what we do to all new employees: we just throw them into the fire pit and stoke it until they... Good luck. ...until they can demo anything. I have a cheat sheet here for the hot seat.

Okay, so for a lot of what you're going to see here, I pulled the configuration that we're going to deploy directly out of the Rook project, which you should be seeing on your screens right now. What you see me running here, I pulled from here specifically. Let me clear this out, if you can see the screen on the left. The common YAML file and the operator-openshift YAML file are pulled verbatim from this project; you can see the path here to get to them. And it's release 1.4, the latest release. I'm not quite brave enough to run directly out of master at this point. So, back to the cluster that we deployed earlier this morning. You can see it is fortunately still healthy. There are a few errors being thrown, but it tends to do that in my home lab, network latency and such. What we're going to do first concerns these three worker nodes that we deployed. What I didn't tell you when I deployed them was that I actually deployed them with an unused hard drive attached to each virtual machine. It's using a SATA bus, so it installed the operating system on sda, but there's an sdb sitting there that is not currently being used. What we're going to do is create a Ceph storage cluster to serve up block devices on these worker nodes. The first step is to label those nodes to give them a role of storage node, and I just applied that label to them. If I do a quick oc describe on one of those nodes, I can show you that it now has a role of storage node. So that's step one: we need something to tell Ceph what it's going to be working with. Step two is to deploy this common.yaml, which, as you can see, creates a whole lot of boilerplate that the Rook operator is going to need. One of the things it did was provision a namespace for us, the rook-ceph namespace, which currently is very uninteresting. We're getting ready to make it interesting by deploying the Rook operator. This will take a little bit to bootstrap itself; the operator image is pulling down and installing right now. When the operator is up, it's going to create some workloads, some pods, on each of the nodes that bear the label, so that it can discover the resources available on those nodes. This will take just a little bit to run. Okay, and there you see the three discover pods that are spinning up now.
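(For reference, the node labeling and the Rook boilerplate steps described above might look roughly like this. The label key and the file names follow the Rook release-1.4 examples and should be treated as illustrative.)

    # Label the worker nodes that carry the spare sdb disk (repeat per node)
    oc label node <worker-0> role=storage-node

    # Boilerplate and operator manifests, pulled verbatim from the Rook project (release-1.4)
    oc create -f common.yaml              # namespaces, CRDs, RBAC
    oc create -f operator-openshift.yaml  # the Rook-Ceph operator itself

    # Watch the operator and the discover pods come up
    oc -n rook-ceph get pods -w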
And while those are coming up, let me show you this a little bit bigger so you can see it on the screen. The cluster.yaml file is the thing that actually defines our particular Ceph cluster, and again, the Rook project has a boilerplate copy of this for you to take and modify for your own purposes. This is the version of Ceph we're going to be running, 15.2.4. We're going to have three monitor nodes running; I have set node affinity on those, so they're also going to be looking for the storage node role. I've assigned resources for the various components of Ceph, a limit and a request, just like you'd see in a typical deployment. And here is the piece of magic that tells it where to find the devices it's going to create the Ceph storage cluster on. And now our operator appears to be fully bootstrapped and up and running, so the next step is to go ahead and deploy our Ceph cluster on top of that. All right, now this is also going to take a little bit, and you're going to see a bunch of activity here as the operator provisions the cluster. There are the three monitor instances you just saw spin up. There are the CSI plug-ins. Here in a minute it will start actually dealing with those physical devices and formatting the storage for its own use. Okay, you see these OSD prepare pods; there are three of them. When they are done, they go into a completed state, and once you see that completed state, the Ceph cluster is up and ready for use. It looks like we're still waiting for one of the crash collectors to go into a ready state, but everything else at this point should be usable.

So to prove that it's usable, let's take our image registry and give it a persistent volume. I've created a storage class here that I'm going to apply. Its name is rook-ceph-block, and it uses the Rook-Ceph CSI plug-in we just deployed as the provisioner. So I will apply that to my cluster. Now if I flip back over here and go down to Storage, we should have a storage class, and indeed we do. Now let's create a persistent volume claim, 100 gigabytes, because I tend to use a lot of container images. We should have a persistent volume claim, and the key here, as you can see, is that it is now bound to an automatically provisioned persistent volume that our Ceph cluster kindly handed out for us. We can see that from the command line as well; there it is, 100 gigabytes. Now remember, if you were watching earlier when we deployed the OpenShift cluster, we gave our image registry an ephemeral volume. We need to remove that ephemeral volume before we give it the new volume. So, caveat here: any images that you had pushed in between, you're going to lose, because we're yanking away the storage. You would have lost them anyway, because it was an ephemeral volume. We're going to put our registry back into a Managed state, we're going to tell it to use the persistent volume claim, registry-pvc, and we're also changing the rollout strategy to Recreate. Because I created a ReadWriteOnce volume, the rollout strategy that comes by default is not going to work, because it would try to do a rolling deployment. I need it to tear down the first instance and then create a new instance, so that it doesn't try to violate the ReadWriteOnce policy. So we just patched it. If we log back into our cluster and look at the image registry, there we go: we have an OpenShift image registry that's creating and should be binding to that persistent volume claim. So I'll pause there. Any questions on that? I know that was pretty fast.
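(For reference, the persistent volume claim and the registry patch described above might look roughly like this; the PVC and storage class names are those mentioned in the demo as far as they were stated, otherwise placeholders.)

    # PVC against the new Rook-Ceph block storage class
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: registry-pvc
      namespace: openshift-image-registry
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi
      storageClassName: rook-ceph-block

    # Give the registry the PVC and a Recreate rollout, since a ReadWriteOnce
    # volume cannot back a rolling deployment
    oc patch configs.imageregistry.operator.openshift.io cluster --type merge \
      -p '{"spec":{"managementState":"Managed","rolloutStrategy":"Recreate","storage":{"pvc":{"claim":"registry-pvc"}}}}'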
I think you're doing pretty well here. All right, I will now deploy a MariaDB Galera cluster. It's going to be a stateful set. We're going to deploy this stateful set right here, which creates a three-node MariaDB Galera cluster from a customized image that we're going to build and push into the image registry we just gave storage to. The first thing we need to do is build our image, which is based on the official MariaDB repository; it's going to install 10.4. Our Dockerfile looks something like this, so we're going to do a few things to set up MariaDB. The real magic happens in this shell script that is run by the image when it starts up; it actually provisions the MariaDB cluster, detects whether or not a cluster already exists, whether this is the first node in the cluster, and so forth. I've got a short tutorial written up on this that you can read, so I won't go through all of it here; we'll just pick out the fun parts. The first thing I need to do is make sure I'm logged in and in the right cluster. Important safety tip: always make sure you're logged into the right cluster. I need to expose a route for the image registry. What I just did is patch the image registry operator to create a default route, so that I can reach my image registry externally. Then I'm going to use Podman and log into that image registry. It succeeded. Now I'm going to do a podman build and build our MariaDB image. You see I'm grabbing the route from the image registry to tag the image that I'm about to build, so that I can push it to the registry. It generally doesn't run that fast; I ran this a little bit ago to make sure it was going to work, and that's why the build went so quickly: the image had already been built, and all it did was add a tag to it. Now we'll push it to the registry. Now our OpenShift cluster, in its local cluster registry, has our customized MariaDB Galera image. Let's create a namespace for it and create a service account. We're going to create a service account for MariaDB. The reason is that MariaDB is picky about its UID; it's especially picky if it restarts and its UID has changed. It tends to get upset. So we're creating a service account, and we're actually going to run this privileged with the new service account, so that it can run as any UID. There is likely a better way to do this, and I am open to suggestions. Now I'm going to apply a config map that contains the MariaDB server.cnf file. By using a config map for this, I can modify the configuration of my cluster without having to build new images. I'm now going to apply a couple of services. One of them is a headless service that allows the cluster to talk to itself on the necessary TCP and UDP ports, and then I'm going to deploy a load-balanced service that allows applications to talk to the cluster. I'm not sure why that's taking so long to come back. Now, before I hit enter on this, I'm going to switch over here so you can see the deployment. I'm going to deploy the stateful set, and what you'll see in the console is an ordered deployment of the MariaDB stateful set. We should see this PVC... PVC bound. Now we have a persistent volume, and you see the first node in our three-node Galera cluster is now starting. So let's just pause for half a second; Frank's asking a question.
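(For reference, the registry route, the image build and push, and the service account steps described above might look roughly like this. The image and project names, and the use of the privileged SCC, mirror the demo's description and are illustrative only.)

    # Expose a default route for the internal registry
    oc patch configs.imageregistry.operator.openshift.io/cluster \
      --type merge -p '{"spec":{"defaultRoute":true}}'

    # Log in to the exposed registry and push the customized MariaDB Galera image
    REGISTRY=$(oc get route default-route -n openshift-image-registry -o jsonpath='{.spec.host}')
    podman login -u "$(oc whoami)" -p "$(oc whoami -t)" --tls-verify=false "$REGISTRY"
    podman build -t "$REGISTRY/mariadb-galera/mariadb-galera:10.4" .
    podman push --tls-verify=false "$REGISTRY/mariadb-galera/mariadb-galera:10.4"

    # Namespace and the service account the stateful set runs as
    oc new-project mariadb-galera
    oc create serviceaccount mariadb
    oc adm policy add-scc-to-user privileged -z mariadb   # demo shortcut; a narrower SCC may suffice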
He missed the very first part of your Rook operator installation, and he says he has no Rook operator with his IPI-based cluster installation. Is the OperatorHub filtering operators based on the UPI or IPI installation choice? Maybe clarify that for him. Okay, good point. I actually deployed the operator from the command line, because it doesn't show up in OperatorHub. See? Not there. Not sure why. It may be one of the ones that we, as a working group, need to add. There's a community version of it on operatorhub.io that works with generic Kubernetes, but I think we need to do some work; it's one of the ones on the priority list for us to work on as a working group. Yeah, and that's why I deployed it using the operator configuration that is provided in the Kubernetes examples in the Rook project itself. All right, so our second cluster node is coming up, and it does an ordered startup and an ordered shutdown, so you can gracefully stop and start this cluster and it will retain its state. When this is done, we'll have a three-node MariaDB Galera cluster, a full multi-master database cluster, running in our OpenShift with provisioned storage. Wow. Well played. Thanks, Charro. That was pretty awesome. And I keep emphasizing in the chat, too, that some of the operator work is among the next things on the roadmap that we're trying to get folks to work on, getting some of those default operators from OperatorHub into the community catalog. We'll be working on that. So you've filled the time very nicely. Justin Pittman is here, and he is going to try to outdo you on bare metal.