My name is David Gray and I am an associate software engineer at Red Hat. I work on the OpenShift Performance and Latency Sensitive Application Platform team. In this session, I will be presenting a proof of concept of how some popular high-performance computing, or HPC, tools can be used with OpenShift Container Platform. The tools I will discuss include the Lustre file system, NVIDIA GPUs, and the Message Passing Interface, or MPI. I will be running a molecular dynamics application called GROMACS, distributed across multiple nodes and accelerated by NVIDIA GPUs on those nodes. I will first give some background on driver containers, which are used to enable NVIDIA GPUs and the Lustre client. I will then give an overview of several operators that are used to prepare the cluster for running GROMACS with these tools. Then I will talk about how GROMACS can be containerized and deployed, and I'll give some initial performance results and analysis from this proof of concept. I will also show two demo videos, one about the Lustre pieces and another showing GROMACS running with the cluster autoscaler.

So first, to give some context: why am I trying to run GROMACS on OpenShift? With the emergence of big data and machine learning, scientific computing is becoming a more integral part of enterprise computing. Organizations are running more and more HPC-like workloads to create value for their business, and this change coincides with the rise of containers and Kubernetes for deploying enterprise applications. Increasingly, enterprise applications are deployed in the cloud and in the data center as containers and microservices. In the HPC community, containers have also caught on. Containers are really helpful for packaging complex applications along with their dependencies so that they can run in many different environments. For these containers to be useful, you need a platform to run them on, like Kubernetes. OpenShift is an enterprise-ready Kubernetes platform for the hybrid cloud, and this proof of concept aims to show how HPC applications can be effectively deployed on it.

In this proof of concept, I'm using several operators, including the NVIDIA GPU operator, the special resource operator, the MPI operator, and more. I want to give a little bit of background on what an operator is for those who are new to OpenShift or Kubernetes. In short, an operator is an extension to Kubernetes. It is a method of packaging, deploying, and managing an application which is deployed on Kubernetes and managed using the Kubernetes API. Operators enable the creation of new object types through custom resource definitions, and they manage those custom resources. One example in this talk would be the MPI operator, which introduces the concept of an MPI job resource and manages MPI jobs when they are created by the user. An operator works as a software control loop to set and maintain a desired state of certain Kubernetes objects.

For the purpose of running GROMACS on OpenShift, a few existing operators really simplify the process of deploying a multi-node, GPU-accelerated scientific application on Kubernetes which uses the Lustre file system for storage. The NVIDIA GPU operator makes it so that I don't have to worry about installing the NVIDIA drivers in my GROMACS containers or configuring a special container runtime.
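To give readers following along a concrete sense of what that enables: once the GPU operator is installed, a pod can request a GPU much like it requests CPU or memory. Here's a minimal sketch of such a pod spec; the pod and image names are placeholders I've made up for illustration, not part of this proof of concept:

```yaml
# Minimal sketch: a pod requesting one NVIDIA GPU as a schedulable
# resource. The nvidia.com/gpu resource is advertised by the GPU
# operator's device plugin; the image name is just a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-app
    image: example.com/cuda-app:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1   # one GPU allocated to this pod
```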
The special resource operator makes it easy to set up the Lustre client on all of my worker nodes, so that pods running anywhere in the cluster can use a Lustre file system in AWS through a container storage interface driver. The MPI operator makes deploying MPI applications on OpenShift pretty easy as well. You don't need to worry about how the different containers will find each other and communicate with MPI; running an MPI application becomes about as simple as it would be if you were given a preconfigured MPI cluster in a lab.

Both Lustre and NVIDIA GPUs are enabled in OpenShift using driver containers. Driver containers are being used more and more in cloud-native environments to enable special hardware like GPUs, which require out-of-tree drivers, or to deploy kernel modules like those needed for the Lustre client. Driver containers are especially useful for pure container operating systems like Red Hat Enterprise Linux CoreOS, which is the default operating system for OpenShift nodes.

The special resource operator, or SRO, is an operator which supports the full lifecycle management of an accelerator stack, but in the case of the Lustre client, SRO is also useful for managing the deployment of a single kernel module. I created a recipe for SRO to know how to deploy the kernel modules for the Lustre client. I'll explain how this was done in a later slide and also show a demo video of deploying and using Lustre.

Node feature discovery, or NFD, is another related operator, which is used to automatically label nodes in the OpenShift cluster based on their hardware and software features. For example, the NVIDIA GPU operator uses NFD labels to know which nodes have NVIDIA GPUs based on the PCI vendor ID. NFD also labels nodes with their kernel version, so you can know which nodes can run a certain driver container.

The NVIDIA GPU operator, which you have probably heard of if you're interested in GPUs and OpenShift, is based on SRO and is used to deploy and manage NVIDIA GPU drivers, container runtimes, node labeling, and GPU monitoring. The GPU operator's device plugin advertises the number of GPUs on each node as a resource, similar to CPU or memory, so that GPUs can be requested and allocated to pods. The pods used to run GROMACS use resource requests and limits to ensure that they are scheduled one per node on GPU nodes. The GPU operator also enables monitoring of GPU usage metrics like power, memory, and percent utilization, so you can see how your application is using the GPUs. Those metrics can be viewed in the Prometheus dashboard in the OpenShift console.

Now that you know a bit about SRO and driver containers, I'll explain in more detail the steps required to create a recipe for the special resource operator for deploying the Lustre client driver container. Lustre is a parallel distributed file system which is commonly used for high-performance computing and supercomputing. In order to mount a Lustre file system on Linux, you need the Lustre client, which includes a kernel module. The resources used to deploy the kernel module with SRO are available on my GitHub and will soon be merged into the upstream special resource operator repository as an example. The first step is to define a special resource object, as you can see in the manifest file on the right of the slide. The second step is to create the necessary manifests for SRO to manage. In the case of a single kernel module like the Lustre client, the necessary manifests are just a build config and a daemon set.
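Since the manifest itself is only visible on the slide, here's an illustrative sketch of what a minimal special resource object can look like. The apiVersion and spec fields have varied across SRO versions, so treat this as an approximation rather than the exact recipe:

```yaml
# Illustrative only: a minimal special resource object telling SRO to
# apply the Lustre client recipe. The exact apiVersion and spec fields
# depend on the SRO version in use.
apiVersion: sro.openshift.io/v1beta1
kind: SpecialResource
metadata:
  name: lustre-client
spec:
  namespace: lustre-client   # namespace where the driver container is deployed
```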
The build config describes how to build the driver container image based on a Dockerfile. The daemon set defines which nodes to run the container on, along with the necessary privileges and host mounts to enable the kernel module to be loaded and used by other pods on the node. Along with the daemon set is a config map which defines the entry point script for the driver container, which loads and unloads the kernel modules.

For this proof of concept, we're using an OpenShift cluster running on AWS, because AWS has a Lustre file system service called Amazon FSx for Lustre. FSx for Lustre allows Lustre file systems to be created and linked to an S3 bucket, which enables really convenient and high-performance consumption of data that is already stored in S3. There's a container storage interface driver for FSx for Lustre, which really simplifies the use of Lustre file systems in Kubernetes on AWS. The container storage interface, or CSI, is a standard for exposing storage systems to Kubernetes. Normally this CSI driver would require every node to have the Lustre client installed on the host, but thanks to the Lustre client driver container we deployed with SRO, this part can be done with containers. As far as I know, there isn't currently a general-purpose CSI driver for Lustre, though that would be a great community contribution if you're interested. The demo video shows how the CSI driver enables us to create a persistent volume claim to access the Lustre file system from pods.

This demo shows the steps required to deploy the Lustre client using SRO. I've already used SRO to create a driver container base image with some required dependencies. I first create a namespace for the driver container, then the special resource object, then we enable the Lustre client container build to pull the driver container base image, and then we create a config map which contains the manifests needed by SRO to build and run the driver container. In the Lustre client namespace, we will see the driver container pods scheduled to the nodes, waiting for the driver container build pod to finish. We can watch the build logs progress. Once the build is complete, we should see the driver containers start running, and then we can check that the Lustre modules are loaded using lsmod.

Now that the kernel modules are loaded, the next step is deploying the AWS FSx CSI driver, which requires the Lustre client kernel modules. Once the CSI driver is running, we can use it to dynamically allocate a Lustre file system in AWS using a storage class and a persistent volume claim. The storage class defines some parameters for the file system, including the S3 bucket that it's backed by, and the persistent volume claim defines the size of the file system based on the storage class. Once these two resources are created, we can look at the CSI controller logs and watch the CSI driver dynamically allocate the file system in AWS, as you can see on the left. Then, once the file system is ready, we can deploy a test pod to mount the PVC and verify that it can read and write to the Lustre file system. Once the test pod is running, we can see that it contains the input data for GROMACS, which is stored in S3, and that it can also write to a test text file.
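For reference, here's a rough sketch of the kind of storage class and persistent volume claim used in this step. The subnet, security group, and bucket values are placeholders I've invented, and the exact parameter names should be checked against the AWS FSx for Lustre CSI driver documentation:

```yaml
# Sketch of dynamic provisioning with the FSx for Lustre CSI driver.
# Subnet, security group, and bucket names below are placeholders.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-0123456789abcdef0       # placeholder subnet ID
  securityGroupIds: sg-0123456789abcdef0   # placeholder security group
  s3ImportPath: s3://example-gromacs-data  # S3 bucket backing the file system
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fsx-sc
  resources:
    requests:
      storage: 1200Gi   # FSx for Lustre capacity starts at 1.2 TiB
```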
The Message Passing Interface, or MPI, is a specification for message passing libraries, and MPI libraries are the de facto standard for writing parallel message passing programs. Many HPC applications, including GROMACS, are written so that they can be compiled to run across multiple nodes using MPI. The MPI operator enables the easy deployment of MPI programs on Kubernetes and OpenShift. The MPI operator came out of the Kubeflow project and was originally created for running distributed training of machine learning models on GPU nodes in Kubernetes. The operator defines the MPI job custom resource. MPI jobs can be created using a YAML manifest, which defines the mpirun command, the hardware resources required for the launcher and worker pods, and the number of worker pods. Once an MPI job is created, the operator creates the worker and launcher pods and configures them to be able to communicate using the MPI library installed in the containers.

To containerize GROMACS, I used a research tool called Machine Learning Container Creator, or MLCC, which was created by some members of the performance and scale team at Red Hat. MLCC makes it easier to create containers with various machine learning and HPC frameworks and libraries. MLCC takes a set of packages as input; container file fragments for the selected items, along with some dependencies, are then used to construct a container file, or Dockerfile, for building the container. There's a GUI and a command line interface; the GUI is on the left of the slide. For running GROMACS, I used it to create a container file based on Red Hat Universal Base Image 8 for building a container with Open MPI and the CUDA libraries. Once I had this base container file, I added the necessary steps to compile GROMACS with CUDA and MPI support. The container built with this container file will be used for the launcher and worker pods of the MPI job.

Okay, so now the fun part. We have all of the pieces set up on our cluster: the GPU operator, the Lustre client driver container, the Lustre CSI driver for AWS FSx, and the MPI operator. And we have a container image for running GROMACS thanks to MLCC. In this short demo video, I will show the process of launching the MPI job for GROMACS on OpenShift. In this demo, I will also show how you can use an OpenShift feature called the cluster autoscaler. I configured the cluster autoscaler to automatically scale a machine set to add more GPU nodes to the cluster when there are not enough nodes with the necessary resources to schedule all of the pods. The cluster autoscaler is easy to set up, scales nodes appropriately for a newly created MPI job, and will also scale down if nodes become unused for a certain amount of time.

The machine set for availability zone us-east-1c controls the number of GPU nodes in our cluster. We create a cluster autoscaler object and a machine autoscaler object to automatically scale this machine set. Then we change the MPI job manifest to run across four GPU nodes instead of the two we currently have running. Once we create this MPI job, the MPI operator will create the worker pods, and we see that two of the worker pods cannot be scheduled. This triggers the cluster autoscaler to scale our machine set. We can watch the machine set scale up and see in the AWS console on the left that two more g4dn.2xlarge instances have been created. Once the new worker nodes are ready, we should see the MPI worker pods get scheduled, and we can watch GROMACS running from the launcher logs.
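To make that concrete, here's a rough sketch of what an MPI job manifest along these lines can look like. The apiVersion, image name, and mdrun arguments are illustrative placeholders, since MPIJob fields have changed across MPI operator versions:

```yaml
# Illustrative MPI job for a GROMACS run; the image, command arguments,
# and apiVersion are placeholders and vary by MPI operator version.
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: gromacs-benchmark
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: gromacs
            image: example.com/gromacs-cuda-mpi:latest   # placeholder image
            command: ["mpirun", "gmx_mpi", "mdrun", "-s", "benchmark.tpr"]
    Worker:
      replicas: 4   # one worker pod per GPU node
      template:
        spec:
          containers:
          - name: gromacs
            image: example.com/gromacs-cuda-mpi:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 1   # schedules one worker per GPU node
```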
The primary goal of this POC is to demonstrate the functionality of common HPC tools on OpenShift. But what about the performance? This graph shows the results from running a common GROMACS example benchmark, which simulates water molecules, in this case with 1.5 million atoms. The CPU results are from several m5.2xlarge instances on AWS, which have eight vCPUs, or four physical cores, since we don't want to use hyperthreads for GROMACS. The GPU results are from G4 instances, which have one NVIDIA T4 GPU and likewise eight vCPUs and four physical cores. For optimal performance, we would likely want more physical CPU cores per GPU and more GPUs per node. These results aren't bad, and it is nice to see some good GPU acceleration, but at higher node counts the performance doesn't seem to scale very well. Our testing has shown that multi-node GROMACS is pretty latency sensitive and may be network limited.

We can do better by bypassing the software-defined network, or SDN, which is an overlay network in OpenShift. The SDN can have some performance overhead due to encapsulation and OpenFlow rules. In general, throughput is not diminished much, but latency can be affected. One way to bypass the SDN is to simply give the worker pods access to the host's network namespace. By doing this, we can see much better scalability. However, this has security implications: it gives the pods access to the loopback device and to services listening on localhost, and it could be used to snoop on the network activity of other pods on the same node if a pod were somehow compromised. The good news is that the SDN can be bypassed without giving pods these privileges using something called Multus, which enables us to set up an additional network for the MPI pods to use, ideally on a separate high-speed interface. We don't have numbers for this on AWS yet, but initial tests on a bare metal OpenShift cluster showed that running this same GROMACS workload on a secondary network with Multus gave the same performance as enabling host networking.

So, in this session we've demonstrated how the HPC tools Lustre and MPI and NVIDIA GPUs can be effectively used on OpenShift to run a high-performance scientific application. Going forward, we aim to do more performance testing and tuning of HPC applications like GROMACS running on OpenShift with the MPI operator. We hope to explore using Multus to enable high-speed networking, as well as other ways to improve performance, such as isolating infrastructure pods to specific worker nodes. Look out for future articles on the OpenShift blog covering these topics. Thanks for listening.