Hello, I'm Kevin Pouget. I'm a Senior Software Engineer in Red Hat's Performance and Scale department, in the team in charge of accommodating performance- and latency-sensitive applications on the OpenShift platform.

Hi, I'm David Gray, an Associate Software Engineer at Red Hat, in the same team as Kevin. In this presentation, titled Benchmarking HPC Workloads on OpenShift, we will discuss using OpenShift Container Platform for deploying high-performance computing applications. In particular, we will discuss the Kubernetes operators and tools we used to set up a proof-of-concept of running and benchmarking two scientific applications on OpenShift, and then we'll go into the performance results we collected.

So, first of all, let me introduce why we want to run HPC applications on OpenShift. One aspect I'd like to highlight is the flexibility of Kubernetes, in particular with public clouds. With one simple command, you can deploy a full cluster on AWS or Google Cloud. And then, with one click, you can get new machines to join the cluster. Depending on the machine type you specify, you will get a powerful GPU, a many-core CPU, a lot of memory, or, on the contrary, a cheap CPU, depending on your requirements. Then, Kubernetes facilitates the deployment of the payload and its reproducibility. Indeed, in Kubernetes you don't deploy your applications manually. Instead, you describe the resources you want to have in the cluster, and Kubernetes operators take care of preparing them. This guarantees that you have the same execution environment on your development workstation and in the production clusters. The usage of containers also improves reproducibility, and containers are used more and more often to package and distribute scientific applications. And the last aspect of HPC is, of course, performance, and the performance of OpenShift is precisely the topic of this presentation.

Now, let me present the scientific applications we used for this benchmark study. The first one is SPECFEM3D Globe, which simulates seismic wave propagation inside the Earth. It's a reference application for supercomputer benchmarking, thanks to its good scaling capabilities. The second one is GROMACS, which performs molecular dynamics simulations; we used the water dataset. Both of these applications support GPU acceleration, OpenMP multithreading and MPI communication for internal parallelism. In this study, we focused on the MPI communication aspect. Now, I will let David continue with the presentation of the proof-of-concept platform.

In this section, I will first introduce the test environment we used for the proof-of-concept, and then introduce a couple of the operators we used to deploy GROMACS and SPECFEM. The OpenShift cluster was composed of 37 nodes: three masters, two nodes for infrastructure services like telemetry, routing or storage operators, and 32 worker nodes. To obtain the bare-metal reference performance results, we converted these 32 worker nodes to run RHEL. Each of the nodes has 62 gigabytes of RAM and four physical cores; hyperthreads were not used. Each node has two high-speed Ethernet NICs, one used by OpenShift for the cluster communication and one which we used for some of the MPI traffic. The cluster was running OpenShift Container Platform version 4.5.7, based on Kubernetes 1.18. For the bare-metal comparisons, we used RHEL 8.2. For this POC, we used CephFS via OpenShift Container Storage as the shared file system.
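To make this concrete, the MPI pods consume such a shared file system through a ReadWriteMany persistent volume claim. The claim below is only a sketch; the claim name, size and storage class name are assumptions that depend on how OCS is deployed.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: specfem-shared                # illustrative name, mounted later by the MPI pods
spec:
  accessModes:
  - ReadWriteMany                     # CephFS lets every node mount the same volume
  resources:
    requests:
      storage: 100Gi                  # illustrative size
  storageClassName: ocs-external-storagecluster-cephfs   # name depends on the OCS deployment
```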
OpenShift Container Storage, or OCS, is software-defined storage integrated with and optimized for OpenShift Container Platform. It is built on Red Hat Ceph Storage and can be deployed on-premise or in the public cloud. In our tests, we ran OCS in external mode, so that we could use the same Ceph cluster for both the bare-metal results and the OpenShift results, to compare apples to apples. Alternatively, OCS can run directly on an OpenShift cluster. The Ceph cluster was running on four Dell servers with 48 cores, 512 gigabytes of RAM, one 800 GB NVMe drive and several 300 GB disks. I'll note here that the performance of the workloads we are running, GROMACS and SPECFEM3D Globe, is not dependent on the file system performance, and benchmarking the file system itself was not a part of this effort. Ceph offers a scalable, flexible and reliable storage solution, which is easy to use on OpenShift thanks to OCS.

The Message Passing Interface, or MPI, is a specification for message-passing libraries, and MPI libraries are the de facto standard for writing parallel message-passing programs. Many HPC applications, including GROMACS and SPECFEM, are written so that they can be compiled to run across multiple nodes using MPI. The MPI Operator enables the easy deployment of MPI programs on Kubernetes and OpenShift. This operator came out of the Kubeflow project, and it was originally created for running distributed training of ML models. The operator defines the MPIJob custom resource. An MPIJob specifies the mpirun command, the hardware resources required for the worker pods, and the number of worker pods. Once an MPIJob is created, the operator creates the worker and launcher pods and configures them to communicate using the MPI library installed in the containers.

One of the things we'll discuss as we go through the performance results is using Multus. Multus is an open-source project that enables Kubernetes pods to attach to multiple networks. It is a container networking interface (CNI) plugin that calls other CNI plugins. By default in OpenShift, pods use the OpenShift SDN plugin. For running MPI applications, we use Multus to attach a second network interface to the pods using the macvlan CNI plugin. This allows us to bypass the SDN and even use a separate physical network for the MPI traffic. For certain types of network traffic, the SDN can have some performance overhead due to encapsulation and OpenFlow rules. In general, throughput is not diminished much, but latency can be affected. Another way to bypass the SDN is to give the worker pods access to the host's network namespace. However, this has security implications, as it could be used to snoop on the network activity of other pods on the same node if the pod were somehow compromised. Because of this, using Multus is preferred, and in our tests it gave the same performance improvement as using the host network. Now I'll hand it over to Kevin for the next part of the talk.

Thank you, David. So in this section, I'm going to present how we ran SPECFEM on OpenShift, and then how we performed the extensive benchmarking. One naive way to get SPECFEM to run in Kubernetes is to prepare a pod, clone the repository, build it and execute SPECFEM. That's nice, but how can we run on multiple machines? Then we can try with the MPI Operator, using an MPIJob. Here we specify the number of replicas of the pod we want to have, currently one worker per node. And again, we clone the repo, build it and ask mpirun to spawn the SPECFEM processes.
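To make these two building blocks more concrete, here is a rough sketch of a Multus attachment for the dedicated MPI network and an MPIJob that references it. The API kinds follow the upstream Multus and Kubeflow MPI operator projects, but the names, image, binary path, NIC name, IPAM choice and resource sizes are all illustrative assumptions, not the exact manifests used in this POC.

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: mpi-network                   # referenced by the pod annotation below
spec:
  # macvlan on the second, dedicated NIC; the interface name and the
  # "whereabouts" IPAM plugin are assumptions for this sketch.
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens2f1",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.100.0/24"
      }
    }
---
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: specfem-benchmark             # illustrative name
spec:
  slotsPerWorker: 1                   # one MPI rank per worker pod
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: quay.io/example/specfem3d-globe:latest        # illustrative image
            command: ["mpirun", "-np", "8", "/opt/specfem/bin/xspecfem3D"]
    Worker:
      replicas: 8                     # one worker pod per node in this example
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: mpi-network             # Multus attachment
        spec:
          containers:
          - name: worker
            image: quay.io/example/specfem3d-globe:latest        # illustrative image
            resources:
              limits:
                cpu: "4"
                memory: 16Gi
            volumeMounts:
            - name: shared
              mountPath: /mnt/shared                             # shared CephFS volume
          volumes:
          - name: shared
            persistentVolumeClaim:
              claimName: specfem-shared                          # the RWX claim sketched earlier
```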
But what about the SPECFEM configuration? How about avoiding cloning and building a one-gigabyte repository on every node? And how can we retrieve the files generated by SPECFEM? SPECFEM runs in two stages: first the mesher, then the solver. How can we coordinate these two executions? When we've got so many questions to answer, the Kubernetes solution is to build an operator instead of doing it manually. So this is what I'm going to present in this animation. It's not exactly an operator that I built, but a Go client that runs on the workstation instead of directly in the cluster; the behavior, however, is the same.

The first step of this work is to prepare a custom resource definition and a custom resource that specify the properties of the SPECFEM execution. Then we start the execution of the operator. The first step is to build the base image, starting from an OS image and cloning the SPECFEM repository. Then, with the configuration from the custom resource, we build the mesher and save the mesher image, and we're ready to trigger the first parallel execution. This mesher job is connected to a shared volume where the artifacts of the mesher execution are stored. Then we can build the solver, using the header file generated by the mesher, and save the solver image. And now we can run the actual benchmark, using the shared volume and the mesh that is stored there. The solver also generates an output log that we retrieve on the workstation for the benchmark analysis.

For this benchmarking, we've seen that with this YAML file we can describe one execution of SPECFEM, where we provide the problem size, the number of processes to fork, the number of processes per node, and the network type. But we want to run many different problem sizes, on different machine counts, and with different kinds of network configurations. If we look at the combinations, we see that their number grows very quickly, and even more so if we want to have the same results on bare metal. So again, the answer is that we need to automate it, this time not with an OpenShift operator, but with a Python script. I call the tool I designed Matrix Benchmarking, because it benchmarks all the different combinations of the settings that I provide. So in the first execution, we'll get OpenShift with the software-defined network, a small problem size, two nodes. And then, step by step, we run all the possible combinations until we've benchmarked the full space. The tool is split in two parts. The first part performs the benchmarking, running all the possible combinations and capturing the relevant output, here the execution time. The benchmark skips results that have already been recorded, which allows stopping and restarting the complete benchmark at will. The second part of the tool is a visualization interface, based on the Python frameworks Dash and Plotly, that allows interacting with the recorded data. The framework lets developers focus on coding the custom visualizations, while the parsing and data-selection code is kept to a minimum. In the following section, we'll see an example of this benchmark, where we collected more than 180 measurements for SPECFEM alone.
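The idea behind the first part can be sketched in a few lines of Python. This is an illustrative reconstruction, not the actual Matrix Benchmarking code; the settings values and the run_benchmark helper are assumptions.

```python
import itertools
import json
import pathlib

# Expand every combination of settings, skip combinations that already have a
# recorded result, run the rest, and store each result on disk so the whole
# campaign can be stopped and resumed at will.
SETTINGS = {
    "platform": ["openshift", "baremetal"],
    "network":  ["sdn", "hostnetwork", "multus"],
    "nex":      [64, 96, 128],           # SPECFEM problem size (mesh resolution)
    "nodes":    [2, 4, 8, 16, 32],       # number of worker nodes
}

RESULTS_DIR = pathlib.Path("results")
RESULTS_DIR.mkdir(exist_ok=True)


def run_benchmark(params: dict) -> float:
    """Placeholder: create the SPECFEM custom resource (or bare-metal job) with
    these settings, wait for the solver to finish, and return its run time."""
    raise NotImplementedError


for combination in itertools.product(*SETTINGS.values()):
    params = dict(zip(SETTINGS, combination))
    record = RESULTS_DIR / ("_".join(f"{k}={v}" for k, v in params.items()) + ".json")
    if record.exists():
        continue  # already benchmarked: skipping makes the campaign resumable
    elapsed = run_benchmark(params)
    record.write_text(json.dumps({"params": params, "time_seconds": elapsed}))
```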
Before going in depth into the results, let me summarize where we are now. We've got an OpenShift cluster with 32 worker nodes that can be converted to bare metal, to reuse the same hardware. We have three different network configurations: the default software-defined network, which is shared with the rest of OpenShift but secured; the host network flag, which also uses the shared network but without any security; and Multus, which allows us to use the dedicated network. And we've got the two scientific applications, SPECFEM and GROMACS, that can run on the OpenShift and bare-metal clusters.

So let me present the benchmarking results, and I'll start with some micro-benchmark figures. We wanted to understand the performance differences between the three networks, so we ran MPI micro-benchmarks that measure the latency and bandwidth of MPI point-to-point operations. In the graph, we see that the software-defined network has poor performance in terms of latency and bandwidth, whereas Multus, bare metal and host network give exactly the same results. The poor performance of the software-defined network is expected, because of the different encapsulation layers for security and routing. Then we looked at MPI collective operations, and again we see that the SDN is clearly slower than the rest, but for small message sizes the network type doesn't really matter. The collective primitives were executed with 20 nodes involved.

Now let me present the SPECFEM results, and first we'll look at a fixed problem size, 128 NEX. In the time graph, we see that the benchmark took between one hour and a bit less than 10 minutes to execute. And if we look at the time comparison between bare metal and OpenShift, we see that the overhead is very limited, less than 6% up to 16 nodes; then, for 32 nodes, the OpenShift overhead takes over, because the problem to solve is not big enough for the 32-node cluster. We can also observe that the network type has almost no effect on the SPECFEM results, thanks to its good computation/communication overlap, which is quite advanced. Now, if we take growing problem sizes, we see that the time ranged from a few minutes to more than three hours. And if we look at the time comparison, we see that for all the problem sizes we are below 4% of overhead. Looking at the efficiency graph, we see that we are always very close to bare metal, and we're even able to obtain the super-linear speedup capabilities of SPECFEM. And I will let David continue with the GROMACS performance results.

Now I will go over some of the benchmarking results we collected from GROMACS. The graph on the left shows the average and standard deviation of five or more runs of GROMACS at different node counts, with various network settings, and also bare metal in blue. The performance is measured as the hours of computation needed to simulate one nanosecond, so lower is better. As we can see, the result using the SDN for all of the network communications is the outlier here. To get good performance on more than two nodes, bypassing the SDN appears to be crucial, and using Multus to do this gives performance similar to bare metal or host network, without the security implications of host network. In the graph on the right, we compare the benchmark time of two configurations as a percentage relative to bare metal. The green line reflects the performance of GROMACS when some cluster infrastructure pods are running on the same nodes as GROMACS; this includes pods such as monitoring, the cluster ingress controller and logging. The red line shows the performance of GROMACS while isolating these infrastructure pods onto two dedicated infrastructure nodes, to prevent them from using any CPU cycles during the benchmark run, which improves performance and reduces variability. In both of these cases, Multus is being used for the networking.
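For readers who want to reproduce this isolation, OpenShift lets you move infrastructure components onto dedicated nodes by labelling those nodes and pointing the components at the label. The sketch below only covers the monitoring stack and assumes nodes labelled node-role.kubernetes.io/infra; similar node-placement settings exist for the ingress controller and logging, and the exact configuration depends on the OpenShift version.

```yaml
# First, label the dedicated nodes, for example:
#   oc label node <node-name> node-role.kubernetes.io/infra=""
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
    alertmanagerMain:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
```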
Lastly, we show a graph of strong scaling of GROMACS. Using Multus or host network, we get strong scaling nearly on par with bare metal. So, in conclusion, these results show that HPC applications and OpenShift are two great things that can go well together. For some workloads, like GROMACS, it's critical for performance to take the SDN overhead into account. Fortunately, Multus can be used to bypass the SDN and even use a separate high-speed network for the MPI traffic. Additionally, isolating some of the cluster infrastructure pods can help to improve and stabilize the performance further. In our work on the PSAP team at Red Hat, we intend to continually improve OpenShift and Kubernetes for efficiently deploying HPC and AI/ML workloads. Our team is working on maintaining and improving several operators to this end, including the NVIDIA GPU Operator and the Special Resource Operator. If you're interested in more information about this HPC POC, please check out our blogs on blog.openshift.com.