Hi everyone, my name is David Gray and I'm an associate software engineer at Red Hat working on the OpenShift Performance and Latency Sensitive Application Platform team. Hello, I'm Kevin Pouget. I'm a senior software engineer on the same team as David. In this talk, we are going to present the tool set we use to deploy parallel scientific HPC workloads on OpenShift Container Platform, and we'll introduce the performance results we obtained in a proof-of-concept environment using these tools. First, we'll give a bit of background about OpenShift and why you'd want to use it as a platform on which to run scientific HPC applications. Then we'll go over the setup of some useful tools and OpenShift features which can be used when deploying workloads like these, including how to enable the Lustre file system, CephFS, the MPI Operator, and Multus CNI. Finally, we'll go over some of the performance results we've collected so far from the scientific applications GROMACS and SPECFEM3D Globe, as well as some MPI micro-benchmarks we've run.

First, to give some context: why are we testing high-performance scientific applications on OpenShift? With the emergence of big data and machine learning, scientific computing is becoming a more integral part of enterprise computing. Organizations are running more and more HPC-like workloads to create value for their business, and this change coincides with the rise of containers and Kubernetes for deploying enterprise applications. Increasingly, enterprise applications are deployed in the cloud and in the data center as containers and microservices. In the HPC community, containers and the public cloud are also catching on. Containers are really helpful for packaging complex applications along with their dependencies so that they can be run in many different environments. For these containers to be useful, you need a platform to run them on, like Kubernetes. OpenShift is an enterprise-ready Kubernetes platform for the hybrid cloud, and this talk aims to show how HPC applications can be effectively deployed on OpenShift.

Both GROMACS and SPECFEM are scientific workloads which support multi-threading via OpenMP and parallelization over multiple nodes with MPI. GROMACS is a package for running molecular dynamics simulations; we benchmarked it with the popular water dataset, which simulates hundreds of thousands of water molecules. Typically, GROMACS with MPI scales well up to about 20 nodes or so, but beyond that, high-speed networking with technologies like InfiniBand becomes a must. SPECFEM, on the other hand, is a geophysics code that simulates earthquakes and seismic wave propagation in the Earth's crust. It's a reference benchmark for supercomputers thanks to its good scaling capabilities. Although it will not be discussed much in this lightning talk, both of these workloads can be accelerated by GPUs. On OpenShift, this can be enabled using the NVIDIA GPU Operator, which is based on the Special Resource Operator that will be discussed later in this talk.

To run GROMACS and SPECFEM3D Globe, we used several operators, including the MPI Operator, the Special Resource Operator, and more. So I want to give a little bit of background on what an operator is for those who are new to OpenShift or Kubernetes. In short, an operator is an extension to Kubernetes. It's a method of packaging, deploying, and managing an application which is deployed on Kubernetes and managed using the Kubernetes API. Operators enable the creation of new object types through custom resource definitions, and they manage those custom resources. One example in this talk would be the MPI Operator, which introduces the concept of an MPIJob resource and manages MPIJob objects when they are created by the user. An operator works as a software control loop to set and maintain the desired state of certain Kubernetes objects. For the purposes of running parallel, multi-node HPC software on OpenShift, a few existing operators really simplify the process.
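To make the control-loop idea concrete, here is a minimal sketch of how an operator-style loop could watch MPIJob custom resources using the Kubernetes Python client. This is only an illustration, not the MPI Operator's actual implementation; the namespace and the reconcile logic are placeholders.

```python
# Minimal sketch of an operator-style control loop (illustration only, not the
# real MPI Operator's implementation). Assumes kubeconfig credentials are available.
from kubernetes import client, config, watch

config.load_kube_config()  # inside a pod, config.load_incluster_config() would be used
api = client.CustomObjectsApi()

def reconcile(mpijob):
    """Placeholder reconcile step: drive the cluster toward the declared state."""
    name = mpijob["metadata"]["name"]
    workers = (mpijob.get("spec", {})
                     .get("mpiReplicaSpecs", {})
                     .get("Worker", {})
                     .get("replicas", 0))
    print(f"Reconciling MPIJob {name}: {workers} worker pod(s) desired")

# Watch MPIJob objects (the custom resource defined by the MPI Operator's CRD)
# and react whenever one is added, modified, or deleted.
for event in watch.Watch().stream(
        api.list_namespaced_custom_object,
        group="kubeflow.org", version="v1",
        namespace="gromacs", plural="mpijobs"):
    reconcile(event["object"])
```

The real operator reacts to these events by creating the launcher and worker pods for the job and cleaning them up afterwards, as described later in the talk.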
Some of the testing we have done with GROMACS used Lustre as the shared file system to store the input data for the GROMACS containers. For these experiments, we used an OpenShift cluster running on Amazon Web Services, because AWS has a Lustre file system service called Amazon FSx for Lustre. FSx for Lustre allows Lustre file systems to be created and linked to an S3 bucket. There is a Container Storage Interface driver for FSx for Lustre, which really simplifies the use of AWS FSx Lustre file systems in Kubernetes. The Container Storage Interface, or CSI, is a standard for exposing arbitrary block and file storage systems to containers. Although this CSI driver is specifically written to run on AWS, a similar approach could be used for an on-premise Lustre file system instance.

Normally, this CSI driver would require every node to have the Lustre client kernel modules installed on the host, but we were able to deploy the kernel modules with driver containers. Driver containers are being used increasingly in cloud-native environments to deploy kernel modules or to enable special hardware drivers. They are especially useful for pure container operating systems like Red Hat Enterprise Linux CoreOS, which is the default OS for OpenShift nodes. We are using the Special Resource Operator, or SRO, to deploy the Lustre client kmod driver container and the AWS FSx CSI driver for Lustre. We've created a recipe for SRO to manage the deployment of this whole stack, which can be found on the SRO GitHub page. The benchmark results we will share at the end of this talk are from a bare-metal cluster not running on AWS, and we didn't use Lustre in that environment, but I wanted to highlight the SRO recipe we've created for Lustre to show how Lustre or a similar storage system can be enabled in OpenShift.

Most of the benchmarking experiments we've done with the scientific workloads were on an on-premise, bare-metal OpenShift cluster. In this environment, instead of using Lustre, we used CephFS, with OpenShift Container Storage running in external mode. OpenShift Container Storage is software-defined storage integrated with and optimized for OpenShift Container Platform. It is built on Red Hat Ceph Storage and can be deployed on-premise or in the public cloud. In our tests, we ran OCS in external mode so that we could use a CephFS file system running external to the OpenShift cluster. This way, we were able to use the same file system for the bare-metal baseline benchmarks and for our benchmarking on OpenShift, to compare apples to apples. Alternatively, OCS can run Ceph on the OpenShift cluster itself. I will note here that the performance of the workloads we are running, GROMACS and SPECFEM3D Globe, is not dependent on file system performance, and benchmarking the file system itself was not a part of this effort. We aimed to show how Lustre can be used on OpenShift because of its popularity at HPC sites. Additionally, Ceph offers an alternative scalable, flexible, and reliable storage solution, which is easy to use on OpenShift thanks to OCS.
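Once the CSI driver and its storage class are in place, whether FSx for Lustre or CephFS via OCS, pods consume the shared file system through an ordinary PersistentVolumeClaim. Here is a rough sketch using the Kubernetes Python client; the storage class name and the requested size are placeholders, not values from our setup.

```python
# Rough sketch: request a shared, multi-writer volume from a CSI-backed storage
# class (CephFS via OCS here, but FSx for Lustre works the same way).
# The storage class name and size are placeholders, not values from the talk.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "gromacs-input-data"},
    "spec": {
        # ReadWriteMany lets every MPI worker pod mount the same data set.
        "accessModes": ["ReadWriteMany"],
        "storageClassName": "example-cephfs-sc",  # placeholder storage class
        "resources": {"requests": {"storage": "100Gi"}},
    },
}

core.create_namespaced_persistent_volume_claim(namespace="gromacs", body=pvc)
```

The worker pod templates then mount this claim at the path where the application expects its input files.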
The Message Passing Interface, or MPI, is a specification for message-passing libraries, and MPI libraries are the de facto standard for writing parallel message-passing programs. Many HPC applications, including GROMACS and SPECFEM, are written so that they can be compiled to run across multiple nodes using MPI. The MPI Operator enables the easy deployment of MPI programs on Kubernetes and OpenShift. It came out of the Kubeflow project and was originally created for running distributed ML model training. The operator defines the MPIJob custom resource. An MPIJob defines the mpirun command, the hardware resources required for the worker pods, and the number of worker pods. Once an MPIJob is created, the operator creates the worker and launcher pods and configures them to be able to communicate using the MPI library installed in the containers.

One of the things we will discuss as we go through the performance results is using Multus. Multus is an open-source project that enables Kubernetes pods to attach to multiple networks. It is a container network interface (CNI) plugin that calls other CNI plugins. For running MPI applications, we use it to bypass the software-defined overlay network in OpenShift. For certain types of network traffic, the SDN can have some performance overhead due to encapsulation and OpenFlow rules. In general, throughput is not diminished much, but latency can be affected. Another way to bypass the SDN is to give the worker pods access to the host's network namespace. However, this has security implications, as it could be used to snoop on the network activity of other pods on the same node if the pod were somehow compromised. Because of this, using Multus is preferred, and in our tests it gave us the same performance improvement as using the host network.
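To show how these pieces fit together, here is a rough sketch of what an MPIJob for a workload like GROMACS might look like, expressed as a Python dictionary and created through the Kubernetes API. The container image, network attachment name, and resource numbers are placeholders, and the exact field layout depends on the MPI Operator version you deploy; the `k8s.v1.cni.cncf.io/networks` annotation is what asks Multus to attach the secondary network to the pods.

```python
# Rough sketch of an MPIJob that runs an MPI workload over a Multus-attached
# secondary network. Image, network attachment, and sizes are placeholders.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Pod template annotation that tells Multus to attach a second network,
# defined elsewhere as a NetworkAttachmentDefinition named "mpi-net".
multus_annotation = {"k8s.v1.cni.cncf.io/networks": "mpi-net"}

mpijob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "MPIJob",
    "metadata": {"name": "gromacs-water"},
    "spec": {
        "slotsPerWorker": 4,  # MPI ranks per worker pod
        "mpiReplicaSpecs": {
            "Launcher": {
                "replicas": 1,
                "template": {
                    "metadata": {"annotations": multus_annotation},
                    "spec": {"containers": [{
                        "name": "launcher",
                        "image": "example.registry/gromacs-mpi:latest",  # placeholder image
                        "command": ["mpirun", "-np", "16",
                                    "gmx_mpi", "mdrun", "-s", "water.tpr"],
                    }]},
                },
            },
            "Worker": {
                "replicas": 4,  # number of worker pods
                "template": {
                    "metadata": {"annotations": multus_annotation},
                    "spec": {"containers": [{
                        "name": "worker",
                        "image": "example.registry/gromacs-mpi:latest",
                        "resources": {"requests": {"cpu": "4", "memory": "8Gi"}},
                    }]},
                },
            },
        },
    },
}

api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1",
    namespace="gromacs", plural="mpijobs", body=mpijob)
```

Creating this object is all the user has to do; the operator then takes care of generating the launcher and worker pods as described above.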
Now I will hand it over to Kevin for the next part of the talk.

Thank you, David. For this second part of the talk, I will present the results of the benchmarking that we performed on our local cluster. First of all, let me introduce our test environment. The OpenShift cluster was composed of 37 nodes: three masters, two for infrastructure management (telemetry, routing, storage operators, and so on), and 32 worker nodes. To get the reference times, we converted these same 32 worker nodes to run RHEL 8.2 on bare metal. Each of the nodes has two high-speed network cards, one used by OpenShift for cluster communication and one available for our MPI communications. The nodes have four cores, with hyper-threading disabled, and 62 gigs of RAM.

In this slide, we can see the results of the MPI micro-benchmarks, where we compare three different networks on OpenShift: the default software-defined network in blue, Multus in red, and the host network in green. These last two options use a dedicated network for MPI communications, and as David mentioned, the host network isn't a secure option, so it's just here for reference. Last, we show in black the bare-metal results, again for reference. The graphs show the point-to-point latency and bandwidth, and the MPI all-to-all latency with 20 nodes. The main thing to note here is that Multus, in red, has performance right on par with bare metal; this isn't clearly visible in the plot because its lines overlap with the bare-metal ones. The OpenShift software-defined network, on the other hand, shows degraded performance and more variability in the measurements, which was expected because it's not a dedicated network.

Now let me present the results of the SPECFEM benchmarking. In this slide, we see the performance of the cluster for a fixed problem size. This program took 55 minutes to execute on four machines and only eight minutes on 32 machines, as we can see in the time graph. For four machines, the slowdown of OpenShift was only 2%, and it grew up to 12% on 32 nodes, where the OpenShift overhead starts to eat into the extra computing power. But on the strong-scaling chart, we can see that the efficiency remains very good, not far below the bare-metal reference and the ideal linear scaling. We can also remark that the different OpenShift networks have barely any effect on the SPECFEM results. In this second slide, we see the performance results for multiple problem sizes, with most of the execution times between one and three hours. We can see clearly that when we feed bigger problems to the cluster, we get back very close to the bare-metal reference, with less than 6% of overhead on most of the runs. And for problems big enough, the scaling holds all the way up to 32 nodes, including the super-linear efficiency. I will let David continue with the GROMACS results.

Thanks, Kevin. The graphs here show the average and standard deviation of five or more runs of GROMACS. The performance metric is the hours of computation needed to simulate one nanosecond, so lower is better. On the left are the results of running GROMACS at different node counts with Multus. The green line here reflects the performance while running the GROMACS pods on worker nodes which are also running some cluster infrastructure pods, such as monitoring, the cluster ingress controller, and logging. The rest of the results we gathered are with these pods isolated to two specific infrastructure nodes, to prevent them from using any CPU cycles during the benchmark run, which improves performance and reduces variability. The graph on the right shows all of the OpenShift results compared with bare metal. As we can see, the result using the SDN for all of the network communications is the outlier here. To get good performance on more than two nodes, bypassing the SDN appears to be crucial, and using Multus to do this gives performance just about on par with bare metal. This graph shows the parallel efficiency of scaling GROMACS up to higher node counts. As the previous graph showed, we can improve throughput by adding more nodes, but the main thing to show here is that we get similarly strong scaling with OpenShift as with bare metal.

In conclusion, these results show that OpenShift Container Platform can be effectively used to deploy containerized scientific applications. By using Multus to bypass the software-defined networking layer on a secondary network, we can get performance that is within a few percent of bare metal. Additionally, isolating some of the OpenShift infrastructure pods to specific infrastructure nodes can also help to reduce overhead. We have more results of running these workloads on OpenShift than we were able to fit into this lightning talk, but we have a couple of blog posts for the OpenShift blog on this topic, so keep your eye out for these on blog.openshift.com for a deeper dive. Thanks for listening.
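As a closing illustration of the strong-scaling efficiency discussed above, the SPECFEM timings quoted in the talk (55 minutes on four machines, eight minutes on 32) can be turned into speedup and parallel-efficiency figures with a few lines of Python. The helper below is generic; only the two timings come from the talk.

```python
# Strong-scaling speedup and parallel efficiency from two run times.
# The 4-node and 32-node timings are the SPECFEM numbers quoted in the talk;
# the helper itself is generic.
def strong_scaling(t_base, n_base, t_scaled, n_scaled):
    """Return (speedup, efficiency) when going from n_base to n_scaled nodes."""
    speedup = t_base / t_scaled
    ideal = n_scaled / n_base
    return speedup, speedup / ideal

speedup, efficiency = strong_scaling(t_base=55, n_base=4, t_scaled=8, n_scaled=32)
print(f"speedup: {speedup:.2f}x (ideal {32 / 4:.0f}x), efficiency: {efficiency:.0%}")
# -> speedup: 6.88x (ideal 8x), efficiency: 86%
```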