One of the key components of working in the ML and AI space is really GPU enablement. We're going to get started right away, because they have a really cool talk and demo, and I want to make sure they get it all in. So without further ado, thank you very much.

Thank you. Hi, my name is Mayur Shetty, and I'm a senior solutions architect at Red Hat. Along with me is Minaz; she's a cluster systems engineer at Supermicro. Before we start, I'd like to give a shout-out to Diane Feddema. She is the AI/ML lead at Red Hat, and she was instrumental in driving this project from the Red Hat side.

The agenda for today is going to be quick. I'm going to cover why containers and Kubernetes for ML workloads - and in particular, why OpenShift - and also how we prepared the systems to run the ML workloads. Minaz is going to talk about the Supermicro hardware and the results we collected during this exercise.

Before I proceed, I'd like to walk through the pipeline and the various personas involved in deploying ML workloads. First and foremost, we collect the data. We collect data from various sources, and this is the raw data coming in; the persona here is the data engineer. In the next phase, the data is stored in data lakes, and the models are tuned, tested, and trained - all of that happens in this second phase, and the persona here is the data scientist. In the first phase you get to see some trends, but the training really happens in the second phase. Then we take certain models and deploy them using the app-dev process, and the application developers are involved in this phase. The data scientists are also involved, because they want to make sure the right models are deployed, and they want to watch for any drift in the data - if new data is coming in, some tuning and retraining needs to be done. So the data scientists are involved along with the app developers. At the end, what you get is an intelligent application that meets some business objective. Across all these phases, one role that is common is IT operations; they're also responsible for the day-two operations of the pipeline.

So why containers, and why Kubernetes in particular, for hybrid cloud and AI/ML workloads? First and foremost, the agility it provides. By this I mean automation for the platform and also for the model frameworks the data scientists are using - all of that can be automated. There's also the auto-scaling feature: with auto-scaling, the data scientist does not have to rely on the IT folks to provide the infrastructure. They can use their own tools to auto-scale and get the infrastructure they need to do their work.

What we've seen is that training and testing are very compute-intensive, so any hardware acceleration is a key benefit to the data scientist. GPU acceleration, integration with security features, uptime - all of this is a key value-add for ML workloads. The next thing is portability. Containers have been benefiting from this for a long time, meaning you can develop and deploy your workloads from anywhere.
Now you can do that for ML workloads too. You can also offer ML as a service. We could do this with containers before, but now we can do it as a service, so the data scientist does not have to focus on writing all those services into their application code. They can rely on existing services: just go to a registry, download the services, integrate them with their applications, and benefit from that. There are also a lot of products and services that help here, mainly around the automation and CI/CD pipelines they bring in. All of this boosts productivity, and last but not least, lifecycle management and operations help the deployment of AI/ML workloads.

Some of you have already seen this slide earlier. ML workloads are highly data- and compute-intensive, and at the same time OpenShift is a distributed platform - that's where they really meet, because we're talking about a workload that is highly compute- and data-intensive, and we're talking about a distributed system. Ideally, these microservices can be deployed and orchestrated across shared resources. Also, as Sherard mentioned earlier, these services can now sit behind load balancers and are scalable: you can add resources as and when you need them, or shrink them when you don't. That's a huge value-add. And with OpenShift, ML workloads can be truly portable, meaning they can run on your private cloud or your public cloud, just like containers.

Lifecycle management is something that is new to the data scientist world. Data scientists can now just focus on writing their code and putting it into a Git repository, and a source-to-image style feature, together with the CI/CD pipeline integrated into OpenShift, takes that source, creates an image, and deploys it. The testing hooks available in OpenShift can also be leveraged by the data scientists. And all the learnings we've gained from the container world can benefit machine learning workloads. All in all, this makes OpenShift a very compelling story for ML workloads.

Let me move on to the actual project we did together: GPU as a service on OpenShift. Before we even started to run our benchmarks, there were some prerequisites we had to take care of. First and foremost, you have to make sure the GPU drivers are running on the servers that have the GPUs. As part of that, we collected the GPU names, which we use later on when labeling the machines. The Docker that ships with RHEL already supports OCI runtime hooks, so we didn't have to do anything there; we focused on getting the NVIDIA container runtime hook configured. Once that was done, the system was ready to run Docker containers, and at this point we used plain Docker commands to deploy containers on the GPU machines. We tested with the CUDA vector-add containers, and once that test passed, we knew that our drivers, our hooks, and our runtime parameters were all fine. Now we were ready to configure OpenShift; OpenShift 3.11 is what we used for this project.
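To make that GPU request concrete before getting into the OpenShift pieces, here is a minimal sketch of the kind of pod spec involved. This is an illustration, not one of the project's actual manifests; the image name is a placeholder for any CUDA vector-add style test image.

```yaml
# Minimal sketch: a pod that runs a CUDA vector-add style smoke test and asks
# the scheduler for one GPU via the extended resource exposed by the NVIDIA
# device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: quay.io/example/cuda-vector-add:latest   # placeholder image name
    resources:
      limits:
        nvidia.com/gpu: 1   # request a single GPU for the test
```

The GPU request on the last line is what the scheduler acts on, which is exactly what the next part of the talk walks through.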
The device plugin API was already enabled in OpenShift 3.11, so we didn't have to do anything there; we just had to focus on the device plugin itself and make sure the NVIDIA device plugin was running on the hosts that had the GPUs. Once we had that configured, we again tested it using the CUDA containers, and I'll show you that on the next slide. What you see there, on the last line of the YAML file, is the request that says a GPU is required for this particular container (much like the sketch shown above). The OpenShift scheduler sees this information and schedules the pod on a node with GPUs. At that point, the kubelet and the device plugin API coordinate to make sure this particular container gets the GPU resources it needs.

At this point we had OpenShift configured and the NVIDIA device plugin running on the hosts, so the next step was containerizing our benchmarks. We containerized the benchmarks using the MLCC tool from Red Hat, created the images, and pushed them to a Quay repository. Once we had the benchmark images on Quay, we ran the performance tests using those images (a rough sketch of how such a run can be expressed as a job follows at the end of this passage). At this point, I'll hand it over to Minaz, and she can talk about the Supermicro servers and the results.

Thank you, Mayur. I'm Minaz, by the way, from Supermicro. Before I dive into the benchmark numbers, I just want to tell you a little bit about Supermicro. Supermicro is one of the leading providers of SuperServers in the industry today. Our headquarters is in San Jose, and we also have branches in the Netherlands and in Taiwan. Supermicro is one of the leading manufacturers of a huge array of hardware, including servers, networking devices, server management software, HPC, AI - you name it, and we provide the whole hardware stack for you. We also do exciting solution builds like this one, and we at Supermicro are thrilled to partner with Red Hat here, where we ran real-life AI workloads on top of OpenShift for the first time.

Let me start with the solution reference architecture. We built a 10-node cluster, which I'll show you in detail on the next slide. For the actual OpenShift building block, we used Supermicro's BigTwin SuperServer, which is known for its very dense parallel compute power as well as its large memory footprint. For running the actual AI workloads, we used Supermicro's TVRT GPU server; I'll give you the actual spec details of the servers on later slides. And for networking, we used our own Supermicro switches; we employed both 10G and 100G switches for this project. This is a summary of the software stack - for example, which RHEL we used (RHEL 7.6), and, as Mayur mentioned, OpenShift 3.11, the CUDA versions, and so on.

Coming back to the solution building blocks: on your left, you have the actual OpenShift cluster building block, which is the Supermicro BigTwin again. I'm not going to go into all the details of the CPUs and memory, but if you have any questions about any of the servers, please let me know and I'll get back to you. And on your right, you have Supermicro's GPU server, which can accommodate up to eight Tesla V100 SXM2 GPUs - the GPUs we actually used for this benchmarking.
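Going back for a moment to the containerized benchmarks Mayur described: as a rough illustration only (the image path, command, and GPU count are placeholders, not the project's actual manifests), a benchmark run pulled from Quay could be expressed as a Kubernetes Job along these lines:

```yaml
# Illustrative sketch of a GPU benchmark run expressed as a batch Job.
apiVersion: batch/v1
kind: Job
metadata:
  name: mlperf-object-detection
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: benchmark
        image: quay.io/example/mlperf-object-detection:latest  # placeholder image
        command: ["./run_and_time.sh"]                         # placeholder command
        resources:
          limits:
            nvidia.com/gpu: 8   # claim all eight GPUs on the GPU node
```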
Again, if you have any detailed questions about the server specs, I'll be happy to answer them. Moving on to the actual hardware setup: in our Supermicro lab, we created a 10-node OpenShift cluster - the standard three master nodes, three infra nodes, and three application nodes, along with one load balancer node. One of those application nodes, as you can guess, is the GPU server, where we actually ran the AI workloads. The network topology is pretty straightforward: we implemented two different network layers, 10G and 25G. The 10G network was basically for management purposes, and the 25G network was there because, as you know, the wider the bandwidth and the lower the latency we can provide to machine learning workloads, the faster the results we're going to get. And again, we built this whole reference architecture and network topology in such a way that it can scale to whatever size your deep learning project might be; the solution architecture can be scaled up or down accordingly, so it's completely customizable.

For the actual benchmark suite we chose MLPerf - if you're familiar with the world of machine learning, you're very familiar with MLPerf. MLPerf is a benchmark suite that covers many of the main applications of machine learning. It basically gives you a set of rules, specific data sets, and specific models, so that the results you produce are comparable across any hardware platform and any framework. From the MLPerf suite, we chose two categories of benchmarks: the first is object detection, and the other is machine translation.

I want to talk just a little bit about the data sets we used. For object detection we used the COCO data set from Microsoft, which contains around 328k images along with more than 2.5 million labeled instances in those images. For machine translation, we used the WMT English-to-German data set. It's a huge data set containing things like news commentary and parliamentary proceedings - a lot of speech-like text - and for our purposes we used the English-to-German translation.

Moving on to the actual benchmarking: as I mentioned, the first one is object detection. Before I talk about the numbers, the basic metric we're comparing here is training time. On the very right side, you can see that if you go to MLPerf's website, the only numbers they have published are mainly run on NVIDIA's DGX-1 platform. So what we did in our lab is create a hardware stack that is very comparable to NVIDIA's DGX-1, whether it's CPU cores, number of GPUs, GPU memory, things like that. We tried to create a very comparable hardware stack to the DGX-1 so that we can compare our results with the published MLPerf results.

Moving on to the first number: heavyweight object detection, which was actually the longest training run we did. Incredibly, we got an even better training time than NVIDIA's DGX-1: ours was around 205 minutes, where NVIDIA's was around 307 minutes. As you know, this much of a difference in timing makes a huge impact in real-life AI training. For the next one, lightweight object detection, we also got very close results.
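To put that heavyweight result in perspective with a quick bit of arithmetic: 307 minutes divided by 205 minutes is roughly 1.5, so the published bare-metal DGX-1 run took about 50% longer in wall-clock time; equivalently, the run on OpenShift finished in roughly two-thirds of the time.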
I will also explain why all these numbers are important even when they're not better than NVIDIA's DGX-1; I'll tell you that story a little bit later. I also want to mention that all our benchmarks here were run using only the PyTorch framework.

The next one is machine translation - English to German, as I mentioned. We ran two different algorithms here, again both on PyTorch. The first is the recurrent translation, where, as you can see, the training time is very close to the DGX-1. The next is the non-recurrent translation, for which we got the exact same training time as the NVIDIA DGX-1. One more thing I want to mention: all these benchmarks were run on MLPerf version 0.6, the latest version, which was published just two months ago. So these are the very latest versions of the benchmarks that we ran.

These are, again, as Sherard mentioned and showed in his cool demos, examples of the OpenShift GUI that you can play around with. Both of these dashboards were created using Prometheus and Grafana. On your left you can see the actual GPU usage - how much of each GPU is being used - along with the GPU memory usage. On the other one you can see the actual GPU temperatures; when you're running training workloads, it's very important to monitor the overall health of your GPUs, as well as power usage (a rough sketch of the kind of Prometheus rule that could back panels like these follows at the end of this passage). The OpenShift GUI gives you a lot of really cool things to monitor - in fact, every aspect of your project, however you want to monitor or control it. This is one of the really cool examples of OpenShift features that we were able to use for our project as well.

On this last slide, I want to talk a little bit about the impact of the numbers I just showed you. First of all, to our knowledge, this is the first real-life AI workload run on OpenShift. Another very important point is that the NVIDIA DGX-1 numbers we're comparing against were all run on bare metal. You would expect a little bit of overhead when comparing a workload running on top of OpenShift against bare metal, so the fact that we can match those bare-metal numbers - if not beat them in one case, as I showed you - is a huge statement by itself. Matching or getting close to those numbers showcases not only OpenShift's performance, but also the overall hardware performance and how well our hardware integrates with OpenShift. And the last advantage is from the hardware point of view: NVIDIA's DGX-1 is a very expensive piece of hardware, as you might be aware, and compared to that, the Supermicro hardware stack we developed, which is very comparable to the DGX-1, is much more cost efficient. So the fact that customers get the same training performance - if not better, in one case - in a much more cost-efficient way is another huge statement on its own.

Before I finish, I want to share some links with you on this slide. The first is the white paper that Red Hat and Supermicro have jointly published. That white paper has all the details of this project - all the numbers, the hardware, everything.
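As a small illustration of the monitoring piece mentioned above, here is a minimal sketch of a Prometheus alerting rule of the kind that could sit behind a GPU-temperature panel. The metric name and threshold are placeholders, since the exact metric names depend on which GPU exporter feeds Prometheus in your cluster.

```yaml
# Sketch only: alert when a GPU runs hot for a sustained period.
groups:
- name: gpu-health
  rules:
  - alert: GPUTemperatureHigh
    expr: gpu_temperature_celsius > 85   # placeholder metric name and threshold
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU temperature has been above 85C for 5 minutes"
```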
I have also provided the Git repository information here. If you go there, you can download all the data sets we used, all the YAML files - everything is in the Git repository. And I've also linked Supermicro's OpenShift solution page here, if you want to take a look at the hardware stack details. So thank you, everyone, for your time. I also want to thank Red Hat for inviting Supermicro - it's a huge opportunity for us, and we're really thrilled to be partnering with you. Thank you.