Hi, I'm Professor DK Panda from the Ohio State University. Together with my colleague, Dr. Hari Subramoni, we are very excited to present our work, which is titled "Designing High-Performance Scalable Middleware for HPC, AI, and Data Science in Exascale Systems and Clouds."

As many of you know, we are in a very exciting phase of high-performance computing. We have the Fugaku system currently delivering around 442 petaflops, and we are aiming for exascale systems in the very near future, perhaps towards the end of this year or the next. In this context, one can ask the question: how do we design the software stacks for these kinds of large-scale systems to take care of not only HPC, but also deep learning, machine learning, and data science? All of these come back to the middle box here, which is the programming models. There are different programming models for these different kinds of environments: MPI, PGAS, CUDA, TensorFlow, Python, Hadoop, Spark, et cetera.

Now, if you take a look at how these systems are being built with commodity computing environments, we have multi-core and many-core architectures, different kinds of networking technologies, and different kinds of accelerators. At the top, we want the application kernels of HPC, DL, ML, or data science applications to get the best performance. How do we bridge the gap? This is where the role of the middleware comes in. Below the programming models we have the communication library, or runtime, that supports them. One can design this in a layered manner, or one can also co-design across layers. There are a lot of co-design opportunities people have explored over the years: you make enhancements at different layers and put them together, and that leads to very nice designs with good performance, scalability, and resiliency.

In this context, let's ask the question: as we head into exascale systems, how are people trying to design MPI, the Message Passing Interface? That is a very common programming model, and together with it people are exploring different kinds of combinations like OpenMP or PGAS, which in general is called MPI+X. If you take a look at this bulleted list, I don't want to go over all the details, but these are some of the requirements, or a wish list, for the programming models we design. They should be able to scale from millions to billions of processors for point-to-point communication. The job startup should be very fast. They also need a very low memory footprint, because if the programming model or middleware takes a lot of memory, applications cannot run well on these systems. Then of course we need very good support for scalable collective communication, such as allreduce, reduce, and all-to-all. As the nodes are becoming denser with more and more cores, you also need to balance intra-node and inter-node communication. We need to provide support for efficient multi-threading, integrated support for accelerators, fault tolerance, resiliency, quality-of-service support, et cetera.

In this context, we started a project called the MVAPICH2 project. Many of you might be familiar with this project.
For many years we have been working on this; we started working on it almost from day one of InfiniBand. When InfiniBand was introduced in 2000, we were ready to start working on high-performance MPI design on top of InfiniBand. Prior to that, we were working on commodity interconnects like Myrinet and Quadrics; some of you might remember those names. Starting around 2000-2001, we had our first open-source version demonstrated at Supercomputing '02, almost 19 years back now. Over this period, we have been continuously enhancing it as the MPI standard has been evolving and as newer technologies like Omni-Path, iWARP, RoCE, AWS EFA, et cetera, have come along. We have been continuously enhancing the software stack, and currently these libraries are being used by more than 3,200 organizations in 89 countries. Just from our website, we have more than 1.46 million downloads. It is also available through a lot of other software vendors and Linux distros like Red Hat, SUSE, OpenHPC, and Spack; we don't keep track of those downloads. It is also very widely used in a lot of Top500 systems; some examples are indicated here, both on the CPU side and on the GPU side. So we have been empowering many of these Top500 systems for more than 15 years. Here you can see our release timeline and downloads; over the last several years, downloads from our site have been steadily rising.

Here we show the overall architecture of the MVAPICH2 software family. Even though we started with HPC, we have now expanded into deep learning, machine learning, and data science, and this is where we'll be spending a lot of time today. As you can see, at the bottom we provide support for all the different networking technologies. We have support for many-core architectures, including accelerators from NVIDIA as well as AMD GPUs. We have support for MPI, PGAS, and also hybrid MPI+X, and this is where we have our scalable communication runtime. We carry out the research, come up with the best designs, and publish them; then, in around six to nine months, we take these designs into our open-source distribution, do very rigorous testing, and make them available to the community.

Over the years, as different requirements have come up, we have created multiple versions of this library; we call it the MVAPICH2 software family. The basic one is MVAPICH2, and there is MVAPICH2-Azure for the Azure cloud; we also have MVAPICH2-X and MVAPICH2-GDR, which is for GPUs. We will gradually provide some highlights of these different releases and how, as an end user, you can utilize these libraries for your needs, whether for HPC, deep learning, machine learning, or data science.

This is the latest release, which we made a few months back: MVAPICH2 2.3.6. It has a lot of features; you can get more details from our website. Let me just highlight some of the features of the 2.3.6 GA release, especially for intra-node communication, startup, collective offload, and also very nice features for performance engineering with MPI. Here it shows some numbers on the very latest AMD Milan system together with InfiniBand HDR 200 Gbps. As you can see, for intra-node latency between two processes running on the same socket, we are able to deliver almost 190 nanoseconds; that is the time within which a full MPI-level message exchange completes.
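To make the latency discussion above a little more concrete, here is a minimal mpi4py ping-pong sketch of how point-to-point latency is typically measured. This is purely illustrative: the message size, warm-up count, and iteration count are arbitrary placeholders, and the production numbers quoted in the talk come from the OSU micro-benchmarks (written in C), not from this toy.

```python
# Minimal MPI ping-pong latency sketch (illustrative only; the numbers in the
# talk come from the OSU micro-benchmarks). Run with: mpirun -np 2 python pingpong.py
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank                      # exactly two processes: 0 <-> 1
msg = bytearray(8)                   # small 8-byte message
iters, warmup = 10000, 1000

for i in range(iters + warmup):
    if i == warmup and rank == 0:
        t0 = time.perf_counter()     # start timing only after warm-up
    if rank == 0:
        comm.Send(msg, dest=peer)    # ping
        comm.Recv(msg, source=peer)  # pong
    else:
        comm.Recv(msg, source=peer)
        comm.Send(msg, dest=peer)

if rank == 0:
    elapsed = time.perf_counter() - t0
    # one-way latency = half the average round-trip time
    print(f"latency: {elapsed / iters / 2 * 1e6:.2f} us")
```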
On the right side you can see the bandwidth: within a node we are able to deliver around 41.87 gigabytes per second. For inter-node, again on InfiniBand HDR 200 Gbps on the AMD Milan platform, we are at around 1.9 microseconds for a half round-trip point-to-point communication, and we are able to deliver 22.3 gigabytes per second, very close to the peak bandwidth.

As systems become very large, when you are trying to run a job, whether traditional HPC or machine learning and deep learning, the job startup itself takes a lot of time. This is what we have been continuously optimizing over the last several years. On the left-hand side, you can see that up to 3,584 processes we are able to complete startup in a very small amount of time. Even for a very large run of about 229,000 processes on 4,000 nodes, MPI_Init takes just 31 seconds; in just 31 seconds your job is ready to run. It also delivers competitive performance compared to other MPI libraries.

Some of you might have been hearing that Mellanox, which is part of NVIDIA now, is doing a lot of work on in-network computing. In particular, collectives like reduce, allreduce, and barrier can be carried out within the network itself. We have done a very tight integration with SHARP, the in-network computing technology delivered by NVIDIA, and we have made some further enhancements on top of it. If your system has SHARP, you can just use the runtime parameter MV2_ENABLE_SHARP=1. Here you can see very good performance; for example, this is MPI_Allreduce on the large-scale Frontera system, almost the complete machine with 7,861 nodes, when it was configured a few months back. On the X-axis we have message sizes up to two kilobytes, and we are able to deliver almost a 5x improvement compared to the pure software-based solution. We have a lot of optimized software implementations of allreduce, but with the in-network computing support we deliver almost a factor of 5x improvement here, a factor of 9x there, and similarly almost a factor of 6x for MPI_Reduce.

There are many other features, and we may not have time to go over all of them, so I strongly recommend you visit the MVAPICH webpage. Over the years, we have also been working very closely with the University of Oregon and their TAU performance tools, and we have done a very tight integration. The MPI 3.1 standard provides something called MPI_T, which is a tools interface. Through this tools interface, you can make the MPI library a white box instead of a black box: a lot of performance data can be exposed to the upper layers, you can do some analysis, and then dynamically control the behavior through the control variables. We have this integrated support, and it helps a lot in terms of performance engineering. When you are running a job, if you see that you are losing time somewhere, or something is taking a lot of memory and the job is not scaling, you should be able to get all this information, and based on your analysis you should be able to tune a lot of internal parameters and get the best performance.
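Tying back to the SHARP feature mentioned above: from the application's point of view nothing changes, the allreduce call stays the same and the in-network offload is requested at launch time with the runtime parameter quoted in the talk. Here is a minimal mpi4py sketch; the array size, rank count, and launcher invocation are placeholders, not a recommended configuration.

```python
# Ordinary MPI_Allreduce from Python; with MVAPICH2, SHARP-based in-network
# offload is requested at launch time, e.g. (exact launcher syntax depends on your site):
#   MV2_ENABLE_SHARP=1 mpirun -np 1024 python allreduce_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
local = np.full(256, comm.Get_rank(), dtype=np.float64)  # per-rank contribution
result = np.empty_like(local)

# Small-message allreduce: the regime where the in-network offload helps most
comm.Allreduce(local, result, op=MPI.SUM)

if comm.Get_rank() == 0:
    expected = sum(range(comm.Get_size()))   # 0 + 1 + ... + (size-1)
    assert np.allclose(result, expected)
    print("allreduce verified on", comm.Get_size(), "ranks")
```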
Thanks, Dr. Panda. So far, we were looking at some of the basic features that all MVAPICH2 software stacks offer. Now let's go into some advanced features offered by the MVAPICH2 family of software libraries. In this section, we will look at MVAPICH2-X and MVAPICH2-X with support for the Amazon Web Services high-performance computing clusters. While MVAPICH2-X has a lot of different advanced features, in this particular presentation we will focus on a select few: the Dynamically Connected (DC) transport protocol, the cooperative rendezvous protocol, optimized asynchronous progress mechanisms, and XPMEM-based collective operations. Let's look at each of these underlying features and see what sort of impact they can have on the performance of end applications.

If you look at high-performance computing, scale is an important factor: you need to go across multiple nodes, probably up to hundreds or even thousands of nodes. When this happens, your network transport protocol becomes very critical, and a transport protocol that adds low overhead to your communication middleware is essential. Here we look at the performance that an advanced transport protocol like DC can offer over the regular Reliable Connected (RC) transport, which is somewhat similar to TCP/IP in the Ethernet world. We are looking at the performance of an application called NEURON, which performs dense collective communication patterns between all processes involved, at different scales on the Blue Brain cluster at EPFL in Switzerland. As we can see, at a small number of processes like 512 the basic connection management and transport protocols are fine, but as the scale of the job increases, the advanced mechanisms in MVAPICH2-X show significant benefits over the basic mechanisms. More details are available in the talk presented by Dr. Matthias Wolf at the MVAPICH user group meeting in 2020. You can see that at very large scales, the overhead of the RC protocol for connection establishment and communication becomes very apparent.

Now let's go to the concept of a cooperative rendezvous protocol. What is so cooperative about this communication? In a typical communication in a high-performance computing runtime like MVAPICH2, Open MPI, or Intel MPI, process A tries to send a message to process B, and process B passively waits for the message to complete. In the cooperative mechanism, both processes participate equally to progress the communication by partitioning the load between themselves. By doing this, we are able to show significant benefits in large-message latency and bandwidth, which leads to up to 19% improvement for the Graph500 benchmark at over 15,000 processes. Graph500 is a very popular benchmark in the high-performance computing world, and as with all the solutions mentioned here, these designs are available as part of the MVAPICH2 family of software libraries.

Now, one of the most critical requirements of any communication runtime is the ability to make progress on communication. Obviously, when control is inside the communication runtime, progress is very straightforward.
But what happens when control is outside of the communication runtime, maybe inside the application performing some compute? Who will progress the communication then? Unfortunately, the answer is no one. That is where we have proposed a new, more optimized asynchronous progress design, which is more generic and applicable to different high-performance interconnects. With such a generic design, we see up to 33% and 29% improvement in the performance of P3DFFT and the HPL application, respectively, at fairly large numbers of processes.

Going forward, XPMEM, or shared-address-space solutions, is something that is very critical and has the potential to improve the performance of communication operations in the runtime by several factors. Here we have used the shared-address-space paradigm to accelerate the reduction-based collectives that the Message Passing Interface offers. We show the example of osu_allreduce and osu_reduce, two benchmarks which exercise the MPI_Allreduce and MPI_Reduce primitives. As you can see, the shared-address-space, zero-copy reduction collective designs in MVAPICH2 provide up to 4x improvement for Reduce and up to 1.8x improvement for Allreduce at large message sizes.

Now let's see how these designs translate to high-performance computing clouds like the Amazon Web Services EFA HPC instances. Here we are comparing MVAPICH2-X against another popular implementation of the MPI runtime, Open MPI. As you can see, for critical collective communication patterns like Allreduce and Scatter, the MVAPICH2-X runtime is able to outperform Open MPI by a fair margin. Here we again look at the performance of MVAPICH2-X versus other competing runtimes on Oracle's OCI HPC cloud ecosystem. We have measured the performance of collective operations on eight BM.HPC2 instances and looked at the Broadcast and Reduce collectives; MVAPICH2-X is able to outperform the competing MPI libraries by a large margin across different message sizes.

With that, let's switch gears and look at the GPU computing side. So far, everything we have mentioned has been focused on accelerating CPU-based performance on, let's say, large-scale HPC systems. But what about the performance on NVIDIA and AMD GPUs and the associated GPU-enabled deep learning applications? That is what we are going to look at in the upcoming sections about MVAPICH2-GDR. We make continual releases of all of these software stacks periodically, and they are available for free download from our website. This slide shows some of the more salient features offered by the latest release. One of the more critical features is support for on-the-fly compression of point-to-point messages. What does this mean? GPUs have a lot of computing power, so we are trying to see if we can use some of that power to compress a message on the fly and send it out, so that instead of sending, say, a 64-megabyte message, we may only send a 16-megabyte message and save significantly on communication time. We have also added integrated support for NVIDIA's Collective Communication Library, NCCL, for various MPI collectives.
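Circling back to the asynchronous-progress discussion above, the pattern it targets looks roughly like the sketch below: the application posts nonblocking operations, goes off to compute, and only later calls back into MPI, so without an asynchronous progress mechanism little communication may happen during the compute phase. This is an illustrative mpi4py sketch only; the ring-style neighbor exchange and buffer sizes are made up, and the specific MVAPICH2-X knobs for enabling asynchronous progress are documented in its user guide rather than assumed here.

```python
# Compute/communication overlap pattern that asynchronous progress is meant to help.
# Run with e.g.: mpirun -np 4 python overlap_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

send_buf = np.full(1 << 20, rank, dtype=np.float64)   # ~8 MB "halo" for the right neighbor
recv_buf = np.empty_like(send_buf)

# Post nonblocking communication, then go compute; ideally the transfer
# progresses in the background while this rank is busy inside NumPy.
reqs = [comm.Isend(send_buf, dest=right), comm.Irecv(recv_buf, source=left)]

local = np.random.rand(2000, 2000)
interior = local @ local.T            # "interior" compute, independent of the halo

MPI.Request.Waitall(reqs)             # only now is the halo data actually needed
boundary = interior[0, 0] + recv_buf.sum()
if rank == 0:
    print("done; boundary term =", boundary)
```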
And the latest release also has full support for the NVIDIA DGX and DGX-2 range of systems.

With that, let me present the highlights of some MVAPICH2-GDR features for HPC, deep learning, machine learning, and data science. We'll talk about what CUDA-aware MPI is, the support for AMD GPUs, the support for on-the-fly compression, optimized collective communication support for the DGX systems, and how all of these can benefit high-performance deep learning, machine learning, and data science with Dask.

Now, what is a GPU-aware communication runtime? If you take a look at the traditional model for using GPUs, an application developer copies some data from the CPU to the GPU, launches a kernel, copies the data back to the CPU once the kernel is complete, and then moves the data from that CPU to a remote CPU, which does similar operations on its side. While this is very simple to do, it is not high performance, and it adds additional overheads for the user. What we would like instead is this: the user just moves the data from the CPU to the GPU, then calls an MPI operation on data that is resident on the GPU, and lets the MPI library handle the data movement. The application developer can call MPI_Send on a device buffer, that is, a GPU buffer, and the communication runtime takes care of optimally staging the data from that GPU to the remote GPU using a variety of mechanisms. Using this, we achieve both high performance and high productivity. Why is it high productivity? Because the user no longer has to explicitly move the data from the GPU to the CPU after the compute happens. And why is it high performance? Basically because of this: modern HPC architectures are so complex that moving data from a GPU to a remote GPU has 16 possible different paths, depending on the relative locations of the CPU, the GPU, and the network adapter. Expecting a typical application developer to be aware of these 16 different paths is unreasonable, and they should not be forced to deal with them either. Your traditional application developer is a domain scientist, like a physicist, a chemist, or a weather-modeling person; they should not be forced to understand the intricacies of the underlying hardware architecture. That should be the domain of the communication library. That is what we aim for: a high-performance and highly productive communication runtime for device-to-device communication, and that is MVAPICH2-GDR.

So what's the big deal with all of this? The big deal is this: with such high-performance communication mechanisms, we can provide 1.85 microseconds of communication latency from a GPU on one physical compute node to a GPU resident on another physical compute node. That is a factor of 10 improvement over the naive data-transfer mechanisms that regular users would otherwise have to use, and the same improvement extends to the bandwidth and bi-directional bandwidth operations as well.
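To make the CUDA-aware MPI idea concrete, here is a minimal sketch using mpi4py with CuPy arrays; mpi4py can hand GPU buffers directly to a CUDA-aware MPI library such as MVAPICH2-GDR. The buffer size and the simple two-rank exchange are arbitrary illustrations, not anything prescribed by the library.

```python
# MPI_Send/MPI_Recv directly on GPU-resident buffers (CUDA-aware MPI).
# Requires a CUDA-aware MPI build (e.g. MVAPICH2-GDR) underneath mpi4py.
# Run with e.g.: mpirun -np 2 python cuda_aware_demo.py
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank

gpu_send = cp.full(1 << 22, float(rank), dtype=cp.float64)  # ~32 MB buffer on the GPU
gpu_recv = cp.empty_like(gpu_send)

# No explicit copy to host: the device buffer is handed straight to MPI and the
# runtime chooses the best GPU-to-GPU path (GPUDirect RDMA, pipelining, staging, ...).
if rank == 0:
    comm.Send(gpu_send, dest=peer)
    comm.Recv(gpu_recv, source=peer)
else:
    comm.Recv(gpu_recv, source=peer)
    comm.Send(gpu_send, dest=peer)

cp.cuda.runtime.deviceSynchronize()
print(f"rank {rank} received value {float(gpu_recv[0])} from rank {peer}'s GPU buffer")
```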
This benefit is not only seen on one type of architecture like x86; the same benefits are seen on other architectures, like the OpenPOWER architecture with NVLink2 interconnecting the GPUs. Here are some performance numbers from an OpenPOWER system with NVLink2, Volta V100 GPUs, and two ports of EDR InfiniBand: we get an intra-node latency of around 0.76 microseconds, an inter-node latency of 2.18 microseconds, an intra-node bandwidth of 65.48 gigabytes per second, and a peak inter-node bandwidth of 23 gigabytes per second.

On this slide, we look at the performance of the latest NVIDIA A100 GPUs on the latest-generation AMD EPYC processors. Here we are looking at two different metrics: intra-node device-to-device, or GPU-to-GPU, point-to-point latency and bandwidth, and inter-node, as in between two compute nodes, device-to-device point-to-point latency and bandwidth. The system has eight 200-gigabit-per-second HDR InfiniBand adapters from NVIDIA. As we can see, the MVAPICH2 communication runtime is able to offer very low latencies and is able to saturate the network bandwidth quite well.

So far, we have been looking at the NVIDIA GPU ecosystem; now let's look at AMD GPUs. Here we look at the performance that MVAPICH2-GDR can provide on systems with AMD GPUs. This is your intra-node and inter-node point-to-point latency; we have very good performance within a node and across two different compute nodes for point-to-point operations, and the same is true for collective operations. Here we compare two state-of-the-art communication runtimes, MVAPICH2-GDR and Open MPI with UCX support, and we are able to beat the competing runtime by a fairly large margin for collectives as well as point-to-point operations at medium to large message sizes.

Now let me talk a little bit about on-the-fly compression and the corresponding support in MVAPICH2-GDR. Modern high-performance computing systems are computationally very dense; however, the communication bandwidth from one compute node to another is not that high. This presents an ecosystem where you have islands of computation interconnected by rather narrow bridges. So what we are trying to do is balance the system. The system is inherently imbalanced, with a lot of intra-node bandwidth and communication capability but comparatively little inter-node communication bandwidth. Can we balance this out and use the additional compute capability of modern GPUs to compress data that is being sent over the wire, on the fly, with zero changes to the application and with no change to data validation, meaning that even though you are compressing, the application's data validation still passes? This is what we have tried to achieve in the MVAPICH2-GDR communication runtime. Here we see the impact of these designs on the AWP-ODC seismic, or earthquake-modeling, application from the San Diego Supercomputer Center. These results were presented at IPDPS 2021, where the paper was a best paper finalist. The main point is that we are able to significantly reduce the overall communication volume, thereby reducing the runtime per step. You can see that we are able to increase the achieved flops and reduce the runtime per step for very large numbers of GPUs on modern HPC systems. This support is available with the latest release of the MVAPICH2-GDR library, which is freely available for download from our website.
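As a purely conceptual sketch of the compress-before-send idea: the snippet below uses Python's CPU-side, lossless zlib simply to show the pattern and the volume reduction. This is not how MVAPICH2-GDR implements it; there the compression happens transparently inside the runtime, on the GPU, with no change to the application code, and the choice of compression scheme and its parameters belongs to the library.

```python
# Toy illustration of compressing a message before it crosses the "narrow bridge"
# between nodes. MVAPICH2-GDR does the equivalent transparently on the GPU;
# here we just use zlib on the CPU to show the idea and the volume saved.
# Run with e.g.: mpirun -np 2 python compress_demo.py
from mpi4py import MPI
import numpy as np
import zlib

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    field = np.zeros((2048, 2048), dtype=np.float64)   # smooth data compresses well
    field[512:1536, 512:1536] = 1.0
    payload = zlib.compress(field.tobytes(), 1)        # fast, low compression level
    print(f"original {field.nbytes/1e6:.1f} MB -> compressed {len(payload)/1e6:.2f} MB")
    comm.send(payload, dest=1)                          # ship only the compressed bytes
elif rank == 1:
    payload = comm.recv(source=0)
    field = np.frombuffer(zlib.decompress(payload), dtype=np.float64).reshape(2048, 2048)
    print("receiver reconstructed field, checksum:", field.sum())
```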
In this slide, we look at the performance of collective operations on the DGX A100 systems. These are the latest and greatest systems from NVIDIA for high-performance computing, deep learning, and machine learning, for anything that has to do with GPUs, and they have a very rich communication architecture for intra-node and inter-node communication. Here we compare the performance of two of the most popular middlewares for NVIDIA GPU-to-GPU communication: one is MVAPICH2-GDR and the other is NVIDIA's collective communication library, NCCL. As we can see, MVAPICH2-GDR is able to outperform NCCL for all the different collective operations that NCCL supports, like Allgather, Broadcast, Reduce, and Allreduce, across message sizes and different numbers of nodes and processes per node. You will see that MVAPICH2-GDR always performs equal to or better than NCCL.

Now let's see how one can use MVAPICH2, or let's say an MPI-driven infrastructure, for accelerating machine learning and deep learning training. Typical ML/DL applications use one of a few popular frameworks like TensorFlow, PyTorch, or MXNet. Then, in order to get good scalability, they use Horovod from Uber; or, if you are using PyTorch, you have the additional flexibility of using either DeepSpeed or PyTorch's own distributed communication runtime. These distributed communication abstractions can be accelerated by using MPI-driven, or let's say MVAPICH2-driven, communication substrates. So these are the possibilities: you can use MVAPICH2 or MVAPICH2-X for accelerated, distributed, high-performance CPU-based training, or MVAPICH2-GDR for GPU-based training, on any of these compute architectures.

So what is the benefit of using advanced communication runtimes like MVAPICH2-GDR here? In this slide, we look at how one can accelerate TensorFlow using data-parallel distributed deep learning techniques on Oak Ridge National Laboratory's Summit supercomputer with almost 1,500 GPUs. Here we are using ImageNet-1K with close to 1.2 million images, and we saw that MVAPICH2-GDR reached almost half a million images per second of training throughput for the ImageNet-1K benchmark. If you take a look at this, we can potentially train an entire model on the ImageNet-1K dataset in just four and a half minutes. That is the kind of performance you can get with MVAPICH2-GDR. Unfortunately, we were not able to take NCCL 2.6 to this scale, because we ran into scaling issues with NCCL2 beyond 384 GPUs on this HPC system.

In the previous slide, we looked at how one can use MVAPICH2-GDR to accelerate deep learning training on a GPU-based system; now we look at the same on a CPU-based system. Here we look at how you can accelerate distributed TensorFlow on the Texas Advanced Computing Center's Frontera HPC system on close to 2,048 CPU nodes. We get near-linear scaling, and we can potentially train ResNet-50 in just seven minutes; that's some awesome scaling performance right there. Here we show some more numbers on how you can use PyTorch with Horovod and DeepSpeed at scale for training ResNet-50 on 256 V100 GPUs; the training performance on Lassen was close to 10,000 images per second faster than the NCCL-based solution.
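Tying back to the Horovod-based data-parallel path described above, here is a minimal sketch of what such a training script looks like; the model, the synthetic stand-in data, and the hyperparameters are placeholders, and the only point is that the gradient allreduce inside hvd.DistributedOptimizer runs over whatever MPI library (for example MVAPICH2-GDR) the job is launched with.

```python
# Minimal Horovod data-parallel TensorFlow sketch (placeholder model/data/settings).
# Launched over MPI, e.g.: mpirun -np 8 python hvd_demo.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                            # one rank per GPU
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None, classes=1000)
# Scale the learning rate by the number of workers and wrap the optimizer so that
# gradients are averaged across ranks with allreduce (over MPI underneath).
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

x = tf.random.uniform((32, 224, 224, 3))              # stand-in for ImageNet shards
y = tf.random.uniform((32,), maxval=1000, dtype=tf.int32)
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights
model.fit(x, y, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```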
Here we look at the multiple communication substrates that PyTorch can use, like torch.distributed, Horovod, and DeepSpeed, and we see that for all of these substrates MVAPICH2-GDR is able to significantly outperform NVIDIA's collective communication library.

Now, this is a very interesting and important slide on how we are working with pathologists and other computational scientists to enable artificial-intelligence-driven digital pathology. Here is a sample whole-slide image (WSI) of a tissue that is typically given to a pathologist for examination. A lot of computational pathologists have been trying to automate the detection of anomalies in such whole-slide images using deep learning solutions. But the problem is this: each WSI is close to 100,000 by 100,000 pixels, so because of the limited memory on GPUs it obviously cannot fit in a single GPU's memory, and we need advanced distributed deep learning techniques to make this a reality for pathologists. This is where we worked with computational pathologists, real MDs in the field, at the Ohio State University to accelerate their deep learning training, which previously used to take almost 32 hours; we moved it to an HPC system and were able to finish the entire training in just 27 minutes. That is an awesome scale-up, and there is ongoing work between us and various other groups at the Ohio State University.

Thanks, Hari, for going over the deep learning support in the MVAPICH2 libraries and how applications can really take advantage of the MVAPICH2-GDR library for GPU-based deep learning, and MVAPICH2-X for CPU-based deep learning. Let me now move forward and focus on how you can accelerate machine learning applications using MVAPICH2-GDR. What we have done, as you can see here, is propose a new architecture for machine learning around MVAPICH2-GDR. We have mpi4py here, with MVAPICH2-GDR underneath, which runs on top of CUDA; the overall stack also has support for UCX, NCCL, et cetera. We have optimized the collectives and done a tight integration with cuML, and we made a release earlier called MPI4cuML. You can visit our website, called HiDL (High-Performance Deep Learning), which we introduced earlier; the details were also presented in a paper at a machine learning workshop last year. These are some traditional benchmarks used by the machine learning community: K-means, linear regression, nearest neighbors, truncated SVD, et cetera. We compare the case where only NCCL, the NVIDIA collective communication library, is available against the case where you use our MVAPICH2-GDR library, and you can see how much speedup you get. On the left side is training time and on the right is the speedup: about 1.6x for K-means, about 1.6x for nearest neighbors, around 1.25x for linear regression, and around 1.4x for truncated SVD. So it shows that, using the MVAPICH2-GDR library with our MPI4cuML package, you should be able to extract higher performance for your machine learning applications.
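For orientation, this is roughly what the single-GPU cuML workload that MPI4cuML scales out looks like; the dataset size and cluster count are made up, and the MPI4cuML-specific setup (how the distributed handle is created over mpi4py and MVAPICH2-GDR) is described in the MPI4cuML user guide rather than shown here.

```python
# Single-GPU cuML K-means, the kind of workload MPI4cuML + MVAPICH2-GDR scales out.
# (The distributed/multi-GPU initialization is MPI4cuML-specific and not shown here.)
import cupy as cp
from cuml.cluster import KMeans

# Synthetic data resident on the GPU: 100k points in 32 dimensions (placeholder sizes)
X = cp.random.random((100_000, 32), dtype=cp.float32)

model = KMeans(n_clusters=8, max_iter=100, random_state=0)
model.fit(X)                          # iterative centroid updates run on the GPU

print("cluster centers shape:", model.cluster_centers_.shape)
print("labels shape:", model.labels_.shape)
```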
We have recently done similar things for Dask, that is, for data science applications. This is the overall Dask architecture; it has a lot of different components. What we have done is introduce MPI4Dask, which goes through mpi4py with our MVAPICH2-GDR underneath. The Dask architecture currently supports TCP/IP, that is, the regular networking path, and it also has support for UCX; the boxes shown in yellow are the ones we have introduced. We have made two releases, MPI4Dask 0.1 and 0.2, again available from our High-Performance Big Data website; if you are interested, please download it. There is a very detailed user guide, and you should be able to follow it and run your data science applications.

The first benchmark here is the sum of a cuPy array and its transpose, running on our local OSU cluster. Compared to IPoIB, that is, the TCP/IP interface, and the UCX interface running over InfiniBand, if you use MPI4Dask you can really reduce the execution time: we are able to deliver almost 3.47x better performance on average, and on the right-hand side we are able to reduce the communication time by almost a factor of 6.92x. We had a paper on this published at CCGrid '21; if you are interested, please feel free to take a look at it. Similarly, we have a second benchmark, a cuDF merge, on the TACC Frontera GPU subsystem. Here again you will see very similar gains: total execution time is on the left side, and merge throughput is on the right side, where higher is better. On the right-hand side we are able to deliver almost 2.9x higher merge throughput compared to the other solutions, and on the left-hand side we are able to reduce the total execution time by a similar factor.

While we are doing this work, MVAPICH2-GDR is going through a very exciting phase, and we are gradually adding more and more features, both for HPC and for DL. I will highlight three things here: on-the-fly compression for the All-to-all collective, scalable distributed training with model/hybrid parallelism for out-of-core DNN models, and single-image super-resolution. You heard about the basic compression scheme earlier; we are advancing it for the All-to-all collective. Earlier you saw the on-the-fly compression happening on point-to-point communication, and of course one can build All-to-all on top of that, but here we are able to do the compression internally while the All-to-all operation itself is taking place. In particular, with ZFP-OPT at rate 4, we are able to reduce the latency by almost 87 percent for 16-megabyte data on the Frontera RTX system, and on the right-hand side, on Longhorn with V100 GPUs, you again get almost an 87 percent benefit. As some of you might know, the next-generation deep learning models, especially recommendation models, rely heavily on All-to-all; we are planning to make this release in the near future, and you should see a significant speedup for these recommendation models. That covers the recommendation models; for regular deep learning, we are also trying to accelerate transformer models, and for that we have introduced something called sub-graph parallelism.
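Going back to the first MPI4Dask benchmark described above (the sum of a cuPy array and its transpose), the Dask-level computation looks roughly like the sketch below; the array and chunk sizes are placeholders, and how the Dask cluster is actually launched over mpi4py and MVAPICH2-GDR is specific to MPI4Dask and covered in its user guide rather than assumed here.

```python
# Dask-level view of the "sum of array and its transpose" benchmark pattern.
# With MPI4Dask, the inter-worker transfers behind .compute() go over
# mpi4py / MVAPICH2-GDR instead of TCP/IP or UCX; the user code is unchanged.
import cupy as cp
import dask.array as da

# GPU-backed Dask array (placeholder sizes); asarray=False keeps the CuPy chunks
x = da.from_array(cp.random.random((20_000, 20_000)),
                  chunks=(5_000, 5_000), asarray=False)

result = (x + x.T).sum()        # builds a task graph with heavy chunk shuffling
# On a distributed cluster, executing the graph triggers worker-to-worker transfers
print(float(result.compute()))
```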
On this sub-graph parallelism work, we had a paper published at IPDPS 2021. Traditionally, people have been using data-parallel designs; compared to that, if we use data and sub-graph parallelism, our new hybrid parallelism (D&SP), and run on 1,024 GPUs, we get almost a 3.5x speedup over data parallelism alone. The red line is data parallelism, and these are the 2-way, 4-way, and 8-way configurations; as you can see, with the different degrees of parallelism we are able to deliver even better performance and scalability.

Similarly, there is a lot of focus these days on single-image super-resolution: how do you handle a single, very large, high-resolution image? Here we have done a very thorough analysis with our MPI-driven solution and some optimizations, shown in the column called MPI-OPT, and we compare against NCCL. Compared to the default, we are able to improve by almost a factor of 26.33, and compared to NCCL it also delivers even better performance. This paper was published at a workshop held in conjunction with IPDPS '21; again, please feel free to take a look.

As we move ahead, I indicated earlier that these libraries are available through a lot of distros, and recently we have been focusing a lot on distribution through Spack, because that is gaining a lot of momentum. You can do a very easy installation of our MVAPICH2 libraries through Spack; we have support for all the major libraries, MVAPICH2, MVAPICH2-X, and MVAPICH2-GDR. We have a detailed Spack-based installation user guide; please feel free to take a look at it, and it will tell you step by step how to install our libraries on your system. Once you have installed them, as we indicated through the earlier examples, you should be able to utilize these libraries for your HPC, machine learning, deep learning, and data science applications.

With that, let me try to conclude the presentation. We have been continuously innovating in this project over the last 20 years, as you saw, and our next goal is to target the very large exascale systems which are coming up. Here we are aiming for performance and memory scalability towards almost 10 million cores, efficient support for hybrid programming models, and optimized support for GPUs and accelerators; more and more GPUs, for example Intel GPUs, are coming, new kinds of accelerators are coming, and the networks are also providing more and more features, like tag matching and adapter memory, along with new memory technologies such as Intel Optane and high-bandwidth memory, and new fabric interfaces. These are some of the items on our roadmap; we will keep adding more advanced features to our MVAPICH2 releases and pushing this further.

In the end, I would like to extend acknowledgements to all our sponsors; here is a list of our funding agencies. Not only do we have a lot of support from major national funding agencies like the National Science Foundation and the Office of Science, we also have a lot of funding support from industry, all listed here. With generous funding from industry and a lot of equipment donations, we are able to carry out our research and sustain this project over the last 20 years. And last but not least, these are all our heroes.
As you can see, we have tried to summarize the results of our project here: a lot of students and staff have come to Ohio State to join my group and contribute to these projects, and we have been building on top of each other's work. Every time I present, I like to salute all these heroes.

With this, let me conclude the presentation. If you have any questions, please feel free to send us an email. We will also be online during the actual presentation time and will be able to answer questions directly, or if you view this video later on, please feel free to send us questions using any of these email addresses and we will be very happy to answer them. With this, let me stop here. I hope you enjoy the conference, and if you have not used the MVAPICH2 libraries, please feel free to use them; you will be able to extract higher performance and scalability for all your applications. Thank you.