I deal with benchmarking and porting of user community applications, so, thank you. Today I'm going to talk about the porting of Quantum ESPRESSO to GPU systems. The work was done by myself, Everett and Josh, who are here with me, and by Filippo, who is now at ARM and was previously at the University of Cambridge. A quick outline: some introduction and objectives, what Quantum ESPRESSO is (you are probably more familiar with this than I am), then some details on the implementation, and then a lot of benchmarks and results.

First of all, the three of us who did all the work are engineers, so we know very little about materials science. But I was told that materials scientists use first-principles simulations a lot to study the physical properties of materials, and when you want to study a large system, or you want to reduce the time to solution, you need to use a high-performance computing system.

If you look at what has happened in the last six to ten years in HPC, there are now a lot of new technologies, and among them GPUs are very popular. There are mainly three reasons. They are many-core processors, and when I say many cores we are talking about thousands of cores: very simple cores that can give you a very high FLOP rate. Volta, the latest generation, can perform DGEMM at almost seven teraflops. You also have a lot of memory bandwidth: it's HBM2, stacked on the GPU, and you can achieve more than 800 gigabytes per second, so we are talking about very interesting numbers. Also, looking toward exascale, GPUs are very energy efficient: if you look at the last three or four editions of the Green500, most of the time the top 15 systems are all GPU systems, and right now you can deliver almost 15 or 16 gigaflops per watt.

Together with the hardware there is also a lot of software. If you want to port something like Quantum ESPRESSO, you need math libraries, you need compilers, and you want profilers, debuggers, and so on. There are several ways to program GPUs: you can use programming languages like CUDA or, as in this case, CUDA Fortran, or you can use directive-based approaches like OpenACC. NVIDIA itself provides a bunch of math libraries, but there are also open-source projects like MAGMA, which has a very nice collection of math routines. And then there are profilers and debuggers, including commercial ones like TotalView and Allinea DDT.

So we decided to port Quantum ESPRESSO to the GPU. Quantum ESPRESSO is a huge code that can do many different things, so at the beginning we focused on PWscf, which is the main workhorse. We use CUDA Fortran, but we have a single source code that can be compiled for both CPU and GPU. We use kernel loop directives extensively: basically, when you have a nested loop, you can just put a directive in front of it and the compiler will generate the GPU kernel for you (I'll show a tiny example of what such a loop looks like in a moment). We also spent a lot of time doing validation and performance studies on a large variety of systems, both x86-based and POWER-based. In particular, we used a lot a machine that is now at Cineca, called DAVIDE, and I think Luca will talk about some other things. All the software we ported is open source; right now it's on a GitHub repo from Filippo. One of the objectives was to show best practices: we did a lot of very detailed optimization, and other codes can look at what we have done.

As I told you before, you are probably more familiar than me with Quantum ESPRESSO. It's a popular package that is widely used in academia and industry.
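Here is the tiny example of a kernel loop directive I mentioned. It is only a minimal sketch, not code from the actual port: the program name, array names, and sizes are made up for illustration.

! Minimal CUF kernel example: the directive asks the compiler to
! generate a GPU kernel for the nested loop that follows.
program cuf_kernel_example
  use cudafor
  implicit none
  integer, parameter :: n = 1024
  real(8), allocatable         :: a(:,:), b(:,:)
  real(8), allocatable, device :: a_d(:,:), b_d(:,:)
  integer :: i, j

  allocate(a(n,n), b(n,n))
  allocate(a_d(n,n), b_d(n,n))
  b   = 1.0d0
  b_d = b                        ! host-to-device copy via assignment

  ! do(2) maps both loop levels onto the GPU grid
  !$cuf kernel do(2) <<<*,*>>>
  do j = 1, n
     do i = 1, n
        a_d(i,j) = 2.0d0 * b_d(i,j)
     end do
  end do

  a = a_d                        ! device-to-host copy
  print *, 'a(1,1) =', a(1,1)    ! expect 2.0
end program cuf_kernel_example

One way to keep a single code base is to guard the device arrays and the directive behind the preprocessor so the same loop also compiles for the CPU; the actual port has its own scheme for this, this is just the general idea.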
We ported PWscf, which computes the Kohn-Sham orbitals and the energy of the material system. It uses an iterative method that looks for self-consistency between the input and output charge densities. If you look at what needs to be done, basically in each iteration you need to diagonalize a Hamiltonian; this is done using a block Davidson method, and it is done for each of the k-points, for the KS orbitals. Then there is a computation of the output charge density using the results of the diagonalization, and you repeat this multiple times until you reach the threshold that you decide.

We can use different parallelization strategies; these are the ones that are already in Quantum ESPRESSO. You can parallelize across k-points, so you distribute the k-points onto pools of processors. Then, only for the CPU version, you can also use parallel libraries to do the block Davidson diagonalization, either with ScaLAPACK or with ELPA.

In the past there was a plug-in for Quantum ESPRESSO written in CUDA C, done by Filippo and Ivan. We decided to write everything from scratch using CUDA Fortran. The code is in Fortran, so it is very easy to use CUDA Fortran, and in this way you can also have a single code base. With CUDA Fortran you have more control than with OpenACC, you still have some high-level programming directives, and it is easier to maintain than a mixed-language approach. CUDA Fortran requires the PGI compiler, and in the past that was an issue because Quantum ESPRESSO is open source and it would be nice to have an open-source compiler; but now PGI has a free Community Edition, so you don't need to buy the compiler, you can just download it and use it to compile and run Quantum ESPRESSO.

At the beginning I was telling you that there are a lot of tools, and profiling is very important when you are trying to get the best performance out of a system. When you are dealing with CPUs and GPUs you have more complexity than usual, because you need to understand the interaction between the CPU and the GPU: am I getting the right bandwidth from PCI Express or NVLink? Can I hide the data movement between the GPU and the CPU? And sometimes you may also find hotspots in the profiler that you were not expecting.

In particular, we heavily used the three tools that come with CUDA. nvprof can be used to generate kernel profiles and timing information; you generate a trace and then you can import it into the visual profiler, NVVP, and you can also add NVTX markers to the timeline. Quantum ESPRESSO has a nice timing system already in place, so we modified the timing system to inject all the NVTX markers that we needed. In this way you don't need to add extra code to Quantum ESPRESSO: you just use the normal start_clock and stop_clock calls and everything is done for you.

This slide is just an example of the timeline. When you hover over each one of these bars you get the name and all the information about what is running. In this situation you can see, for example, that you have a bunch of copies back and forth, which are probably the FFTs, and then there are some other things going on. The idea is that you start by attacking the largest bar, try to reduce its run time, then move to the next one, and so on.
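To give a concrete idea of what injecting NVTX markers through the timing system means, here is a rough sketch of the kind of ISO_C_BINDING wrapper a timing routine can call. This is an illustration, not the actual patch: the module name nvtx_markers and the helpers range_push/range_pop are made up, while nvtxRangePushA and nvtxRangePop are the real NVTX entry points, and the program has to be linked against libnvToolsExt.

module nvtx_markers
  use iso_c_binding
  implicit none
  interface
     ! nvtxRangePushA opens a named range on the profiler timeline
     subroutine nvtxRangePushA(name) bind(C, name='nvtxRangePushA')
       use iso_c_binding
       character(kind=c_char) :: name(*)
     end subroutine nvtxRangePushA
     ! nvtxRangePop closes the most recently opened range
     subroutine nvtxRangePop() bind(C, name='nvtxRangePop')
     end subroutine nvtxRangePop
  end interface
contains
  subroutine range_push(label)
    character(len=*), intent(in) :: label
    call nvtxRangePushA(trim(label)//c_null_char)   ! C string must be null-terminated
  end subroutine range_push
  subroutine range_pop()
    call nvtxRangePop()
  end subroutine range_pop
end module nvtx_markers

With a wrapper like this, a call to range_push with the clock label at the top of the timing start routine and a call to range_pop in the stop routine is enough to make every timed region of the code show up as a named bar on the NVVP timeline, which is the effect described above.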
If you look at the main operations done in Quantum ESPRESSO, there are basically three types. You have a lot of GEMMs, usually ZGEMMs, so level-3 BLAS. You have a lot of FFTs, which are typically distributed. And then you have the dense matrix diagonalization, which is done using either LAPACK, ScaLAPACK, or ELPA.

The first ones, the BLAS routines, are very easy to port: we have the cuBLAS library, so basically you just need to link the right library. The 3D FFTs and the dense matrix diagonalization require more work. And once you start porting everything to the GPU, even something that was not a bottleneck at the beginning becomes one at a certain point: when you go very fast on all the other things, the computation of stresses or forces becomes a real bottleneck. So in the end we moved all the code to the GPU; basically, once you are done, the CPU is utilized only in one part of the eigensolver, as I'll show you later.

For the 3D FFTs, we use the 1D FFTs that come from the NVIDIA library. They require a lot of transpositions, and therefore data communication, and in the current version there are a lot of MPI all-to-alls. And then there are many 3D FFT computations, one for each k-point and one for each band index. One thing we did: when you are dealing with a system with CPUs and GPUs, you want to overlap computation and communication as much as possible, and if you are using something like MPI_Alltoall that is difficult. So we changed the scheme so that we can do more overlap and some kind of pipelining of the data movement: the all-to-all communications are now done with non-blocking MPI_Isend and MPI_Irecv (I'll show a small sketch of the idea in a moment). We ran on a large variety of systems and, depending on the vendor (a Cray machine, an IBM machine, or a normal x86 cluster), there is something called CUDA-aware MPI, and sometimes it is not very efficient; the implementation from that particular vendor is not that good. So we did a lot of tuning, for example using CUDA IPC inter-process communication, to get the best performance you can get.

When we started looking at the diagonalization, there is a library that comes from NVIDIA, but the performance was very poor. There is also another library called MAGMA, the one coming from Jack Dongarra's group, but it relies heavily on the CPU, and when you have a system with a lot of GPUs you don't have a lot of CPU resources to use. So we wrote a new eigensolver from scratch. Right now it only works on a single GPU, but it is pretty fast: we are usually much faster than MKL, and in most cases we also beat ELPA. This is an example: this is MKL on 16 cores, this is MAGMA, this was the old library, and this is the new library we wrote. You can get a decent speedup, and I'll show you later in the results that we are usually faster than ELPA even on a very high-end system with a lot of cores.
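Here is the small sketch I mentioned of the pipelined exchange. It is only an illustration of the idea, with a made-up routine name and buffer layout rather than the actual FFT transpose code: one blocking MPI_Alltoall of equal-sized chunks is broken into per-rank MPI_Isend/MPI_Irecv pairs.

! Illustration only: a ring of non-blocking point-to-point messages
! replacing a blocking all-to-all of equal-sized chunks.
subroutine pipelined_alltoall(sendbuf, recvbuf, chunk, comm)
  use mpi
  implicit none
  integer, intent(in)    :: chunk, comm
  real(8), intent(in)    :: sendbuf(*)    ! chunk*nproc elements, ordered by destination rank
  real(8), intent(inout) :: recvbuf(*)    ! chunk*nproc elements, ordered by source rank
  integer :: nproc, me, p, dest, src, ierr
  integer, allocatable :: reqs(:)

  call MPI_Comm_size(comm, nproc, ierr)
  call MPI_Comm_rank(comm, me, ierr)
  allocate(reqs(2*nproc))

  do p = 0, nproc - 1
     dest = mod(me + p, nproc)             ! rank that gets my p-th chunk
     src  = mod(me - p + nproc, nproc)     ! rank whose chunk I receive
     call MPI_Irecv(recvbuf(src*chunk + 1),  chunk, MPI_DOUBLE_PRECISION, &
                    src,  0, comm, reqs(2*p + 1), ierr)
     call MPI_Isend(sendbuf(dest*chunk + 1), chunk, MPI_DOUBLE_PRECISION, &
                    dest, 0, comm, reqs(2*p + 2), ierr)
     ! the idea is to interleave host<->device copies here, so that data
     ! movement overlaps with the messages that are still in flight
  end do
  call MPI_Waitall(2*nproc, reqs, MPI_STATUSES_IGNORE, ierr)
  deallocate(reqs)
end subroutine pipelined_alltoall

Because each pairwise message can complete independently, chunks that have already arrived can be staged to or from the GPU without waiting for the whole exchange, which is what makes the overlap possible.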
So, moving to the benchmarking results. For this presentation we are going to talk about three different benchmarks, but we ran a lot of them, both to be sure that we were getting the right results and to find the hotspots in the code. The reference system is a system that Filippo has access to: a very high-end cluster with two top-of-the-line Broadwells per node, with 18 or even more cores each. And when he did the runs, he really tried all the possible combinations of ELPA, MKL, and different numbers of OpenMP threads and MPI ranks, so when you see the results, each one of the CPU points is actually the best configuration he found out of roughly 50 runs.

On the GPU systems, the eigensolver is always serial. We use cuBLAS for the BLAS, and then MKL or ESSL, depending on the host, for the CPU routines. On certain systems, to improve the bandwidth between GPUs, we use CUDA-aware MPI and custom IPC. On Intel systems we also enable OpenMP; you cannot do that on the IBM systems because OpenMP there is incompatible with the multi-threaded ESSL, so you have to decide between the multi-threaded IBM ESSL and OpenMP in the rest of the code, and we found that the multi-threaded ESSL was faster, so OpenMP is not used in the rest of the code on those systems.

Just to give you an idea of how different these systems can be: over here you have Piz Daint, the biggest machine in Europe. It is a very simple design: one CPU, one GPU (in this case a Pascal), and the Cray Aries network. This is a machine at Cambridge, Wilkes-2: one CPU and then four GPUs connected through a PLX switch; you can see that this can be a real bottleneck and you need to work around it. This is an NVIDIA DGX-1, a very dense system: eight GPUs and four NICs, connected with NVLink, a new technology that is a very fast bus. And this is the POWER system: the machine at Cineca called DAVIDE, and there is also a similar small machine at Oak Ridge. You have two POWER8 CPUs with Pascal GPUs; POWER8 has an NVLink connection to the GPU, so you can move data very fast, more than twice as fast as PCI Express, and then there are PLX switches as well. So we really tried to cover a wide range of configurations.

The benchmark cases we use: the first one is a very popular one coming from, I think, PRACE, a gold surface with 112 atoms and 2 k-points. The other one is tantalum pentoxide, which has a lot of k-points, 26. And then, to compare against KNL and also against SIRIUS (the presentation that comes after mine), we added a silicon-germanium case. This one is a vc-relax: in this computation you do a bunch of SCF simulations, you compute forces and stresses, you modify the geometry, and you keep iterating.

On the top line you can see the result from Filippo's effort. And over here you can see a bunch of systems: this is Piz Daint, this is the NVIDIA DGX-1, and when you see GDR it means that we took advantage of GPUDirect, in this case over NVLink, to communicate. Also on Wilkes-2 we try to avoid going through the CPU and stay as much as possible on the GPU side of the PLX. You can see that, in this case, increasing the number of pools usually gives you a better speedup: you get almost linear speedup on both simulations. When you just increase the amount of resources with a fixed number of pools, the scaling is less good. But nevertheless, you can see that eight GPUs are faster than 32 sockets, that is, 16 nodes versus a single machine. So you can get roughly a factor of two or three speedup, depending on the system.

We actually improved the code a little bit more since then, but we didn't have time to rerun all these possible configurations, so we just fixed, for example, the eight-GPU case, meaning eight GPUs and eight CPUs. Over here you can see the FFT, the eigensolver, and the rest. You can see that for this particular case our serial eigensolver is much faster than ELPA, and the FFT performance improves using GDR: the FFT time shrinks if you can use this extra bandwidth between the GPUs.
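As a side note on what using that GPU-to-GPU path involves in practice, here is a small CUDA Fortran sketch (a standalone check I wrote for illustration, not code taken from the port) of how one can query and enable direct peer-to-peer access between GPUs, which is the prerequisite for moving data GPU-to-GPU over NVLink or across a PLX switch instead of staging it through the CPU.

! Standalone check: for every ordered pair of GPUs, ask the runtime
! whether direct peer access is possible and, if so, enable it.
program p2p_check
  use cudafor
  implicit none
  integer :: ndev, i, j, can_access, istat

  istat = cudaGetDeviceCount(ndev)
  do i = 0, ndev - 1
     istat = cudaSetDevice(i)                        ! make GPU i the current device
     do j = 0, ndev - 1
        if (i == j) cycle
        istat = cudaDeviceCanAccessPeer(can_access, i, j)
        if (can_access == 1) then
           istat = cudaDeviceEnablePeerAccess(j, 0)  ! direct GPU-to-GPU path
           print *, 'GPU', i, '-> GPU', j, ': peer access enabled'
        else
           print *, 'GPU', i, '-> GPU', j, ': no peer access, staging through the host'
        end if
     end do
  end do
end program p2p_check

When peer access is not available, the fallback is to stage the buffers through host memory, which is exactly the slow path that the placement and IPC tuning mentioned above tries to avoid.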
The eigensolver, on the other hand, is slower on SummitDev and DAVIDE. One of the reasons is that ESSL does not expose an internal routine that we need; they only have a very high-level interface, and we need a solver for a tridiagonal system.

This is Ta2O5, and you see similar numbers. In this case you have a large number of k-points, so you can scale all the way up to 128 GPUs. And also here you can see that increasing the number of pools usually gives you a better speedup, while keeping the number of pools constant and increasing the resources gives you less. Like before, let's just focus on the 104-GPU case. Here, for example, we also have numbers for Volta, and you can see that basically, without changing your code, what used to happen in the Intel world 15 years ago, new CPU and free speedup, is now happening on the GPU: you go from 290 down to something like 197. SummitDev and Summit, because they also have NVLink to the device, actually have faster performance for the FFTs, but they are slower on the eigensolver.

Moving to the last case, the vc-relax. We also use this case to show accuracy. This comes from a repo from Pietro and Anton: they have a large set of benchmarks, and they have data from, for example, running the standard Quantum ESPRESSO on Broadwell nodes at CSCS. So these are 10 Broadwell nodes, these are 10 nodes of Piz Daint, and then there are also some numbers from Cineca using KNL. You can see that 10 P100s are more than twice as fast, almost three times the performance; this is the same number of Cray nodes, and they are much faster than KNL. And if you look at the energies, you can see that the results are pretty much spot on: the Fermi energy and the total energy are the same, and so are the total force, the total stress, and the pressure. And over here there is a breakdown to see exactly which part is going faster. We also did some comparisons with the SIRIUS GPU version. SIRIUS is also pretty fast, but our implementation right now is a little faster, on exactly the same simulation. And once again, I did a quick test on a Volta, and you can see that basically the code goes 50% faster just by putting in a new GPU.

So, to conclude. The new GPU implementation can give you a speedup of a factor of two to three. You can go toward exascale, but you can also bring a lot of computational power to your workstation: you can have a workstation with four GPUs and be as effective as with a small cluster, without having to manage the cluster, the power, and so on. The code runs on both x86 and POWER, and if there is compiler support for any other architecture, it will work there too. The custom GPU eigensolver is pretty competitive with the other options out there, and it is all open source, so you can go to GitHub and download it; I think Anton was trying to integrate it into SIRIUS. Depending on the topology of the system, sometimes you need to do a little bit of extra work, for example for peer-to-peer and other things. Everything is open source and available on GitHub, and I know there are plans to merge it with the main trunk. Right now this code only works with Quantum ESPRESSO 6.1; we didn't put in all the changes for 6.2, but it should be pretty easy, I believe. Thank you all for your help.