This is Anton from CSCS, a software developer who works closely with the community and is very active; he will present a different approach to the use of accelerator computing than domain-specific libraries.

I would like to thank everyone for giving me the opportunity to explain CSCS's vision on porting codes to GPUs, and, following the description of the talk, it is about using libraries. As you know, at CSCS we operate one of the largest supercomputers in the world, the top supercomputer in Europe, with more than 5000 nodes equipped with P100 GPUs and a performance of roughly 20 petaflops. So it is a large installation, and if you look at the node design, it is very simple, as Massimiliano mentioned: it consists of a CPU connected to its memory and a GPU connected to its own, very fast memory. And the problem is right here: there is a very slow connection between the CPU and the GPU. Of course we want our users to fully utilize the GPUs, because that is the main source of compute power; the card gives roughly four teraflops of performance. Unfortunately, with GPU cards it is not easy. As Calvo said today, there is no magic silver bullet, so you cannot just snap your fingers, recompile the code and run it on the GPU; you have to work.

We realize that in any code that goes through the phase of refactoring and porting to GPUs, the developers have to do several steps. They have to clean up the code, they sometimes have to rearrange the modules, they may or may not change the data layout, depending on the code. They have to prepare for the single node, so they have to squeeze the maximum out of the CPUs first, and only when they see that the bottleneck is in the CPU computation can they start moving things to the GPU. So it is a long, long path, and if we talk about codes with hundreds of thousands of lines of Fortran, it can become a real challenge.

That is one challenge; the second challenge comes at the level of porting itself. People come to a hackathon ready to port their code, and there they have the possibility to use CUDA Fortran or CUDA C, which give you the most control over the card, but you need CUDA expertise. Or they can use OpenCL, OpenACC or OpenMP; the latter two are pragma-based, which makes them easier to use, but you do not have full control over the card or over what you do. OpenCL gives you performance portability, but I think OpenCL is not well supported by NVIDIA: at least, only version 1.1 is available there, while the current version is OpenCL 2.2, I think.
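To make the contrast between these options concrete, here is a minimal sketch, not taken from any of the codes discussed in the talk, of the same scaled vector addition written once with an OpenACC pragma and once as a CUDA kernel. The pragma version needs almost no restructuring, while the CUDA version gives explicit control over data placement and launch configuration.

```cpp
// Sketch only: the same operation, first with an OpenACC pragma, then as a
// CUDA kernel. All names are illustrative; compile e.g. with nvc++ -acc -cuda.
#include <vector>

// Pragma-based version: the loop stays as it is; the directive asks the
// compiler to offload it and to manage the data movement.
void daxpy_openacc(int n, double a, const double* x, double* y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// CUDA version: the developer writes the kernel and chooses the launch
// configuration explicitly; more work, but full control over the card.
__global__ void daxpy_kernel(int n, double a, const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void daxpy_cuda(int n, double a, const double* x, double* y) {
    // x and y are assumed to be already resident in GPU memory here.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    daxpy_kernel<<<blocks, threads>>>(n, a, x, y);
}
```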
So there are issues with all of these options, and users get into a panic mode and say: okay, we do not want to deal with this, let's stay with CPUs. That is one reason. The second reason why we still have this situation: GPU systems have been in production for maybe six years already, and our electronic structure codes should benefit a lot from GPUs, because we do a lot of linear algebra and GPUs are designed for dense linear algebra, and still this does not happen, I think.

If we draw the relations between users, code developers and computational scientists, and between supercomputer centres and codes, we can draw some of them as green lines. These green lines mean that users have good relations with computational scientists: we answer tickets, we explain how to run the codes. Users have good relations with code developers; code developers know the code; computational scientists know the computer. That is all good. The rest can be green, orange or red: sometimes there is a good relation, sometimes not. Sometimes computational scientists cannot digest the full codes, because they are thousands of lines of Fortran that you cannot decipher easily, and so the codes do not run efficiently, or do not always run efficiently, on modern architectures. We think that is the issue.

To give an example not related to condensed matter: at CSCS we had a good experience with the MeteoSwiss code. It was the same situation: hundreds of thousands of lines of Fortran code, the computational scientists could not understand the full code, the full code could not run on the GPUs, and the MeteoSwiss code developers could not afford to spend their time porting it to GPUs. So what they did was to split the code into two pieces: one doing the stencil operations and one doing the physics. The physics part was a huge piece of code, but it did not consume much compute time, and it was kept in Fortran with OpenACC pragmas. The stencil part, where 90% of the compute time was spent, was ported to GPUs with CUDA, with very highly optimized backends (a minimal sketch of this separation-of-concerns pattern is shown a bit further below). That allowed MeteoSwiss to switch from a CPU production code to a GPU production code, to increase the density of the grid, and to do forecasts in a much better way; and Switzerland is not a flat country, so they need a very precise forecast.

Coming back to our electronic structure codes: I think most of the electronic structure codes we are dealing with fit into this table, where we either treat the basis as periodic Bloch functions, plane waves or similar, or as localized orbitals, and where we have either a full-potential or a pseudopotential treatment. We are interested in the plane-wave codes; historically we are interested in this segment where, for example, VASP is used heavily in the United States to run the Materials Project, and Quantum ESPRESSO is used in the MARVEL project to generate a lot of results; Quantum ESPRESSO is a super popular code. The green entries are the codes that are fully or partially GPU-enabled. After six years some of the codes have GPU support, and some of them, I can probably highlight Quantum ESPRESSO, are in an intermediate state of the transition; but some of the codes are still not GPU-enabled, which is sad news.
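As promised above, here is a minimal, hypothetical sketch of that separation-of-concerns pattern: the performance-critical stencil goes behind a small library interface with interchangeable backends, while the physics code only ever calls the interface. The names (`apply_laplacian`, etc.) are illustrative and not taken from the actual MeteoSwiss software.

```cpp
// Hypothetical sketch of a stencil library interface with pluggable backends.
// The "physics" side of the application only sees apply_laplacian(); whether
// it runs on the CPU or on a GPU is decided inside the library.
#include <vector>
#include <cstddef>

// Reference CPU backend: a simple 5-point Laplacian on an nx-by-ny grid.
void apply_laplacian_cpu(const std::vector<double>& in, std::vector<double>& out,
                         std::size_t nx, std::size_t ny) {
    for (std::size_t j = 1; j + 1 < ny; ++j)
        for (std::size_t i = 1; i + 1 < nx; ++i)
            out[j * nx + i] = in[(j - 1) * nx + i] + in[(j + 1) * nx + i] +
                              in[j * nx + i - 1] + in[j * nx + i + 1] -
                              4.0 * in[j * nx + i];
}

// Single entry point used by the physics code.
void apply_laplacian(const std::vector<double>& in, std::vector<double>& out,
                     std::size_t nx, std::size_t ny) {
    // A GPU build of the library would call an optimized CUDA backend here,
    // behind exactly the same signature, so the physics code never changes.
    apply_laplacian_cpu(in, out, nx, ny);
}
```

The point is not the stencil itself but the boundary: once the hot loops live behind a stable interface, the backend can be rewritten for a new architecture without touching the hundreds of thousands of lines of physics.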
So why are we interested in these codes? Because they are fast. And why are we interested in the full-potential codes? Because they are precise: if you remember, there is the DFT Delta effort that compares the total energies computed by different codes, and it takes WIEN2k, a full-potential LAPW code, as the reference. This class of codes is slow, but it delivers the best precision.

In a couple of slides: what is a pseudopotential code? It is a code that works with a regular grid: the unit cell is mapped onto a regular grid, all the functions are expanded in plane waves, and the potential is replaced by a pseudopotential. So the basis is a plane-wave basis and the expansion is in plane waves. What is good and not so good about pseudopotential codes? There is an approximation to the atomic potential, so you have to be careful how you pick your pseudopotential, and there are no core states, so if you want to do core spectroscopy you have to be careful there. The number of basis functions is typically around 1000 per atom, and the number of valence states is only about a percent of that, which means you can build efficient iterative subspace diagonalization schemes for pseudopotential codes, which is very nice. It is also very good that you can easily compute atomic forces and stress, so you can easily do molecular dynamics with a pseudopotential code.

The full-potential LAPW codes, like FLEUR (which is part of the MAX project), WIEN2k or exciting, partition space into non-overlapping spheres. Inside the spheres they use a spherical expansion and outside the spheres they use a plane-wave basis: outside it is a plane wave, and inside the sphere it is a linear combination of atomic-like solutions with matching coefficients that match the plane wave at the sphere boundary, and the same holds for the density and the potential (the basis functions are written out explicitly a bit further below). There is no approximation to the potential, the potential is exact, and that is why it is called full potential. This is good; also, the core states are there, you can explicitly put all the core states into your basis, so you can do core spectroscopy. The number of basis functions is typically 100 per atom, so it is a much smaller basis, and the number of valence states is of course larger. The problem is that the overlap matrix has a very bad condition number, so you cannot easily build an iterative scheme and you have to do a full diagonalization. That is the problem of full-potential, or LAPW, codes: you have to diagonalize the Hamiltonian matrix explicitly. You can compute forces without much trouble, so you can move atoms around, but you cannot compute stress easily; I think FLEUR may be one of the codes that computes the stress analytically, otherwise you have to use numerical schemes, a 4- or 5-point interpolation, where you compute several ground states to get the derivative of the total energy with respect to strain. That is why a full-potential code cannot easily be used to do molecular dynamics, but it can at least be used to verify your results.

When we analyzed these two classes of codes, we discovered that a lot of things are common to both: the FFTs, the plane-wave machinery, all the indexing of the localized functions, the symmetrization of quantities by the space group, and so on. So when we talk about porting codes to GPUs, we decided: okay, there is a lot of common stuff here, and if we want to separate the concerns, let's do it in a library, not directly in the exciting code or directly in the Quantum ESPRESSO code.
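To make the contrast between the two basis sets concrete, here are the standard textbook forms of the expansions mentioned above; this is general plane-wave and LAPW notation, not a formula taken from any particular code.

```latex
% Plane-wave (pseudopotential) expansion of a Bloch state:
\psi_{n\mathbf{k}}(\mathbf{r}) =
  \sum_{\mathbf{G}} c_{n\mathbf{k}}(\mathbf{G})\,
  e^{i(\mathbf{G}+\mathbf{k})\cdot\mathbf{r}}

% LAPW basis function: a plane wave in the interstitial region, matched at the
% muffin-tin sphere boundary to atomic-like radial solutions u_\ell and their
% energy derivatives \dot{u}_\ell inside each sphere \alpha:
\varphi_{\mathbf{G}+\mathbf{k}}(\mathbf{r}) =
\begin{cases}
  e^{i(\mathbf{G}+\mathbf{k})\cdot\mathbf{r}},
    & \mathbf{r} \in \text{interstitial},\\[4pt]
  \sum_{\ell m}\big[A_{\ell m}^{\alpha}(\mathbf{G}+\mathbf{k})\,u_{\ell}^{\alpha}(r)
    + B_{\ell m}^{\alpha}(\mathbf{G}+\mathbf{k})\,\dot{u}_{\ell}^{\alpha}(r)\big]
    Y_{\ell m}(\hat{\mathbf{r}}),
    & \mathbf{r} \in \text{sphere } \alpha.
\end{cases}
```

The matching coefficients A and B are fixed by requiring the two pieces to join smoothly at the sphere boundary, which is exactly the matching referred to above.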
That is why the SIRIUS library was born: to help the Quantum ESPRESSO and exciting codes get access to GPUs. What do we want to do? We want to modify the codes only a little, not as much as you would with CUDA Fortran; we want the impact on the codes to be as small as possible. So we want to inject some calls to an API that allows us to run the codes on GPUs without heavy modification. The question is where to draw the line: what goes into the library and what stays in the code. Actually, everything the original code has stays there, that is not a problem; we just have a switch that enables GPU execution through the SIRIUS library. At this stage we decided to encapsulate the DFT ground state, no more, no less. So you can talk about this as a quantum engine: you can easily create a quantum engine based on this library, ask the library to solve the ground state, and it will return the wave functions, the charge density and the potential. Several things are implemented for the full-potential part and for the pseudopotential part, and we are in the process of improving the functionality: we want to add LDA+U to the pseudopotential part of the code, and the same for the full-potential part, and we want a better API with Quantum ESPRESSO.

The code itself is a set of classes; we build DFT-like objects that represent our DFT cycle. We start from simple classes, like the representation of the G-vectors, the FFT driver or the wave functions, and then we work up to the top, where we have high-level classes that solve a whole problem, such as the DFT ground state. I would also like to stress that we try to write documentation: not only what is implemented, which is always possible, but how and why things are coded the way they are, and we sometimes try to give a small mathematical script that you can run in a notebook to reproduce the formulas that are coded. The code is documented with Doxygen, which lets you browse and navigate the class diagrams of the code.

The way we now interact with Quantum ESPRESSO is that we wait for updates on the main Quantum ESPRESSO repository; once these updates are pushed there, we fetch them from time to time into our branch. So we mirror the Quantum ESPRESSO code and we maintain our branch with a patched version where we put our changes, and we stay synchronized with the main Quantum ESPRESSO trunk: if a new feature appears there, it will soon appear in our branch as well. Because we provide an interface, we have to be a little careful about how we initialize things: Quantum ESPRESSO needs to start up by itself, it reads the input file and prepares a lot of things, the beta projectors and the radial functions, so we have to wait for it. Once Quantum ESPRESSO is done with its initialization, we can initialize SIRIUS, and then we can start setting and fetching the objects. For the ground state we currently do not encapsulate the whole thing; we split it into pieces. For example, SIRIUS solves the band problem, the heaviest part, and generates the density, while the symmetrization, the mixing and the generation of the effective potential happen on the Quantum ESPRESSO side.
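A minimal sketch of how such a split SCF loop can look from the host-code side is shown below. This is not the actual SIRIUS or Quantum ESPRESSO API; all names (`Engine`, `solve_band_problem_and_generate_density`, and so on) are hypothetical and only illustrate the division of responsibilities described above: the engine handles the band problem and the density, the host code keeps symmetrization, mixing and the effective potential.

```cpp
// Hypothetical quantum-engine interface, illustrating the split described in
// the talk. None of these names belong to the real SIRIUS or QE APIs.
#include <vector>

struct Density   { std::vector<double> rho;  };
struct Potential { std::vector<double> veff; };

class Engine {                       // the "quantum engine" (library side)
public:
    void set_effective_potential(const Potential& v) { veff_ = v; }
    // Heaviest step: the band problem (iterative or full diagonalization,
    // possibly on the GPU), followed by construction of the new density.
    Density solve_band_problem_and_generate_density() {
        return Density{std::vector<double>(veff_.veff.size(), 0.0)};
    }
private:
    Potential veff_;
};

// Host-code responsibilities, kept in the original application (stubs here).
Density   symmetrize(const Density& rho)                      { return rho; }
Density   mix(const Density& old_rho, const Density& new_rho) { return new_rho; }
Potential generate_effective_potential(const Density& rho)    { return Potential{rho.rho}; }

int main() {
    // The host code has already read its input and prepared its internal data
    // (projectors, radial functions, ...) before the engine is initialized.
    Engine engine;
    Density rho{std::vector<double>(100, 1.0)};

    for (int iter = 0; iter < 20; ++iter) {                   // SCF cycle
        Potential veff = generate_effective_potential(rho);   // host side
        engine.set_effective_potential(veff);                 // hand over to engine
        Density rho_new = engine.solve_band_problem_and_generate_density();
        rho_new = symmetrize(rho_new);                        // host side
        rho     = mix(rho, rho_new);                          // host side
    }
    return 0;
}
```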
So in this case we are not responsible for the symmetrization or for the exchange-correlation potential.

Now a couple of examples. The first is a dilute silicon-germanium alloy which, as Simeon explained, was run on our Piz Daint machine on one, two, five and ten nodes, because the number of k-points is ten, so it fits perfectly onto one, two, five or ten nodes. We took our hybrid nodes, Broadwell nodes and KNL nodes. This is the baseline of the normal Quantum ESPRESSO, and that is the SIRIUS-enabled Quantum ESPRESSO on CPUs; the result is more or less the same, with small variations, because the number of operations is more or less the same. That is the KNL version, and in this example KNL is more or less the same: KNL is not actually faster, while in theory it has two times more flops than a Broadwell node, so in reality KNL works like a Broadwell node. And that is the GPU result: the GPU code is of course faster than a Broadwell node, which is good.

The second example is a bit larger: a 288-atom unit cell of a platinum cluster in water, run on the same setup of P100 hybrid nodes (with Haswell CPUs), Broadwell nodes and KNL nodes. The number of k-points here is two, and k-point parallelization was used, which is why the sensible node counts are two times nine, two times 16 and two times 25, to build a square process grid. That is what normal QE is doing, that is QE with the SIRIUS library, that is QE on KNL, and here KNL is worse than the Broadwell node; and that is QE with SIRIUS on the GPU, which again is roughly two times faster than the Broadwell node.

The last example is for the exciting code, which is a full-potential code. For this example we again compiled exciting with SIRIUS, and as the test we picked a 96-atom unit cell of a manganese-based metal-organic framework. This system has 24 k-points, so it maps perfectly onto 6, 12 or 24 nodes, and we ran it on the hybrid nodes and on the CPU nodes. This is the result of the normal exciting code on the CPU partition: because there are 24 k-points, the canonical exciting code can run on at most 24 ranks, that is the maximum it can do, and I can place these ranks as one rank per socket, with 18 cores, or one rank per node. For the normal exciting code, one rank spanning 36 cores in two sockets does not work very well, so the best time to solution an exciting user will get is on 12 nodes, and it will take around six to seven hours (the plot is in minutes). That is the best the exciting user can get. If he uses the same setup but with SIRIUS, he gets a faster result; but SIRIUS also provides a fully parallel solver, so the user can go parallel in the diagonalization as well and grab more nodes, and on a CPU partition with 260 nodes the user gets 44 minutes to solution, where before it was six hours, which is already much better. And if we switch on the GPU code, the GPU code is roughly seven times faster; we use MAGMA for the diagonalization, and the best time to solution for this case is 26 minutes on 24 nodes. For LAPW it makes a lot of sense to use GPUs, because the MAGMA library, which is doing the diagonalization, is amazing here: a matrix of size 16,000 is diagonalized in maybe 50 seconds, which is incredible.
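The node counts above come directly from k-point parallelization: the k-points are distributed over groups of ranks, and inside each group the work on one k-point (for example the diagonalization) is parallelized further. Below is a minimal sketch of this two-level splitting with MPI; it is a generic illustration, not code from SIRIUS, Quantum ESPRESSO or exciting, and the numbers are just the exciting example from above.

```cpp
// Generic two-level k-point parallelization sketch (not from any real code):
// the world communicator is split into groups; each group owns a subset of
// the k-points and parallelizes the work on them internally.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    const int num_kpoints       = 24;   // e.g. the metal-organic framework case
    const int num_kpoint_groups = 12;   // assumed to divide world_size evenly

    // Ranks with the same "color" end up in the same k-point group.
    int color = world_rank / (world_size / num_kpoint_groups);
    MPI_Comm kpoint_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &kpoint_comm);

    // Round-robin assignment of k-points to groups; within kpoint_comm the
    // band problem for each assigned k-point would be solved in parallel,
    // e.g. with a distributed or GPU-accelerated eigensolver.
    for (int ik = color; ik < num_kpoints; ik += num_kpoint_groups) {
        int grp_rank;
        MPI_Comm_rank(kpoint_comm, &grp_rank);
        if (grp_rank == 0)
            std::printf("group %d handles k-point %d\n", color, ik);
    }

    MPI_Comm_free(&kpoint_comm);
    MPI_Finalize();
    return 0;
}
```

This is why the benchmark node counts are divisors or small multiples of the number of k-points: each k-point group gets the same number of nodes, and the work per group stays balanced.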
So GPUs are good, and that is our message. Thank you for your attention.