OK, next up: tell us something about OpenACC and GCC, and how that helps carry GCC forward. — OK, thanks for joining me here for this presentation. I guess this is the first compiler talk here today, and probably the last. Anyway, I hope not everybody is asleep yet on this second-day FOSDEM afternoon. I have a lot of material to present, so that should keep you awake, I hope. This project of adding OpenACC support to GCC is something that I've been working on for a little more than five years now, in our group at the company, a handful of people. So, a short agenda, which is basically what you've read in the talk listing. GCC doesn't need much introduction, I guess. OpenACC I will introduce briefly; of course, in these 25 minutes, everything I can only introduce briefly, so there would be a lot more to talk about, maybe at some later FOSDEM instance. Who knows? Then I will show the implementation status, some examples, some performance results, and a live demo at the end. And with live demos, you always hope that everything will work out. But yeah.

OK, so GCC, the GNU Compiler Collection; I've just put the link there. (Louder? OK, I will try to speak up.) That's a quote from Wikipedia: a compiler system produced by the GNU Project, supporting various programming languages, a key component of the GNU toolchain, and the standard compiler for most Unix-like operating systems. Of course, there's now also LLVM, research compilers, and some proprietary compilers. But GCC still plays a very big role in what's actually being used, even in high-performance computing, where you might assume that the proprietary vendor compilers play a bigger role. I'm not reading through all of this; this is just some motivation.
Up there, you see a survey that was done a few years ago at a high-performance computing center about the usage of compilers on that system, what users are actually invoking there. The GNU compilers, g++ for C++, gcc for C, and gfortran, make up more than 70% of all compiler invocations there. Of course, that's just one data point; other computing centers and other systems will show other results. But that's just motivation for why we're still talking about GCC, that old dinosaur that is just not going away.

So then, we're adding OpenACC support. OpenACC exists to use the compute power that you have in accelerator devices such as GPUs, so I should briefly cover the GPU architecture. This is a block diagram of an NVIDIA GPU that is five or six years old, I think, the Kepler K20. Of course it's old, but the general concepts are still the same now, also for other GPU vendors such as AMD, for example. I'm talking a lot about NVIDIA GPUs here because that's what we initially did this project with, so that's what I'm personally most familiar with. You see these 15 or so blocks, which they call streaming multiprocessors, and inside each of them there are yet more, smaller blocks. The green ones here are your actual compute cores: basically processors that can do simple arithmetic instructions, memory accesses, that kind of thing. Each of these bigger blocks has its own private memory, and of course there's a big GPU global memory space of several gigabytes or whatever you have. In one system you can have several GPUs, and from those you can construct huge systems, which are then used for weather-forecast computations, simulations, stuff like that. Our job now is to take some big computation that we have, in C, C++, or Fortran in this case, and somehow map all the loop iterations that you have onto all these many compute cores, to keep them busy and harness the compute power that you have there.
More abstractly, a GPU looks like this: you have several multiprocessors, and in each of them you have some shared memory (local memory, whatever you call it), then a huge register set, and several processors.

Then OpenACC: that's an open standard. How many of you are familiar with OpenACC in any way? Raise your hands, please. OK, a bunch. And how many with OpenMP? OK, a lot more, which makes sense, because OpenMP has been around much longer. OpenACC conceptually, and mostly also syntactically, builds on OpenMP: it likewise uses directives, as you've seen in OpenMP, and the principle is the same. You take your existing source code, your Fortran 77 code from the '70s, add some pragmas, some directives, and you can run that code as you did before, but it will just run much faster. That's what I say here: you mark up regions for parallel and vector loops. You have to do some memory management: as I said, GPUs typically have memory spaces separate from the host, from the CPU, so you have to copy your data back and forth. There are some more special things, reduction operations and a lot more, of course. All these directives are hints, or instructions, to the compiler for how to map your source code onto these many compute cores that you have on the GPU or another accelerator. OpenACC is not a language itself; it's an extension to C, C++, and Fortran, and it tries to be abstract enough that it applies to basically all the parallel accelerators that are out there.

So, one quick example: matrix multiplication, which you may have seen before. Of course, that's just a simple example. That's your original serial code, which works, which has been tested for decades. Then you have several compute constructs available in OpenACC. One is the parallel construct; the red lines are what I'm adding here to run this with the OpenACC parallel construct.
In this example, it's a lot that I have to add, but this is just showing the one hot loop; all the rest of your existing program is not touched by the OpenACC annotations. Going over this briefly: you start with #pragma acc parallel, which says that the following region, a structured block in C/C++, is to be offloaded to some accelerator, and something is to be run in parallel in there. Then you have these three #pragma acc loop directives nested inside each other, according to the for loops that you have. That's because in OpenACC you have several levels of parallelism, which map to the various building blocks of the GPU architecture. For example, the outer gang level would map to queuing these computations onto several of the bigger compute blocks that you have on the GPU. The innermost one, the vector loop, is similar to a CPU vector, just that the vector width is much bigger; on NVIDIA GPUs, for example, it's 32-element vectors. In between, you have the worker level of parallelism, which is essentially a group of vectors. OK, so that's a lot you have to add in this example, but again, this is just showing the one hot computation loop that you have in your program; all the rest stays just as it was before.

All right, and you have the data copy clauses up there: copy in the A and B arrays, and copy out the C array. C is where the results are being stored, so it doesn't need to be copied to the GPU, because everything in it will be overwritten. And the A and B arrays are only read inside this region, so they don't need to be copied back from the GPU to the host after this loop executes.

Then we have the OpenACC kernels construct, which is an alternative compute construct to the parallel construct. You'll see that here I have not put in all these acc loop directives; you don't need them in the kernels construct.
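The annotated matrix multiplication described above might look roughly like this. This is a minimal sketch, not the exact code from the slides; the array size N and the function name are my own choices. Compiled without -fopenacc, the pragmas are simply ignored and the loops run serially; with -fopenacc, the region is offloaded (or executed on the host as a fallback).

```c
#define N 64

/* Serial matrix multiplication c = a * b, annotated with the OpenACC
   parallel construct and the three levels of loop parallelism
   (gang / worker / vector) discussed in the talk.  */
void matmul(float a[N][N], float b[N][N], float c[N][N])
{
    /* copyin: a and b are only read on the device;
       copyout: c is overwritten entirely, so no copy to the device.  */
    #pragma acc parallel copyin(a[0:N], b[0:N]) copyout(c[0:N])
    {
        #pragma acc loop gang
        for (int i = 0; i < N; i++) {
            #pragma acc loop worker
            for (int j = 0; j < N; j++) {
                float sum = 0.0f;
                #pragma acc loop vector reduction(+:sum)
                for (int k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
        }
    }
}
```

Note how the reduction clause on the innermost loop lets the vector lanes safely accumulate into the scalar sum.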
Here it's the job of the compiler to figure out which loops can be run in parallel and how they should be parallelized, so which level of gang, worker, or vector parallelism to apply, and, the hardest job, to figure out whether they can be parallelized at all, or whether there are any data dependences and things like that. That of course needs more intelligence from the compiler. As I write here, GCC does some things, some cases work, but in general there's more work for us to be done there. So if you're going for performance, you should, for now, use the parallel construct with GCC.

Then, the status in upstream GCC. Five years ago we started with the OpenACC 2.0 specification, and then a few years later 2.5 came out, which is mostly supported in upstream GCC. 2.6 is in our development branch, and 2.7 was just released a few months ago at the Supercomputing conference; we have not yet started working on that. We have not implemented all of the OpenACC specification: there are some features in there that users are evidently not using very much, and instead of spending time on those just to claim complete support for the specification, we'd rather focus on performance tuning of the things that users are actually using. We support code offloading to NVIDIA GPUs; as I mentioned, that's what this whole project started with, and the maintainer of the nvptx back end, Tom de Vries, now working at SUSE, is actually in the room here. Then we have AMD GPU support, which was done recently as a separate project, very much building on top of the stuff that we did before. So they essentially had to write a new back end in GCC for AMD GCN code generation, and they had to write some of the runtime library code that talks to the GPU compute stack, how to get the code onto the GPU so that you can launch it there, do the memory-mapping setup, that kind of thing; but that's basically all they had to do.
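For contrast, the kernels variant of the same loop nest might look like the sketch below (again with assumed names and sizes). There are no explicit acc loop directives; the compiler analyzes the loop nest for data dependences and chooses the gang/worker/vector mapping itself, which is exactly the analysis work described above:

```c
#define N 64

/* Same matrix multiplication, but using the OpenACC kernels
   construct: the compiler decides how, and whether, to
   parallelize each loop in the region.  */
void matmul_kernels(float a[N][N], float b[N][N], float c[N][N])
{
    #pragma acc kernels copyin(a[0:N], b[0:N]) copyout(c[0:N])
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}
```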
Other targets could be supported as well, including multi-threaded CPU, which some other OpenACC compilers support, for example. What we've done is very generic, not specific to a particular host system, so we can support mostly anything that has drivers to talk to some accelerator device. We test on x86 and powerpc64 little-endian hosts. Right, and if you're interested in helping, just send patches; I'm the maintainer of the OpenACC support in GCC, so I will review them, hopefully. Or talk to us if you need support, services, things like that. We're a services company, so I'm mentioning this: it's difficult to sell a compiler that is available as free software, but of course you're all familiar with that problem, I suppose.

OK, then on to a project that we did last year: the LSDalton application, chemistry, simulation, calculation of molecular properties. That's about all I know about it; I have not looked much into what it's actually doing at that level. I have just been given, or we as a group have been given, this application, with the goal of tuning the performance that you get with GCC OpenACC on NVIDIA GPUs, comparing against the PGI compiler, which is kind of the standard you would use with NVIDIA GPUs. The Portland Group, makers of the PGI compiler, was acquired by NVIDIA a while ago, so they have all the in-house knowledge. Right, and well, I made this comment earlier about Fortran 77, so here we have it: the history of the Dalton application starts in the fall of 1983, when I was one year old, and I guess a lot of the actual simulation code is still from that age, and it's Fortran. A lot of scientific simulation code is still in Fortran.

OK, this is just a quick overview of what we're talking about here with LSDalton; just look at the yellow line. It's 800,000 source lines of code; for comparison, GCC, without all the test suite, has about three and a half million lines. But still, it's a huge application, so there's no way we're going to work through all of that.
It includes several external sub-modules and has a non-trivial build system, as you would guess for an application of that size. It works with the PGI compiler, obviously; it also works with GCC, but it had some problems passing through this one command-line flag, -fopenacc, which you have to specify to enable OpenACC processing, so we had to figure out how to do that. Then we replaced some of the code in the application, in agreement with our customer, because we wanted an apples-to-apples comparison: as I said, the kernels construct support in GCC is not yet as good as for the parallel construct, so we rewrote those regions. Also, the LSDalton build system did something clever, which is to link against optimized vendor libraries for the mathematical computations, but only for the PGI compiler, because that's what it had been set up for. We replaced those calls with the source code of these BLAS functions and annotated them with OpenACC directives; that's in a way similar to the matrix-multiplication example that I showed earlier. And there was some strange OpenACC directive usage, which other compilers apparently supported, or ignored, or whatever, so we replaced that too. Then we had several cycles of profiling, analyzing, and tuning. Tuning here means not changing the LSDalton source code any further, but teaching GCC to do more clever things; and we reported a very few issues to NVIDIA. PTX is what we're targeting: PTX is an intermediate language which at run time gets just-in-time compiled for the actual GPU hardware that you have. You see here the baseline GCC execution time for this one example that we tuned: it's around 230 seconds, and the PGI compiler takes a little more than 100 seconds. So much better, obviously, but we're in the same region here; it's not an order of magnitude that we are slower. And after several cycles of tuning GCC's code generation, we were equal to the PGI compiler; two seconds better, actually. But again, that's just this one example.
But this shows that GCC is up for the task of being usable for such scientific computations. Of course, that doesn't make the PGI compiler obsolete or anything, but it was a great success for us; you can read more about it in this blog post.

Then, a real-world example: an N-body simulation, which simulates a set of N individual bodies, like stars in the universe, or particles, or whatever, with distance-dependent forces between each pair. The problem is to calculate the trajectory of each body, and you can understand that if there's a force between each pair, that's a lot of computation as your problem size grows, so a GPU can be very helpful here.

And, oh right, there's one slide missing before I show the demo. I'm showing this with the GCC 8 compiler as shipped by Ubuntu 18.04, so roughly one year old. The maintainer of these packages, Matthias Klose, doko, is here; he wanted to see that. He has done the packaging of that GCC stuff, so it's available in your Debian and Ubuntu packages; you just have to install the gcc-8-offload-nvptx package and its dependencies. Obviously, Debian and Ubuntu are not shipping our latest, greatest development-branch builds, but rather the stable GCC release branches. So that's something to try out; but if you're looking for performance, or for the latest features, then you will have to look at our current development branch, which is also publicly available, or talk to us about binary releases of GCC, which we could also get you.

Right, but now the demo, the live demo; either it works or it doesn't, and hopefully it does. Time's up, but this will be quick. The laptop is more than five years old; it has a very powerful CPU, but the GPU is not so powerful. Again, I'm just showing the hot computational loop, which is this "compute body force" function: an update directive that moves memory, then we have this parallel construct here, and in there we have a loop construct doing the actual body-force incremental computation. Then I can
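The hot loop just described might look roughly like this. This is a hedged sketch, not the demo's actual source: the function and field names, the softening constant, and the simple inverse-square-style force law are all my own assumptions, standing in for whatever distance-dependent force the demo uses.

```c
#define SOFTENING 1e-9f  /* keeps the i == j term from dividing by zero */

typedef struct { float x, y, z, vx, vy, vz; } Body;

/* For each body, accumulate the distance-dependent force exerted by
   all other bodies, then update its velocity.  The pragma mirrors
   the structure described in the talk: a parallel construct with a
   loop directive over the bodies.  Without -fopenacc the pragma is
   ignored and this runs serially on the CPU.  */
void body_force(Body *p, float dt, int n)
{
    #pragma acc parallel loop copy(p[0:n])
    for (int i = 0; i < n; i++) {
        float fx = 0.0f, fy = 0.0f, fz = 0.0f;
        for (int j = 0; j < n; j++) {
            float dx = p[j].x - p[i].x;
            float dy = p[j].y - p[i].y;
            float dz = p[j].z - p[i].z;
            float d2 = dx * dx + dy * dy + dz * dz + SOFTENING;
            fx += dx / d2;  /* simple 1/d^2 attraction, for illustration */
            fy += dy / d2;
            fz += dz / d2;
        }
        p[i].vx += dt * fx;
        p[i].vy += dt * fy;
        p[i].vz += dt * fz;
    }
}
```

The outer loop over bodies is embarrassingly parallel (each iteration only writes its own body), which is why this pattern maps so well onto a GPU.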
build that, using g++-8, the GCC 8 C++ compiler available on the system. It's building, and then I can run it. There will be two windows up here: one is executing the host executable, the CPU executable, and one is executing the OpenACC-accelerated one, and I hope it's easy to guess which is which. And that's the end of my talk. Any questions, if we still have time for them? OK, thank you.