Okay, cool. Thanks, Kenneth. Hi, hello, my name is JP Lehr and I'm a Member of Technical Staff in software development on the ROCm OpenMP compiler team here at AMD. Today I want to talk about AOMP, the AMD open source OpenMP compiler for AMD GPUs. First, as a disclaimer: the information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, or typographical errors, and may change. The information is provided as is. Just to get that out of the way.

In this presentation, I first want to give a high-level overview of the ROCm software stack before we actually look into AOMP and its software dependencies. Then we'll dive a little deeper into how the AOMP compilation and linking process works and what the AOMP architecture looks like, to finally walk through a pretty simple example of OpenMP target offload code and what the compiler and the runtime actually need to do.

So let's start with the ROCm software stack. ROCm is the open software platform for GPU compute. The platform is built for flexibility and performance and gives you access to the open compute languages, compilers, libraries, and tools designed to accelerate code development and solve the toughest challenges in the world today — according to the website, anyway. With all these components, ROCm is a pretty complex software ecosystem, and we're first going to look at how these components are organized and at what you would actually install if you wanted to set up the ROCm software stack. At this point, a brief disclaimer: I am not a packaging engineer for the ROCm software ecosystem, but I want to give a quick walkthrough of the actual packages that are provided for installation. Since I'm working on the OpenMP compiler, I will focus on the OpenMP part.

This slide gives you an overview of the different meta packages that you can install from the operating system packages. Right at the bottom we have the hardware platform, for example AMD Instinct accelerators. On top of that we have the actual operating system, for example Red Hat, openSUSE, or Ubuntu. Right above that we have the drivers, meaning the kernel GPU driver, probably better known as the KFD driver. This part is independent of whether you're using the meta packages from a ROCm release or, as we will see later, the AOMP standalone installation procedure. Sitting on top of the kernel GPU driver we have the runtime — or actually multiple runtimes: at the lower level the ROCm language runtime, and on top of that the different language runtimes for HIP, OpenCL, or OpenMP. Going further up the stack, we move into the ROCm software development kits and developer tools; this is where we have libraries, compilers, and, in my view, partly debuggers. Even further up are the high-level libraries and SDKs for things like machine learning, and then developer tools — here we classify developer tools as including debuggers. Anyway, for this talk we are mainly concerned with AOMP.
I typically think of AOMP as a faster-paced development preview of the OpenMP compiler, whose features will at some point trickle down into ROCm. So this is basically where OpenMP sits here. Now, to give an idea of why meta packages are used, let's look at the next slide, which outlines which actual components are pulled in by the respective meta packages. Let me just briefly answer the door real quick — I'll be back in a second. Apologies for that.

So, let's look at the actual packages that are pulled in via those meta packages. This slide shows the so-called associated packages. What that means is that if we look at the ROCm OpenMP SDK meta package, for example, we see that it pulls in a number of packages: the rocm-core, rocm-llvm, and openmp-extras-devel packages. In addition, it pulls in the ROCm language runtime package, which is again a meta package, so it resolves to the hsa-rocr runtime, the rocm-core package, the comgr package, and the openmp-extras-runtime package. There's a lot going on here, but looking at this picture we can see that many of the other components of the ROCm software stack aren't required for the OpenMP work. As we will see later, the AOMP build scripts won't even pull in a lot of the other ROCm components.

Since I've talked a lot about ROCm and sometimes mentioned AOMP, let's look at the two and understand what they are, in particular how they relate to each other and also to LLVM. In my vocabulary, ROCm means specifically the AMD open source platform for GPU compute. It consists of all the libraries and tools and is the thoroughly tested, let's call it released, product by AMD. AOMP, on the other hand, is an open source compiler and runtime for OpenMP target offload. It is updated much more frequently, as we mirror the source code to GitHub, typically multiple times a day. I think of AOMP, as I said, as a preview version of what may trickle down into the OpenMP compiler that is part of the ROCm software stack. A side note: AOMP does not go through all of the ROCm release testing, so in AOMP you may see things regress one day and get back to the expected results the next day. It's much more like living at head, comparable to LLVM upstream head.

Finally, we also have mainline LLVM, or what I will also call upstream — the LLVM project as hosted in the GitHub llvm-project repository. AOMP is an LLVM fork that is both ahead of and behind upstream. We have started to put much more effort into upstreaming, or at least proposing, changes to mainline LLVM. And even though I'm not going to talk about it much here, our team is also very active in upstreaming OpenMP offload support for the LLVM Flang project.

In summary: ROCm is the AMD open source compute platform; AOMP is an open source OpenMP compiler that can be seen as a preview version of what may come to ROCm's OpenMP support; and LLVM mainline is the open source, community-developed compiler to which AMD is actively contributing. Now let's get some jargon that I will be using throughout the talk out of the way.
I just want to make sure that we understand the same things when I use these words. Upstream is the mainline LLVM project. When I talk about a host, it's the host machine, for example an AMD EPYC processor based server. A target device, or device, or target, is an attached accelerator, for example an AMD Instinct MI200 GPU. The host runtime is the OpenMP runtime for the host functionality — think of libomp. Then there's libomptarget: that is the runtime component of the OpenMP host functionality that takes care of dispatching work from the host to the device, but it itself runs on the host. The target or device runtime is what actually implements runtime functionality on the device and, as the name suggests, most if not all of it actually runs on the device. An important concept is a plugin, which is a vendor-specific implementation of the functionality required by libomptarget to execute a kernel on the device. A kernel is an OpenMP target region, essentially a piece of code that is compiled for and executed on the device. You may also hear me use the words queue and signal. I won't go into any particular detail here, but should I use these terms, I'm likely referring to HSA concepts that are part of the lowest layer of the AOMP implementation, where the HSA runtime talks directly to the ROCm language runtime.

Okay. Now that we have a somewhat clear picture of what AOMP is in the light of the ROCm software stack, let's look at AOMP itself: where you can get it, how you can build it, and what you can use it for. AOMP is an open source OpenMP compiler that specifically targets AMD GPUs. You can download it from GitHub; I put a QR code on the slide that will take you directly to the repository. AOMP is based on LLVM/Clang, as it is a fork thereof, and it tracks upstream pretty closely: AOMP is typically just a few hours behind upstream, though that may grow to a day in case we have to work around certain upstream patches that would break downstream functionality.

The added downstream functionality as of now includes both optimizations and additional features. One of the most important additions in AOMP for our users is our much faster reductions when compared to upstream LLVM. An additional feature is OMPT, the OpenMP tools interface functionality for device-side callbacks and tracing, which is available in AOMP but not yet upstream — at least at the time I put this presentation together, and this is still true. We had several patches up for review, and we're reworking parts of the implementation to better fit upstream's needs; OMPT is coming to upstream at some point. We currently also have some bug fixes only downstream, though we really try to submit fixes upstream instead of applying them to AOMP only. However, depending on the area of the code, the divergence between downstream AOMP and upstream LLVM can make it pretty challenging to determine a particular fix for a bug. And finally, AOMP comes with Fortran support. Currently that is provided via Classic Flang, but our team focuses on moving the LLVM Flang project forward and enabling target offload support there. This is all done upstream, by the way — the whole Fortran work happens upstream.
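Since OMPT came up a moment ago: to give a feel for what the tools interface looks like, here is a minimal sketch of a first-party OMPT tool, based on the OMPT interface from the OpenMP specification (omp-tools.h). It only registers the standard host-side ompt_callback_target callback; the device-side tracing that AOMP adds goes beyond this and isn't shown. This is illustrative spec-level code, not AOMP-specific code.

```cpp
#include <omp-tools.h>
#include <cstdio>

// Fires at the begin and end of each OpenMP target region (host side).
static void on_target(ompt_target_t kind, ompt_scope_endpoint_t endpoint,
                      int device_num, ompt_data_t *task_data,
                      ompt_id_t target_id, const void *codeptr_ra) {
  std::printf("target region %s on device %d\n",
              endpoint == ompt_scope_begin ? "begin" : "end", device_num);
}

static int ompt_initialize(ompt_function_lookup_t lookup,
                           int initial_device_num, ompt_data_t *tool_data) {
  // Look up the runtime entry point used to register callbacks.
  auto set_callback = (ompt_set_callback_t)lookup("ompt_set_callback");
  set_callback(ompt_callback_target, (ompt_callback_t)&on_target);
  return 1; // non-zero keeps the tool active
}

static void ompt_finalize(ompt_data_t *tool_data) {}

// The OpenMP runtime looks for this symbol at startup to activate the tool.
extern "C" ompt_start_tool_result_t *
ompt_start_tool(unsigned int omp_version, const char *runtime_version) {
  static ompt_start_tool_result_t result = {&ompt_initialize, &ompt_finalize,
                                            {0}};
  return &result;
}
```

Compiled into the application (or preloaded as a shared library), a tool like this would print a line whenever a kernel is dispatched, which is the same mechanism a profiler would use to attribute time to target regions.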
So if you are curious about AOMP, you can go to GitHub and download it. Now, the AOMP repository itself does not contain the actual LLVM sources, but a number of scripts and config files to build AOMP from source. It is standalone in the sense that it only requires the KFD — the GPU kernel driver — and libdrm to be installed on the system. It is also isolated from ROCm installations via a custom install prefix, and it uses RPATH on the runtime libs so that we find our own libs. This makes it possible to use a local AOMP installation on a system that has a ROCm release installed. For example, on my local development machine I have a ROCm installation to have access to management tools like rocm-smi and for certain functionality checks that I want to run, but I actually build and develop in a separate AOMP installation.

Building AOMP from source involves building some required ROCm components, like the ROCr runtime that provides the HSA implementation, and we will look at the various components in a little more detail on the next two slides. However, in case you do not want to build from source, you can find distribution packages on the AOMP GitHub for CentOS 7 to 9 and SLES 15 SP4, as well as for Ubuntu 20.04 and 22.04.

Getting back to the ROCm components mentioned earlier: AOMP includes a variety of ROCm components, for example ROCgdb or the ROCr runtime. To keep track of the correct versions of the dependencies for a specific AOMP version, we use a manifest file that encodes the component, the repository, and the tag or SHA of that repository that should be pulled in. For most of the components, AOMP lives on a particular ROCm release branch, and that will only change for a new ROCm release. Sometimes, however, it is necessary to update to a more recent commit or tag to get a needed bug fix. For AOMP's own point releases, the manifest file is created separately so that it is available for future builds. In addition to the manifest file, for each component the AOMP repository has a build script that builds that particular component in a configuration suitable for AOMP, by setting the right CMake flags and installing it into the AOMP install prefix.

Here's an example of a manifest file. In line 20 — this is potentially a little hard to read for you — the file specifies that for the remote roctools and the path rocprofiler, with a component named rocprofiler, the revision to pull in is the tag rocm-5.4.3. This means that we pull in the ROCm 5.4.3 release version of rocprofiler to build this particular AOMP version. On the other hand, for the ROCm device libs — that is line 14, a little further up — we use the public RadeonOpenCompute remote on GitHub and pull in the branch amd-stg-open with a particular revision listed there. This one may be updated more frequently to sync the development between the device libs and the AOMP compiler.

Now that we are somewhat familiar with the ROCm software stack and with how to obtain and build the AOMP compiler, we can look into the compilation process, the device plugin architecture, and how an actual OpenMP target offload program executes on the device. The compilation process requires code generation for both the host and the device.
That means the compiler is invoked twice, for the two different LLVM target triples. If you consider a C or C++ OpenMP application, we can follow the compilation process into the device toolchain: we first create device IR — intermediate representation — which is then fed to the device assembler, which creates the device object. Looking at the bottom of the picture, the same code goes into host IR and then into the host object. Finally, both object files are bundled together into what is typically referred to as a fat binary. In that fat binary we still have code for the host and for the device, which we need to take care of separately in the subsequent link step.

The linking process needs to generate one host image with an embedded device executable. We start with the fat binary and first unbundle it into the device and host objects again. The device object is moved through potentially device-specific tools to create a device executable, and that device executable is wrapped into a correct ELF image. The linker then creates a correctly linked host executable in which it embeds the ELF image that contains the different images for each device kernel. Those images can later be loaded by the libomptarget plugins and be executed on the specific device. As part of that, the linker also links any required libraries, such as libomptarget and others. Now, while I do understand the process to some extent, I'm not the primary expert for all technical details of the linking process and this whole driver business, so in case you have questions about this later on, I'll certainly try to answer them, but I may not know the answer.

Okay, moving from the compilation and linking process to execution time. The process is roughly as follows. The executable starts like any other program on the host: the regular loader loads it and invokes main for C/C++ applications. Moving on to the application's other dependencies, it has the OpenMP runtime library and libomptarget as dependencies. When libomptarget is initialized, it looks for the so-called plugins. These vendor-specific implementations are used to separate common code, which lives in libomptarget, from platform-specific code, for example the plugin for AMD GPUs. When the application hits a target region, libomptarget dispatches the so-called kernel launch to the plugin. The plugin then uses the function name to open the device image that was embedded into the executable, look up the pointer to the code, and load it. It then initializes a kernel launch of that pointer using the particular launch configuration, to obey any user-specified number of threads and similar requests made through the OpenMP API. Finally, the kernel executes on the device, and the plugin deinitializes, if required, after the kernel has finished. Should the kernel be, for example, a reduction, it may use helper functions on the device that are provided via the device runtime. This is similar to libc or libm: a library that provides implementations for common and required operations, here on the device, via a standardized interface.
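To make that execution flow a bit more concrete, here is a small pseudo-C++ sketch of the steps just described. Every name in it (get_default_device, find_device_image, lookup_kernel, launch_kernel, synchronize) is hypothetical and invented purely for illustration; the real interface is the plugin ABI of LLVM's libomptarget (the __tgt_rtl_* entry points), which differs in detail and has changed across LLVM versions.

```cpp
#include <cstdio>

// Hypothetical, simplified stand-ins for plugin-side objects. The real code
// lives in libomptarget and its AMD GPU plugin.
struct Device { int id = 0; };
struct DeviceImage { const void *begin = nullptr; };
struct LaunchConfig { int num_teams; int thread_limit; };

static Device &get_default_device() { static Device d; return d; }
static DeviceImage &find_device_image(Device &) {
  static DeviceImage img;  // in reality: the image embedded in the host ELF
  return img;
}
static void *lookup_kernel(DeviceImage &, const char *name) {
  std::printf("looking up kernel '%s' in the device image\n", name);
  return nullptr;  // stub
}
static void launch_kernel(Device &, void *, void **, int, LaunchConfig cfg) {
  std::printf("launching with %d teams, thread limit %d\n",
              cfg.num_teams, cfg.thread_limit);
}
static void synchronize(Device &) {}  // wait for the kernel to finish

// Roughly the sequence libomptarget and the plugin go through when the
// application hits a target region.
void run_target_region(const char *kernel_name, void **args, int num_args,
                       LaunchConfig cfg) {
  Device &dev = get_default_device();              // plugin loaded at init
  DeviceImage &img = find_device_image(dev);       // open embedded image
  void *kernel = lookup_kernel(img, kernel_name);  // find code by name
  launch_kernel(dev, kernel, args, num_args, cfg); // obey user's requests
  synchronize(dev);                                // deinit if required
}
```

The point of the sketch is the division of labor: the common bookkeeping happens once in libomptarget, and only the last-mile steps — image loading, symbol lookup, launch, synchronization — are delegated to the vendor plugin.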
Now, I've mentioned the plugin a couple of times without ever showing what it is or where in the stack it actually sits, so let's look at the plugin a little closer. The AMD GPU plugin is the vendor-specific — meaning AMD-specific — part of libomptarget. If we look at the picture, we see the hardware sitting at the bottom, with the driver and the HSA runtime sitting on top of it. HSA stands for Heterogeneous System Architecture, and it is a standard maintained by the HSA Foundation. The standard provides an interface to interact with heterogeneous systems that are composed of multiple agents that can talk to each other. In addition to the standard functionality, AMD provides some extensions that are used in the AMD GPU plugin, for example to profile certain calls or to prepare memory regions for data transfers between host and device.

The AMD GPU plugin is built on top of the HSA runtime. It uses HSA-locked, that is pinned, memory for higher-performance data transfers between host and device. It uses the so-called HSA signals and queues to launch kernels and data transfers, and it manages dependencies between the signals to take care of asynchronous events. For most of the OMPT profiling support, it uses the HSA profiling extensions to read the timings of specific signals and report them to the OMPT client. Maybe one more note: HSA is pretty low level and should be used with some caution, meaning it is suited to implementing a runtime system but probably less well suited to implementing higher-level application logic — although nothing would technically stop you from creating signals and writing to HSA yourself. You can see that the OpenMP plugin sits on top of HSA: it is queried by the target runtime and then dispatches into the HSA runtime.

Now finally, with all those things said, let's look at a simple OpenMP example program and what happens when it executes on an AMD GPU after being compiled with AOMP. Let's start with a C++ application that does not have any OpenMP annotation in it: it creates a stack array and initializes its entries with zero, then it loops over the array and assigns each entry the value one, it then prints the values in order, and finally it returns zero to indicate all went well. When we add the OpenMP pragma, it means the following. target means: offload the following block to a device. teams distribute means: distribute the work across teams of threads. parallel for means: workshare the iterations of the following for loop across the team of threads. Then we have this map(vals) clause, which creates a data environment for the target region by applying the respective data movement. If you leave out the specifier in the map clause, it defaults to the now-highlighted tofrom specifier. That means the data environment shall be created by moving the data of the vals array from the host to the device before launching the kernel region, and, after the kernel has finished, moving the data from the device version of the vals array back to the host version — that is the word from. Basically: copy the array in, change the values, and copy the changed values back.

Putting it together, we can think of the code as executing like this on the device. In libomptarget, this code will hit the following markers: first a data submit, that is, copying the data to the device; followed by a run region, that is, actually executing the kernel on the device; followed by a final data retrieve, copying the data back from the device.
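As a sketch, the example program with the pragma could look like the following — the talk doesn't show the exact source, so the array size of 16 and the loop bounds are my own assumptions. The comments mark where the three libomptarget markers fire.

```cpp
#include <cstdio>

int main() {
  int vals[16];                 // stack array
  for (int i = 0; i < 16; ++i)  // initialize entries with zero
    vals[i] = 0;

  // map defaults to tofrom: "data submit" copies vals to the device,
  // "run region" executes the kernel, "data retrieve" copies vals back.
#pragma omp target teams distribute parallel for map(tofrom : vals)
  for (int i = 0; i < 16; ++i)
    vals[i] = 1;                // each entry set to one, on the device

  for (int i = 0; i < 16; ++i)  // print the values in order
    std::printf("vals[%d] = %d\n", i, vals[i]);
  return 0;                     // all went well
}
```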
Inside the data submit, libomptarget also creates a mapping table to track which areas of memory are mapped to the device and which device pointers are assigned to a particular region. This list is reference counted, and should the reference count for a particular device storage go to zero, the device storage may be freed. I see I'm almost running over time, so I'll finish here and just wrap it up.

In summary, AOMP is an open source compiler for OpenMP target offload based on Clang/LLVM. It supports C, C++, and Fortran, and it is mirrored to the public GitHub instance much more frequently than ROCm releases. Features are typically merged into future ROCm releases, though understand this as: features may go into ROCm. Compared to upstream LLVM, AOMP offers some more optimizations and features, and we are actively working on reducing the delta and would like to upstream more of the currently downstream-only changes. In general, AOMP provides a standalone compiler that can be built using the scripts provided in the repository, and it only requires the KFD driver and libdrm to be installed on the system, as it pulls in the other required components and builds some from source. And with that, thank you so much for your attention. I hope you found a piece of information in here that is interesting to you. Now I'd like to open it up for questions.

Okay, thanks a lot, JP. Can you hear me? Yes, loud and clear. Right, do we have any questions in the room? Yeah, can you come up here, that's going to be easier.

Thank you for the talk. Just one question: I noticed the other day that in GCC 13 they also added support for offloading to the newer AMD GPUs for OpenMP. I'm wondering if the GCC OpenMP support is in any way related to AOMP, or is it a completely independent project?

So, let's say I'm not aware that we contribute to that directly. Well, we are in touch with a contractor that puts offloading capabilities into GCC, but I'm not 100% sure how much that is. And the runtimes, as far as I understand, are developed separately, although people are talking to each other. So I guess that's the best answer I can give: I think there are talks, and this might actually be related, but I'm not completely sure.

Other questions? Do they want to unmute themselves, or do we have some questions in the Zoom? Yeah, maybe I'll repeat the question. It's basically the same question asked by Thomas and Lars in the EUM Slack channel. Very nice. So Thomas was asking — I'll just read the whole question: maybe this is too technical, but if I want to test my code on some machine using the AMD compilers, and the test machine has Intel CPUs and no AMD GPUs, does the license allow me to test my code on my desktop or CI system, or am I not allowed to install the software stack there? Because step zero is making sure that I can build with the compiler before I go anywhere near testing offloading and so on.

I'm not completely sure I understand the question, actually, because, let me say, I'm not aware of any legal issues with installing AOMP on your machine. But I'm also not a legal person. Maybe this applies to AOCC, because there are some restrictions there — this may be true — but AOCC is different from AOMP. If you were to have an AOCC installation on the system,
I believe that AOMP would pick up any proprietary optimization passes that you may have installed and use them. However, if you just go to GitHub and download AOMP, it's there to be used. It's open source, and it's released under the license that is available in the repository.

Then Lars's question is somewhat related. He was going to ask whether ROCm works on consumer GPUs, for those of us playing along at home, and then he said he just found out that some RX 6000 series card got partial support about two weeks ago. It's a bit of a road bump to adoption if you can't play around with it much.

Yeah, I'm not sure how to answer that. Let's say: for my development machine, I have an RX 6800 desktop card here, and I believe that for AOMP you can use a lot of desktop cards. What I find very helpful is the LLVM documentation of the AMD GPU backend; it lists numerous AMD devices and what we call GFX numbers — the actual device architecture, not the marketing name. Within our CMake you will see all the GFX numbers that we support. I believe the non-support that you will typically see has a lot to do with the optimized libraries that you get as part of the ROCm stack. Therefore, let's say you take an OpenMP program and compile it with AOMP for a device whose GFX number is supported in our CMake — you should get an executable program. Now, there are differences: for example, some of the data center GPUs have what is called a wavefront size of 64, whereas consumer cards have a wavefront size of 32. We hit issues with that in the past and we fixed them, because we use consumer cards for our development as well. So I think saying that ROCm supports a certain card or a certain series of cards, and being able to generate code with our OpenMP compiler — these are two different things, and you should be able to use our stuff on many desktop systems.

Okay, we'll wrap up here. Thank you very much. Thanks for having me.