The next talk is from Jeffrey Vetter, who is going to talk to us about making SDRs more portable in the era of heterogeneous SoCs.

Good afternoon, everyone. I hope you had a chance to go out and grab some lunch; it's an interesting experience here. My colleague Seyong Lee is with me, and he's going to help me give part of the talk. You've already heard several talks this morning where people are porting SDR to different heterogeneous architectures. I would argue that's not just an SDR phenomenon; we're going to see it everywhere. At DOE, for example, all of our next supercomputers are going to be GPU based, and we have lots of users struggling with how to port to these architectures. We're part of the DSSoC program, and we're investigating the performance portability of SDR. We've done a lot of profiling and trying to understand target architectures. This really isn't different from what we do with our DOE applications. It's different software, of course, but we sit down with those teams. We don't know astrophysics; we don't know materials design. We sit down with those people and try to learn how to make their applications perform well on these architectures. That's what we're doing here. We're trying to use open programming models to do this. One of the things that's really important is future-proofing your application, so you don't have to worry about recoding your application for every new architecture. Finally, intelligent runtime systems, which we've heard a lot about earlier. We think those are going to play an increasingly important role, in addition to code generation, in the performance of these applications. The goal we're looking at is being able to take the same application, untouched, and run it on a Qualcomm Snapdragon and run it on the largest supercomputer in the world, and just see if we can meet that goal. It may not happen, but that's the challenge we're aiming for.

Just a short note about specialization. I love this chart by Ray Kurzweil that shows the growth of computing since 1900; it really shows you where we are today. We're nearing the end there, where we've had some outstanding gains from CMOS scaling and other things, and now we're really getting into this transition period. I label it the sixth wave. I think there are a lot of things that are going to happen, and we're already seeing a lot of this. The first is that people are really specializing their applications and optimizing their code for the architectures we have. The other one, which I'm not going to talk about, is emerging technologies. You've seen all the excitement about quantum computing and neuromorphic computing and all these other things; that's people looking for the next computational paradigm. Today I'm really going to focus on architectural specialization. I think this is where we'll be for perhaps ten years or more, where we take CMOS, which is an incredible technology, and specialize it for workloads. That's the specialization part, and it's really important that you get the specialization right. A consequence of that is that the architectures start to yield very complex programming models. This is a schematic I use when talking about high performance computing: even if you are in astrophysics or materials design or climate modeling, your application these days has to be made up of a stack where you have maybe MPI, some threading model on the SMP, and then, if you're using an accelerator, OpenACC, CUDA, OpenCL, something like that. It really varies across platforms. CUDA is not available everywhere; of course, it's proprietary. Other things like OpenCL may or may not be available.
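To make that layered stack concrete, here is a minimal sketch, assuming an MPI library, an OpenMP compiler, and an OpenACC-capable compiler are available. It is illustrative only; the accelerator layer could just as well be CUDA or OpenCL depending on the platform, which is exactly the portability problem being described.

// Sketch of the layered HPC programming stack: MPI across nodes,
// OpenMP threads within a node, and an OpenACC region for an accelerator.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // Node-level threading (SMP) with OpenMP: scale the input vector.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        x[i] *= 2.0f;

    // Accelerator offload with OpenACC: y = a*x + y, plus a local sum.
    float* yp = y.data();
    const float* xp = x.data();
    const float a = 0.5f;
    float local = 0.0f;
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n]) reduction(+:local)
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
        local += yp[i];
    }

    // Inter-node communication with MPI: combine each rank's partial sum.
    float global = 0.0f;
    MPI_Allreduce(&local, &global, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) std::printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}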
When you look at it in that context, and this is generally true for the DSSoC program, the first question is: how do you design that architecture? I show up on your doorstep and say I have 10 billion transistors, they're yours, what do you want on your chip? How do you analyze your workload? Part of our project focuses on looking at SDR to try to understand what those features would be. The second question is programmability. Is it really possible to design an application with one language and programming model that will run across all these architectures and still get reasonable performance out of them? That's an open question, but it's going to impact all these teams writing applications and software; they only have so much time to spend on science versus porting software.

This is the DARPA program, and it has three different areas, some of which I've already mentioned and will talk more about in a moment. The important part here is that you really do assume that you're going to have SoCs and that they're going to be very diverse. I'm talking about way more than an ARM CPU and an ARM GPU. They may have inference engines, they may have FPGAs and other things on them, and you've already heard about some of the complexity of dealing with those when you have just one.

The overall structure of our project started out like this. We were looking at applications, and we're using an ontology framework to try to capture the differences and the properties of things in SDR. Then there's the programming system, which is what Seyong is going to talk about; this is really static analysis and code generation. If you're going to run your SDR module on an FPGA, you just heard a great talk about everything that goes into that. Is there any way we can simplify that? Next you've got the runtime system, and there's a big question there: will Linux evolve to handle all these heterogeneous processors and memories that are coming its way, or will that be handed off to another runtime system? We've developed a prototype runtime system, which I'll tell you a little about, that's trying to explore that space. Finally, there's the hardware. Several of the teams in DSSoC are actually building hardware. We're not, but we're reaching out and making use of some of the more complex systems out there. This is just a timeline of how we see this manifesting itself over time; I'll skip that due to time.

Let's talk a little about the range of hardware that we are thinking about. As I mentioned earlier, we run on Summit right now. The schematic you see is a 200-petaflop system at Oak Ridge with about 27,000 Volta GPUs in it. They're all configured in the node schematic you see here, where you've got six GPUs and two POWER9 CPUs. Getting all those programming models I was talking about to work properly and efficiently on that is very complex. Then you add the fact that you've got thousands of other nodes like this that are communicating at the same time and potentially doing IO, and it's a very dynamic system. We just heard a nice talk about FPGAs and being able to use them for signal processing; we're looking at those. Then, more recently, we've been looking at NVIDIA's Jetson Xavier. Again, it's an SoC, but it's a very aggressive design. It has ARM cores. It has a Volta GPU, smaller than what you get in a V100. It's got a deep learning accelerator and some other things like vision processing accelerators.
When you get right down to it, that's what we think the future is going to look like. We're going to have thousands of SKUs of different types of processors, and you're just going to buy the one that fits your niche. Then, of course, there's Snapdragon. This is the one that we're focusing on this year. It may be less well known here, but it's far more widespread than the Xavier I just mentioned, because these are cell phone chips. They're very complex. They've got ARM big and little cores. They have vector accelerators. They have a tensor accelerator, something like a training engine, and then different codecs and things like that, in addition to all the IO for connectivity to 5G and so on. It's a growing space, and it seems to be accelerating.

Just a word about some of the application studies that we're doing; we have some tools that we'll make available if you're interested in looking at those. When we put together apps, we're not experts in SDR. We're learning a lot, so we wanted to go out and get some workflows that would allow us to understand how to profile and how to analyze these. We started with the Wi-Fi one, in part because we really understand how to measure its performance at the application level: if you do a file transfer and run it across this flow, you can get real feedback on error rates as well as throughput. That's something we've been studying. We've also built tools that focus on the ontologies and the performance analysis I mentioned earlier. The first few were really for looking at all the flows in the system, and the later ones are more for profiling them. We built some frameworks that let you bring these flows up, run them for, say, 30 seconds, and then produce a profile of where you're spending your time in the flows. This is an example of where some of the time is going, and it's pretty much a normal profile. One of the questions is whether we are getting representative flowgraphs and whether 30 seconds is really sufficient. The cooler-looking analysis is this block proximity analysis. What we did was look at CGRAN and some of the other repos we had access to, and what you see in this graph is really a network diagram where the nodes represent basically how many times each block is used and the weights on the edges represent the connections between those blocks in workflows. This is useful, for example, if you're trying to combine operators that you could put on an SoC that's coming down the line, or maybe optimize by creating a new module. All of these tools are available on GitHub, and we'd love feedback on them.
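As a rough illustration of that block proximity idea (not the actual tool; the block names and input format below are invented), the analysis boils down to counting, over a set of flowgraphs expressed as edge lists, how often each block appears and how often each pair of blocks is connected.

// Toy block-proximity analysis: node weight = how many flowgraphs use a
// block, edge weight = how often two blocks are connected in a flowgraph.
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>
#include <cstdio>

using Edge = std::pair<std::string, std::string>;
using Flowgraph = std::vector<Edge>;

int main() {
    // Two toy flowgraphs standing in for a repository scan.
    std::vector<Flowgraph> graphs = {
        {{"file_source", "throttle"}, {"throttle", "fft"}, {"fft", "qt_sink"}},
        {{"usrp_source", "fft"}, {"fft", "qt_sink"}},
    };

    std::map<std::string, int> node_weight;   // how often each block is used
    std::map<Edge, int> edge_weight;          // how often two blocks are adjacent

    for (const auto& fg : graphs) {
        std::set<std::string> seen;           // count each block once per flowgraph
        for (const auto& e : fg) {
            seen.insert(e.first);
            seen.insert(e.second);
            ++edge_weight[e];
        }
        for (const auto& b : seen) ++node_weight[b];
    }

    for (const auto& [b, w] : node_weight)
        std::printf("block %-12s used in %d flowgraphs\n", b.c_str(), w);
    for (const auto& [e, w] : edge_weight)
        std::printf("edge  %s -> %s : %d\n", e.first.c_str(), e.second.c_str(), w);
    return 0;
}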
Next, working our way down, the programming system is where we've been spending a lot of our time recently, along with the runtime system. I'm going to let Seyong tell you about our programming strategy.

I'll talk about the programming system that we use. This gives the overview of the system. As you can see, we use two compilers as the base: one is OpenARC and the other is LLVM. OpenARC is a homegrown compiler developed by us, and we use multiple programming models like OpenMP, OpenACC, IRIS, HIP, CUDA, and OpenCL. Just by looking at this figure it may not be easy to understand what's going on in the programming system, so let's look at a concrete example. This shows how we port GNU Radio blocks to heterogeneous devices using our framework. Basically, what we assume is that users can write a program using a high-level programming model like OpenMP or OpenACC.

In this example, suppose the user writes a GNU Radio block using OpenACC. Our compiler then performs a source-to-source translation and generates the output program depending on the target architecture. That means that if you want to run your block on an NVIDIA GPU, our compiler generates output CUDA code, and it will run on the NVIDIA GPU. If you want to run your blocks on, for example, an ARM or some other general-purpose CPU, our compiler will automatically generate an output OpenMP program and run it on that architecture. Likewise, depending on the target architecture, our compiler automatically generates the output kernels. One key thing about our framework is that, no matter which type of output programming model we generate, we use a common runtime called IRIS. That's another runtime system that we developed ourselves. One benefit of having a common runtime interface is that you can write an application where one task runs on the GPU using CUDA, another task runs on the CPU using OpenMP, and yet another task runs on the FPGA using OpenACC, or something like that. That kind of true intermixing of multiple different output programming models is possible using our framework. We also support several other newer models, like HIP, which is another GPU programming model developed by AMD, and SYCL, but I won't talk in detail about those newer programming models. For example, if you want to port your GNU Radio blocks to a Jetson-like SoC, we can use CUDA for the Jetson GPU and OpenMP for the Jetson CPU.

This shows the code structure of the OpenACC block. As you know, the original GR block is written as a C++ class, and we follow the same structure. That means that if you want to write your own OpenACC GR block, you have to write this type of class, but one common requirement is that every OpenACC block should inherit from a common parent class called the GRACC base block. What it does is initialize the data structures used by the OpenACC runtime, and it also assigns a unique logical thread ID to each block instance. Actually, the reason we need that kind of parent class is the multi-threading issue. As you learned in the morning session, the original GNU Radio framework provides multi-threading, which means that multiple GR blocks can be executed by multiple different threads. But our OpenACC runtime also supports multi-threading, and the problem is that the original GNU Radio block framework uses C++ Boost-based multi-threading while the OpenACC runtime uses its own OpenACC-based multi-threading, so the two systems don't know about each other. To handle the thread-safety issue when integrating the two different systems, we introduced the concept of a logical thread. Using the logical thread ID, the OpenACC runtime knows which thread-local data structures should be used. So it's a low-level detail about how to enforce thread safety when we integrate multiple systems; I won't explain all the details. Anyway, in this OpenACC block, we provide two implementations. One is the reference CPU implementation, which is exactly the same as the original GNU Radio block, and the other is the OpenACC implementation, which is the OpenACC version of that reference CPU code. If you write the basic OpenACC implementation, it performs three types of tasks: at the beginning of the invocation, it copies the data from host memory to device memory; then it launches the device kernel; and at the end of the execution, it copies the result back to host memory.
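To make that structure concrete, here is a hypothetical skeleton of such a block. The class and block names are invented for illustration and the real code will differ, but it shows the two ideas just described: inheriting from a common base that hands out the logical thread ID, and a work function whose OpenACC path does copy in, launch the kernel, copy out.

// Hypothetical OpenACC-enabled GR block skeleton (names are illustrative).
#include <gnuradio/io_signature.h>
#include <gnuradio/sync_block.h>

// Common parent: stands in for the GRACC base block, which sets up OpenACC
// runtime state and assigns each block instance a unique logical thread ID.
class GRACCBase : public gr::sync_block {
protected:
    int logical_tid_;
    static int next_id() { static int id = 0; return id++; }
public:
    GRACCBase(const std::string& name,
              gr::io_signature::sptr in, gr::io_signature::sptr out)
        : gr::sync_block(name, in, out), logical_tid_(next_id()) {}
};

// Toy element-wise block: out[i] = k * in[i].
class acc_multiply_const : public GRACCBase {
    float k_;
public:
    explicit acc_multiply_const(float k)
        : GRACCBase("acc_multiply_const",
                    gr::io_signature::make(1, 1, sizeof(float)),
                    gr::io_signature::make(1, 1, sizeof(float))),
          k_(k) {}

    int work(int noutput_items,
             gr_vector_const_void_star& input_items,
             gr_vector_void_star& output_items) override {
        const float* in = static_cast<const float*>(input_items[0]);
        float* out = static_cast<float*>(output_items[0]);
        const float k = k_;
        // The copyin/copyout clauses express the host-to-device and
        // device-to-host transfers; the loop body becomes the device kernel
        // after the source-to-source translation.
        #pragma acc parallel loop copyin(in[0:noutput_items]) copyout(out[0:noutput_items])
        for (int i = 0; i < noutput_items; ++i)
            out[i] = k * in[i];
        return noutput_items;
    }
};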
This shows an example translation of an OpenACC block. On the left you see a simple OpenACC block for the GR log module. What we did here is that, on top of the existing CPU implementation, we just add one line, the OpenACC directive. Then our compiler automatically generates the host program and the kernel program, depending on the target architecture. This shows an example workflow where we used our own OpenACC GR block: the top workflow uses our GR OpenACC block, while the bottom workflow uses the original reference GR blocks. Both workflows do the same thing; the only difference is that one uses the OpenACC block and the other uses the reference CPU implementation.

This shows the basic memory management scheme for the OpenACC-enabled GR workflow. OpenACC offloads the computation to a device, which can be a CPU, GPU, FPGA, whatever. In this case, when OpenACC block 1 is invoked, it first has to copy data from the host to the device, and after the device kernel finishes, it has to write the data back to the host. There is some inefficiency here. If you look at OpenACC block 1 and block 2, and specifically at the green-colored data, we have three copies and some data transfers between the host and the GPU. But we don't have to do that, because the second OpenACC block also runs on the same device, so that data transfer between host and GPU is unnecessary. So what we did is that, if we know that both the producer block and the consumer block are running on the same device, we can remove the unnecessary memory transfers between the host and the device. This shows the example output of the sample workflow.
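A schematic sketch of that data-staging rule, with made-up names and not the actual runtime code, would look something like this: only move a buffer back to the host when the block that consumes it is not on the same device as the block that produced it.

// Sketch: elide host round trips when producer and consumer are co-located.
#include <cstdio>
#include <string>

struct Buffer {
    std::string location = "host";   // "host" or a device name
};

// Hypothetical helpers standing in for the runtime's actual copies.
void copy_to_device(Buffer& b, const std::string& dev) { b.location = dev; }
void copy_to_host(Buffer& b)                           { b.location = "host"; }

// Run one block's kernel on 'dev', staging its buffers only when needed.
void run_block(const std::string& name, const std::string& dev,
               Buffer& in, Buffer& out, const std::string& consumer_dev) {
    if (in.location != dev) copy_to_device(in, dev);   // host -> device if needed
    std::printf("%s: kernel on %s\n", name.c_str(), dev.c_str());
    out.location = dev;                                // result lives on the device
    if (consumer_dev != dev) copy_to_host(out);        // skip copy if consumer is co-located
}

int main() {
    Buffer a, b, c;
    // Two consecutive OpenACC blocks mapped to the same GPU: the intermediate
    // buffer 'b' stays resident on the device, so the host round trip is skipped.
    run_block("acc_block_1", "gpu0", a, b, "gpu0");
    run_block("acc_block_2", "gpu0", b, c, "host");
    return 0;
}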
Let's look at the performance comparison. What we did here is compare the performance of the OpenACC-block-based workflow against the original CPU-implementation-based workflow. In this figure, one color marks each OpenACC block and green marks its reference CPU implementation. Blocks with the same label implement the same algorithm, but one is implemented with OpenACC and the other is the reference CPU implementation. Here, we ran both the OpenACC workflow and the reference CPU workflow on the same CPU. As I said before, an OpenACC block provides two implementations, the reference CPU implementation and the OpenACC implementation. So if we target the CPU, even the OpenACC blocks just use the original reference CPU implementation, and in this case you would expect the OpenACC block and the reference CPU block to run equally fast. But as you can see, for example in D2K and D1D2K, our OpenACC blocks perform a little better than the reference CPU blocks, because when we implemented the reference CPU version inside the OpenACC block we applied a very simple caching optimization, and that causes some performance difference.

Let's look at the next case. Here we offloaded the OpenACC blocks, again on the CPU, but using OpenMP as the back-end programming model. That means that instead of using the reference CPU implementation, the OpenACC block uses the OpenMP implementation of that block. For some blocks the OpenACC blocks perform better than the reference CPU implementation, but for other blocks, like B and C, the reference CPU version still performs better than the OpenACC block. Why? Because some of the original GNU Radio implementations were vectorized using the VOLK library. That means that what we're really comparing here is a version parallelized with OpenMP multithreading versus a version vectorized with the VOLK library. This shows that, depending on the characteristics of the block, sometimes it's better to use vectorization and sometimes it's better to use multithreading.

Next, we offloaded the OpenACC blocks to the GPU, but without any optimization of the memory transfers. As you can see, even though we offload the computation to the GPU, because of the extra overhead of the memory transfers between the host and the device, in most cases the GPU performs worse than the CPU version. But when we applied our memory transfer optimization, we could remove some unnecessary memory transfers, and in that case, in most cases, the OpenACC blocks perform better than the CPU blocks. There is one exception, where OpenACC still performs worse than the reference CPU. That is because the original CPU implementation was vectorized using the VOLK library, so we are still comparing vectorization on the CPU against parallelization on the GPU. Depending on the computation, it may still be better to run on the CPU using vectorization. But in most other cases, if we can offload the computation to the GPU and optimize away the unnecessary memory transfers, we can beat the performance of the original reference CPU. This next one is a little more complex, but I'll skip it. So what we can learn from this is that once you create your blocks using OpenACC and create a workflow out of OpenACC blocks, you can create one workflow that can run on multiple different types of devices together: part of your blocks can run on the GPU, part on the CPU, and another part on yet another device. That kind of true mixing of heterogeneous devices is possible if you use this framework. Thank you.

Just one thing I wanted to say about that diagram: we modified GNU Radio so that we could add some attributes to the blocks that let you say where to co-locate the blocks. Ideally, that would happen automatically somewhere along the way, so there would be a phase where you could merge some of those blocks into a super block and then put them on an FPGA, for example. We're running a little short on time, so I want to tell you about IRIS. IRIS is our runtime system. We looked around, and there are a lot of runtime systems out there, but we didn't find one that ran on all the platforms with all the different programming models we wanted, for the reason Seyong mentioned: some nodes just don't support certain programming models well. In some cases OpenCL is a great choice; on things like the Xavier, OpenCL isn't there, so you have to use CUDA and OpenMP. So we created IRIS to help both orchestrate the data movement and launch and manage tasks across a diverse set of system architectures. We have several different models here. We have a memory model with a directory that tracks where your data is across these different devices and injects data movement into the execution of the tasks if necessary. And it really looks like a DAG. Everything in computer science at some point seems to get back to a DAG, but that seems to be working for us. We're not talking about millions of nodes; we're talking about dozens or maybe a few hundred nodes in our DAG.
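To give a feel for what such a small task DAG looks like, here is an illustrative sketch (not the IRIS API; all names are invented): tasks with explicit dependencies, executed from a ready queue once all of their predecessors have completed, with a device chosen per task by some selection policy.

// Toy task DAG with a sequential ready-queue executor.
#include <cstdio>
#include <functional>
#include <queue>
#include <string>
#include <vector>

struct Task {
    std::string name;
    std::string device;                 // chosen by a selection policy or hint
    std::vector<int> deps;              // indices of predecessor tasks
    std::function<void()> body;
};

void run_dag(std::vector<Task>& tasks) {
    std::vector<int> remaining(tasks.size());
    std::vector<std::vector<int>> succ(tasks.size());
    std::queue<int> ready;
    for (size_t i = 0; i < tasks.size(); ++i) {
        remaining[i] = static_cast<int>(tasks[i].deps.size());
        for (int d : tasks[i].deps) succ[d].push_back(static_cast<int>(i));
        if (remaining[i] == 0) ready.push(static_cast<int>(i));
    }
    while (!ready.empty()) {             // run a task once all its deps are done
        int t = ready.front(); ready.pop();
        std::printf("running %s on %s\n", tasks[t].name.c_str(), tasks[t].device.c_str());
        tasks[t].body();
        for (int s : succ[t])
            if (--remaining[s] == 0) ready.push(s);
    }
}

int main() {
    std::vector<Task> dag = {
        {"h2d_x",  "gpu0", {},  [] {}},
        {"kernel", "gpu0", {0}, [] {}},
        {"d2h_y",  "gpu0", {1}, [] {}},
        {"post",   "cpu",  {2}, [] {}},
    };
    run_dag(dag);
    return 0;
}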
Seyong talked about the programming models; here's an example of what we're running across today. With IRIS you can take it and run it on this node, for example, which has a Xeon, Pascal GPUs, and a Stratix 10 in it. You can run it on a Summit node using the requisite programming model on each chip. And then it runs on Radeon, Xavier, and Snapdragon, and we list out what software stack we used to do that. I don't have a demo, I didn't think I had time for it, but here are snapshots of it booting on all of those platforms, and you can get a feel for how it does that. Here it's identifying that it has the CUDA platform available, loading the shared object for that, and skipping things like HIP and OpenCL.

The task scheduling is interesting. Many of you already know about task scheduling, so the thing I want to mention here is that we're actually getting to the fun part now. We've got this nice framework that has many mechanisms that let us schedule things, but we want to start exploring the device selection policies. You could do a simple thing like have a hint, a pragma that says run all of these tasks on a GPU. But we could also look at things like profile-based or ontology-based policies, or, one of the original ideas we looked at, using a performance model to give you a rough indication of where a task should be performed. Once you have that, the task scheduler executes it against the ready queue every time. I've got a simple example here of a SAXPY, and this is running on the Xavier using the GPU and the ARM CPUs. Basically, what we want to do is run the A-times-X part of the SAXPY on the GPU and the plus-Y part on the CPU, and, as I mentioned, we do have dependencies. I'm running short, so I'm going to speed through this. There's a C++ interface to IRIS and there's a Python interface to IRIS; the Python interface is right here, and it's a lot easier to explain. You create data regions with IRIS, then you launch these tasks, and these tasks point to code; for example, right there is the CUDA code that gets launched when IRIS submits this task and calls it up for execution. You have to do the transfers and identify how big the data is. There's also the OpenMP version; we have to execute a little pre- and post-execution code for those, but we think we can automate that with things like OpenARC. And then here's the execution memory management. In this diagram it doesn't really stand out very much, but basically what happens is that when you identify these read-write dependencies on the different data regions, you're telling IRIS that there's potential inconsistency there that it needs to manage, so it has a directory and it knows to go in and move the data from one device to another based on the tasks being completed.

Alright, I'll jump ahead. The other thing is that we're looking at efficiency: we're trying to get this to execute fast, and we can do almost 90,000 tasks per second on both the CPU and the GPU. It's online, so feel free to take a look at it and let us know what you think. With that, I'd like to thank Faustum and SDR and DARPA and DOE for funding and inspiring our work. Thank you.