OK, let's start. I have to talk a little faster because of this technical problem here. My name is Kai, and I'm the maintainer of LDC, the LLVM-based D compiler. Today I'm talking about heterogeneous computing with D, a topic I personally find very interesting.

Heterogeneous computing: normally you have just a CPU in your notebook or your desktop. With the advance of graphics hardware you get more computing capability — you can offload computations to your graphics card and run them in parallel, very fast. You build one application which runs on the CPU and offloads some computations to the graphics card, and together this is heterogeneous computing, because you are targeting different computing devices.

The vision we had was to use one language for both the CPU programming and the GPU programming. For this it is natural to use a systems programming language, and so the choice is D — for my project it is simply D. The work here was done by a team member. It was inspired by a talk at DConf two years ago; this team member said, "Oh, that's an interesting task, I'll do it," and I think about six months later it landed in LDC. So you can actually download the LDC compiler and try it out.

So what is a GPU, and why is this interesting? I assume everybody here does programming. You write your code, and it's mostly single-threaded, because multi-threading is very complex. GPUs add massive multi-threading, but in a very restricted way. It is not like a CPU running Windows or Linux, where you also have multiple threads, but those threads are independent — each one executes its code independently of everything else. A GPU is more like the SIMD extensions available in modern CPUs: you have one instruction, but it works on multiple data items, in multiple threads. And in the GPU architecture you have thousands of these threads, so you are massively parallel. The one drawback to remember is that you really do have one instruction working on a lot of data, executed in parallel, and you have to organize this.

The GPU organizes it as a hierarchy. At the lowest level you have a single thread. These threads are grouped into so-called warps; a warp consists of 32 threads, and this is the unit that steps in sync — you always have 32 threads executing the same code. That is something you have to think about: if there is a branch, and one part of the threads wants to take it while the other part does not, then they are no longer in sync, and some of those threads have to wait, because a warp can only step in lockstep. So the warp is the unit of execution. One step higher, warps are grouped into blocks — a block is just a set of n warps — and this is usually the unit you talk about when you do the programming. The set of all blocks is called a grid. This hierarchy of threads is the execution model of the GPU.
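To make the hierarchy concrete, here is a minimal sketch of how one GPU thread typically finds the element it should work on. The intrinsics blockIdxX, blockDimX and threadIdxX are hypothetical stand-ins for what the real APIs expose (CUDA's blockIdx.x and friends); only the grid/block/thread arithmetic and the divergence point are what matters here.

    // Hypothetical stand-ins for the per-target intrinsics (e.g. CUDA's
    // blockIdx.x, blockDim.x, threadIdx.x); real code uses the target's own API.
    size_t blockIdxX();
    size_t blockDimX();
    size_t threadIdxX();

    // One GPU thread scales one element of the array.
    void scaleKernel(float* data, size_t n, float factor)
    {
        size_t i = blockIdxX() * blockDimX() + threadIdxX(); // this thread's global position
        if (i >= n)
            return;              // threads past the end just exit (a harmless divergence)
        data[i] *= factor;       // executed in lockstep, 32 threads of a warp at a time
    }

If the body instead branched on something like i % 2, half of each warp would sit idle while the other half runs its side of the branch — that is exactly the divergence cost described above.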
What you also have to deal with is that there are a lot more memory types. If you just do application programming, you have main memory, and that's it: you allocate it, you free it — or maybe the garbage collector frees it — you use it, and often without thinking about the layout of your data structures and so on. That is very, very different with GPUs. The first point is that they have different memory types.

There is global and local memory. Global memory is really global: it is shared between all threads on the GPU and also with the host CPU, so it is the communication area, too. Local memory is quite the opposite: it is memory visible to only one thread on the GPU. That's the big contrast. Then we have shared memory, which is shared between all threads but not with the host; it is often used for exchanging data between threads, and it is also an area where you need synchronization. Then there is constant memory: you just dump it onto the GPU, and it is read-only, which can be an advantage in some circumstances — GPUs come from graphics cards, so there may be texture data or something like that which is only ever read, and there is a special kind of memory for it. And, very interestingly, a GPU is a processing unit that has a register file like any CPU, but on a GPU this register file also uses memory, so in theory it is an infinite register file: you have an unlimited number of registers, simply because it is backed by memory.

Those are the things that make you want to use such a device. But there are some drawbacks, too. Depending on the generation or the vendor of the graphics card, for example, double precision may not be supported at all — only single precision. A normal graphics card used for games only needs float, not double, and double is what scientific computation tends to use. So this is a drawback you can run into. The first generations of GPUs also did not support recursion, simply because they have no stack, and without a stack you can't do recursion. Modern GPUs still don't have a stack, but they can emulate one to some degree. You have to keep all of this in mind if you are going to program this device.

So if I want to program the graphics card in this notebook, what do I have to do? I have to think about my target. There are APIs: an open one, OpenCL, run by a nonprofit organization, and a proprietary one, CUDA, which is proprietary but widely used because the vendor sells a lot of graphics cards. What I do is write the function I want to run on the GPU in some high-level language — in this case D — and compile it. This produces an intermediate representation, which depends on which API you use. Then, during the runtime of your application, you have to load this GPU file and put it on the graphics card, and the graphics driver does an additional compile or JIT step and translates the intermediate representation into machine instructions for this specific hardware device. This also means GPU code is never standalone: I always need a host program to load the code, launch it, provide the data, and get the results back. So there are always these two parts — and I want to do all of this in D.
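As a rough illustration of that host/device split, here is a minimal sketch of the host side. Every name in it (GpuModule, loadGpuModule, copyToDevice, launchKernel, copyFromDevice) is a hypothetical placeholder for the calls the CUDA or OpenCL runtime — or a wrapper library — actually provides; only the sequence of steps is the point.

    // All types and functions below are hypothetical placeholders for the
    // real CUDA/OpenCL runtime calls; the load/copy/launch/copy-back
    // sequence is what the sketch is meant to show.
    struct GpuModule {}
    struct DeviceBuffer {}
    GpuModule    loadGpuModule(string irFile);
    DeviceBuffer copyToDevice(float[] hostData);
    void launchKernel(GpuModule m, string kernelName, size_t blocks,
                      size_t threadsPerBlock, DeviceBuffer buf, size_t n, float factor);
    void copyFromDevice(DeviceBuffer buf, float[] hostData);

    void runOnGpu(float[] input)
    {
        auto mod = loadGpuModule("kernels.ptx");          // load the IR; the driver JITs it for the card
        auto buf = copyToDevice(input);                   // move the data into global memory
        launchKernel(mod, "scaleKernel",
                     (input.length + 255) / 256, 256,     // grid size and threads per block
                     buf, input.length, 2.0f);            // kernel arguments
        copyFromDevice(buf, input);                       // fetch the results back to the host
    }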
And why do we want this? Because it is massively parallel, and many scientists are interested in it, because there are many, many applications. Examples are machine learning — a lot of machine learning is multiplication of large matrices, which needs a lot of memory and a lot of data, and it can be done in parallel, so that is a very good fit. There are also Monte Carlo simulations in the financial world: if you want to know how the stock market might develop, you can run that kind of simulation, and it parallelizes well. Or pattern matching in a genome database — big data you have to search through, but again parallelizable. Closer to the GPU's roots there is ray tracing, another example where you just compute and get an image out of it. So scientists are very interested in this. But don't forget the games industry: there is big money there, and they use it for all sorts of good-looking games. So, lots of opportunities.

Okay, what does the developer have to do — what does he face? For the application developer, I think it's clear: it is another model to think in, so he has to model his algorithm differently. He also has to consider all the restrictions, for example the different memory types. There is often no caching involved either, so he has to think about the memory layout of his data structures and about memory access patterns; it is more bare-metal programming. So the application developer really has to think about it.

But let's switch to the compiler developer, because this guy also has something to do. The first point: my vision is to use one language and one compiler to emit both the CPU code and the GPU code. So the compiler must target, and produce object code for, multiple targets in one run, if we want to make it really easy for the developer. What I also need — and it is not part of the normal programming language — is some way to deal with the different address spaces: I have to be able to say "this is shared memory" or "this is local memory", and somehow the compiler must know it. And while we want to make things easy, we do not want to push everything into the compiler, so there is also a question of separation of concerns: what is the task of the compiler, and what is the task of the library?

Now let's look at LLVM — what can we do today? The official LLVM distribution has support for two GPU families, let's put it that way. The first is the NVPTX target, for NVIDIA GPUs. The other is for AMD/ATI GPUs: the R600 and AMDGCN targets. Of these two we currently only support the NVPTX target; I think that is just a lack of a testing environment, not a fundamental problem. What we also target is OpenCL, and that is a bit more complicated. OpenCL uses an intermediate representation called SPIR-V, which was derived from the LLVM IR at about version 3.2 or so and then heavily changed. It sounds easy — an IR derived from LLVM — but it is not. First, there is no back end in official LLVM which generates SPIR-V, and it is so different that you don't want to write one on your own. What's worse, OpenCL is maintained by the Khronos organization, and they have a patched LLVM with a SPIR-V back end — but it is an old LLVM, based on 3.6, and there is no official release of it. And here we have a problem: support for LLVM 3.6 was dropped about a year ago. So, yes, a problem. What we did was port those changes to the latest LLVM, so under this link there is an LLVM version 6, and also a development version 7, which contains the SPIR-V back end. But this is a very suboptimal situation. So, for the commentary-or-wish section: what we wish for is an official SPIR-V back end in LLVM, and because it is an open API, it should not be that big a problem to integrate it.
Maybe a quality problem, but this would be our wish, because it makes targeting OpenCL very, very difficult if you have to say, "Oh, you need a special LLVM." At that time there was no official documentation on how to submit a back end, but with the Google back end — Lanai — a document was created on how to do that. So maybe we just need to point them to that document and say, let's do this. That would be good.

Okay, so for now I'm just using the PTX back end, because that is easy for me: I have the proper GPU in my notebook, so I can also test it. What do we need to change in the code generation to support the GPU? It turned out to be not that much. First, we have to deal with the address spaces, which means we get an additional qualifier when we declare global variables and pointer types: for every variable and pointer we use on the GPU, we have to say which address space it belongs to. Next, when we look at functions, we have different calling conventions. There are two kinds of functions. The first is a function called from the host, which is called a kernel; for a kernel we have to use the PTX kernel calling convention. If a function runs only on the GPU device and is not called from outside, we use the PTX device calling convention. And last, you need to generate some special metadata to annotate a kernel function, so that the graphics driver knows, "this is a kernel function": you put a reference to the kernel function into the metadata and declare it as a kernel, and that's it. So this is very easy to add and does not pose any complications for the compiler.

At the source level it is a bit different. This is a kernel function you can actually compile and load onto your graphics card — it is working code. What you have to do: this is a module, and we annotate the module to say, "this goes only to your graphics device"; it contains no host code, just kernel functions. We also need a special import, the dcompute import specific to our compiler. This import just provides the @compute annotation, the @kernel annotation and some templates — it is very small, maybe 100 lines or so. The GPU is restricted, as I said before, and therefore we cannot support all of the D features: garbage collection, a lot of the TypeInfo, exceptions — all this high-level language machinery. We have to switch off at least the TypeInfo generation, with this pragma. That is really required, because there is no real language runtime on the graphics card; the module simply would not load when module initialization is called. We can't use it, so we prevent its generation with the pragma.

I already mentioned the @kernel annotation: it says this function is a kernel function and is called from the host application. If I omit the annotation, it is just a device function, which can be called from another device function or from a kernel, but not from the host. The signature is as usual. Then we have some templates; there is the GlobalPointer template, which says, "this is a float pointer into global memory". There are other templates — for example PrivatePointer or SharedPointer — for the other memory types, so all the available memory types are covered: you just instantiate the template with your element type and you have your pointer.
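To give an idea of what such a kernel looks like in D, here is a minimal sketch along the lines just described. It follows the dcompute conventions as I understand them (@compute and @kernel, GlobalPointer from the compiler-recognised ldc.dcompute module, GlobalIndex from the dcompute standard library), but treat the exact spellings as approximate — they may differ between versions.

    @compute(CompileFor.deviceOnly)   // whole module is device code; no host code in here
    module kernels;

    import ldc.dcompute;              // @kernel, GlobalPointer, ... (the "magic" module)
    import dcompute.std.index;        // GlobalIndex: the unified thread-index abstraction

    @kernel void scale(GlobalPointer!float data, size_t n, float factor)
    {
        auto i = GlobalIndex.x;       // this thread's position in the one-dimensional grid
        if (i >= n)
            return;                   // guard against the last, partially filled block
        data[i] = data[i] * factor;   // one element per thread
    }

The GlobalIndex abstraction used here is exactly the kind of library-level unification over CUDA and OpenCL that I'll talk about next.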
What we also have is some library support, because our vision is not only to use one language for both the CPU and the GPU, but also to build a library which hides the differences between the CUDA and the OpenCL APIs. In general they do the same things, but through different APIs with different names, and that complicates matters, so we created a runtime layer above them to unify the source. This is one example of it: you have a lot of threads running, this is the code for one thread, and this thread needs to know which data it has to work on. You usually get that from information you can query from the GPU. Depending on how you arrange the data — as a one-, two-, or three-dimensional array — you want the corresponding index into that array. This example uses a one-dimensional array, so I want the X index into it, and I can simply use this abstraction to get that index. The computation of this index differs between the CUDA and the OpenCL APIs, but the semantic meaning is the same: I need the index to know where my data sits.

Okay, that is the source level. At the implementation level, I already said we have the annotations and the templates — the pointer template is very simple. That is the magic file; we have a couple of these magic files, and the compiler knows about these attributes and templates and does the right code generation for them. That's it.

It also turns out that we need to switch off a lot of other high-level language features, and there is a compiler switch for that, which is a very nice thing: the betterC switch. The idea behind it is: you are not satisfied with C, but for some reason you can't use full-blown D — maybe you are on an embedded device with no threads and no garbage collector — and you still want some of the advantages of D, used in a C-like way. That is exactly what the betterC switch does. It turns off some of the features, but you are still in the D environment and it integrates seamlessly with the other modules; it is just more like C. We need it here simply because of the restrictions of the GPU — not because we don't want to support, say, garbage collection on a GPU, but because the device simply does not support it.

What we also did was implement a new switch to say which graphics devices we compile for. That is the dcompute-targets switch — "targets" in the plural, because you can provide a comma-separated list of targets — and with this one invocation, depending on your source, you generate object files for the host and for n graphics targets. So in this case we could target different CUDA versions, and also OpenCL, with one compiler invocation. That is very comfortable, and it is the vision we had: one source, one compiler invocation, and we get everything we need.
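For illustration, an invocation along these lines — treat the exact flag spelling and the version numbers as approximate — builds the host objects plus device code for two CUDA compute capabilities and an OpenCL target in one go:

    ldc2 -mdcompute-targets=cuda-350,cuda-500,ocl-200 kernels.d app.d

Besides the normal host object files, this should additionally emit the PTX for each requested CUDA version and a SPIR-V module for the OpenCL target, ready to be handed to the respective driver at run time.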
Yes — what else did we have? My main complaint about LLVM: whenever we support a new target — and from the compiler's point of view a GPU is just another target, like any CPU — there is work we always have to do: we have to implement a new ABI, because there is no real ABI support in LLVM.

Then, we implemented a lot of stuff in the library, and this is also the area where more work is needed, because we do not yet have all the abstractions we would like. So yes, it works right now, but it is all work in progress, and most of the remaining work is in the runtime library; the compiler is okay for now.

So, my conclusion, or commentary section: adding the actual GPU support was pretty easy. The biggest problem we faced is that the compiler is very old — I think 15 years by now — and there are some global variables used for the LLVM target, the data layout in use and so on. The main LLVM data structures sit in global variables, and with our implementation we have to switch them at runtime, which turned out to be a bit nasty. We have a solution for it, and we are working on removing the global variables; this was a problem only for this feature. As I already said, it is still work in progress — especially the library and its abstractions need more work, but it is growing over time. And again, we want that SPIR-V back end; that is what is really missing. During the implementation of other targets there was always the problem of "oh, I miss this, I miss that" — that part was very easy with the GPU target, so the support looks good. But now our complaint is that we want a real additional back end added, and we can't do that on our own.

Okay, that's it. Thank you very much. Are there questions? I think we have about two minutes.

[Audience question about debugging.] So the question is: do we have support for debugging? I have to admit, no. The switches are in place — we have a switch to generate debug info — but there is no possibility to debug on the GPU, and it is not even simple to say "let's just run it on the CPU", because you design your algorithm to run with tens of thousands of threads, and that is a bit more than the four or eight threads you have on your CPU. So that really is a problem right now, and I'm not sure I have a solution for it, because with a GPU device you just submit your kernel, say "invoke this", and then it runs asynchronously to the CPU. I'm not sure what debugging support is actually available; I have to investigate this.

[Audience comment:] It is not clear how you would debug this, because the semantics will be different. With the SIMD semantics, even though the results may end up being the same, the steps you go through will not be the same — so you debug something on the CPU, and when you go back to the GPU you either don't have the problem anymore or you didn't fix the problem you thought you had. So that's a really hard one. And a nasty problem is that even a simple printf or writeln is not possible, because you don't have anywhere for the output to go. [Inaudible exchange.]

[Audience question:] How difficult would it have been to port this to a newer version of LLVM? — That's a general question: how difficult is it to adapt this to newer LLVM versions. Well, LLVM changes very much from version to version, so if you try to port something from one version to another, there is always a significant effort. I don't know exactly — Niklas did all the work to get the SPIR-V back end onto LLVM 6, and I'm not sure how much time he invested in that part; he needed, I think, about ten months for the whole feature.
That included both the D-compiler side and the LLVM part — it is a lot of time. And as usual, from one LLVM version to the next there are so many changes that it is always an effort you can't neglect.

[Audience:] But the closer you stay to the current version, the easier it is to move? — Yes, for sure. The problem with this one is that it goes from 3.6 up to 6; that is years of changes, so it is practically a completely different component. I once tried to port something myself, I think from 2.9 to 3.3, and it was a big, amazing chunk of work. So it is not easy. [Partly inaudible comment about there being a really simple example back end in LLVM to learn from.] Yes — that is another approach.

So, another question? [Partly inaudible comment: there may be an updated, different code generator on its way into LLVM, and someone may be working on upstreaming it.] Yes. What I can say is that even Khronos and the companies that are part of Khronos are aware that this is a problem. So, to repeat, the good news is that there is at least someone working on getting this SPIR-V back end into LLVM.

[Audience question:] The OpenCL people had a lot of trouble with the middle-end passes they had to cancel, because those passes were destroying the GPU code. Do you have to do the same? Because if you have GPU and CPU functions in the same file, it is going to be really hard to know which passes to run on what. — Okay, so the question is that there are a lot of LLVM passes which must be switched off for the GPU code generation. Yes: we parse the D file, we then have an abstract syntax tree of it, and we run the code generation for the targets sequentially, and we also use a different pass pipeline per target. So yes, that is needed. There are also special passes you need for PTX — the NVVMReflect pass, for example — which are not needed on the other targets. So yes, a special pass pipeline is used.

Okay, thank you very much.