Can you hear me now? Does this work? Okay, fine. So, I'm going to show you array programming, and how to do array programming in a language that I've developed along with other people, called Futhark. I'm going to show how a new language can nevertheless be useful, by showing how it can easily interoperate with Python, and I'm going to talk a little about how the performance of this language compares to handwritten GPU code.

Okay, first thing: there are two kinds of parallelism. The one most people think about is task parallelism, where you spawn a thread that goes off and does whatever on some data, and you can spawn another thread that goes and does something entirely different, maybe on the same data, maybe something else, but the two threads are completely independent of each other. Data parallelism is where you apply the same function or the same operation to multiple elements of some data set. The simplest example is what we have in functional programming called a map, which takes two arguments, a function and an array, and just gives you back a new array where that function has been applied to each element of the array. In Futhark we use this notation for function application, without parentheses: this says apply the function map to f and this array. All right, nothing magical going on here.

Array programming is an instance of data parallelism, and you've probably already done it, because NumPy for Python, and similar libraries for other languages, is an instance of array programming, where we do bulk operations. We just say we want an array of 10 elements; each of those elements we want to multiply by two, giving us back a new array; and then we can multiply these two arrays point-wise and sum the result. That's just a dot product we're computing here.
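The bulk operations just described could be sketched in NumPy like this; the specific array contents are my own illustration, not taken from the slides:

```python
import numpy as np

a = np.arange(10)     # an array of 10 elements: 0, 1, ..., 9
b = a * 2             # bulk operation: every element multiplied by two
dot = np.sum(a * b)   # point-wise multiply, then sum: a dot product
print(dot)            # 2 * (0^2 + 1^2 + ... + 9^2) = 570
```

Each line is a whole-array operation; there is no explicit loop over elements, which is exactly what makes this style amenable to parallel execution.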
But this idea of doing bulk operations on arrays, which can be very large, is a very good way of expressing parallelism: it's easy for humans to think about and very efficient on massively parallel hardware like GPUs. Array programming is actually a pretty old model. It was seen first in APL, which was popular for a time, but you need a special keyboard to type it, so let's not do any more APL. This is Futhark, which you can type with an ordinary keyboard. It's a very small language. It looks a little bit like Standard ML or Haskell or some other generic functional language, if you've seen those before. We can define a function that takes an input array of length n and gives us back an array of length n, just adding two to each element of that array. Or we can define a function that sums an array using reduce, which is functional-language lingo for using a function to turn an array of values into just one value. Conceptually it puts this function, this operator, between each of the elements in the array, so that sums the array. The function can be any binary function; it has to be associative to be parallel, but let's not worry about that.

The nice thing about Futhark is that it's very free-form. When you've defined a function that has some parallelism inside, you can still use it inside another parallel context. So you can do a map that uses the sum function we defined up here. Now you have two layers of parallelism: a map on the outside and a reduce on the inside. That's called nested parallelism, and support for it is comparatively rare, because it's tricky to compile. One thing the Futhark compiler will do is turn this nested parallelism, which is nice for humans and nicely composable, into flat parallelism, which is the only kind the hardware can handle, so you end up with just one level of parallelism. That's tricky, and unfortunately an explanation of how it's done won't fit in a 20-minute talk.
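The map/reduce nesting just described can be mimicked in plain Python to show the shape of the idea; this sequential sketch is my own (function names included) and of course gains none of the parallelism:

```python
from functools import reduce
import operator

def add_two(xs):
    # A map: apply a function to every element independently.
    return [x + 2 for x in xs]

def total(xs):
    # A reduce: conceptually puts `+` between each pair of elements.
    # The operator must be associative for a parallel execution to be valid.
    return reduce(operator.add, xs, 0)

def row_sums(rows):
    # Nested parallelism: a map whose inner function is itself a reduce.
    return [total(row) for row in rows]

print(add_two([1, 2, 3]))          # [3, 4, 5]
print(row_sums([[1, 2], [3, 4]]))  # [3, 7]
```

In Futhark, the outer map and the inner reduce are both parallel operators, and the compiler's job is to flatten the combination into something a GPU can execute.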
So I won't talk that much about that. Futhark also has sequential loops. It's a pure language, so there are no destructive updates, but you can kind of fake them by saying: okay, start with a value x that's equal to 1, then run this number of iterations, and in each iteration compute a new value of x with this equation, multiplying x by i plus 1, and run the loop again until i hits n, at which point x is returned. It's just sugar for a tail-recursive function, if you're into functional programming. Arrays can be constructed with some built-in constructs: iota, which is just like range in Python, gives us back a range of consecutive integers, and replicate copies some value a number of times. The arguments to replicate can themselves be arrays, but I won't be using that.

So, on to an example. This is the Mandelbrot set, or rather a visualization of part of the Mandelbrot set. The way to create these nice graphics is just to apply this simple function, written here in Python, to a bunch of complex numbers. The function just sees how many times the complex number can go through this loop without this condition becoming true, with a cutoff point so it doesn't iterate forever. You just return how many times you went through the loop, and you can use that number to turn it into a pretty colour, and then you get a nice visualization. It's basically just repeatedly running this pretty simple function, and since you run the same function on a whole bunch of complex numbers at once, you can do that in parallel.

In NumPy it would look like this: we have an array of complex numbers, and then we do some weird operations where, for each of these complex numbers, we check the stop condition, then figure out which ones have not stopped yet, and for those that haven't stopped yet we set the escape count to the loop counter. It's very complicated.
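The two styles just described might look roughly like this in Python; the function names, the cutoff, and the exact escape-boundary convention (|z| >= 2) are my own choices, not the speaker's code:

```python
import numpy as np

def divergence(c, cutoff=10):
    # Scalar escape-time function: count how many iterations of
    # z = z*z + c it takes before |z| reaches 2, giving up at `cutoff`.
    z = 0j
    i = 0
    while i < cutoff and abs(z) < 2.0:
        z = z * z + c
        i += 1
    return i

def divergence_numpy(cs, cutoff=10):
    # Bulk NumPy version: the iteration loop survives, but per-element
    # control flow becomes whole-array mask updates. Every iteration
    # writes several full arrays, which is what makes this style
    # memory-bound.
    zs = np.zeros_like(cs)
    escapes = np.full(cs.shape, cutoff, dtype=np.int64)
    for i in range(cutoff):
        zs = zs * zs + cs                       # one full array written
        diverged = np.abs(zs) >= 2.0            # another full array
        just_escaped = diverged & (escapes == cutoff)
        escapes[just_escaped] = i + 1           # record the escape count
    return escapes

cs = np.array([0j, 2 + 2j])
print([divergence(c) for c in cs])  # [10, 1]
print(divergence_numpy(cs))         # [10  1]
```

The scalar version stops iterating the moment a point escapes; the bulk version keeps grinding over all elements every iteration and masks out the ones that are done, which is both harder to read and much heavier on memory traffic.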
The original simple control flow that we had in the mathematical definition is gone and obscured. But even worse, not only is it unreadable, it's also slow, and that's what I care the most about, because for every iteration of this loop, which usually runs on the order of maybe 200 or 300 times (it could be whatever you want, but it's usually fairly large), we write three arrays. That means we are bound by memory speed, because we write these possibly very large arrays to memory, which is a problem because memory is very, very slow, and I'll show you just how slow in a moment.

In Futhark it looks like this. We have the sequential function I showed you in Python before, just written in Futhark, doing the same thing, and we simply map it. It's two-dimensional because we use the normal visualization of the complex plane; it could be one-dimensional if you had some other interest. The interesting thing here is that only one array is written, because we just run this simple scalar function, which doesn't use any arrays, over all the complex numbers, so everything can be kept in registers. That's the compiler's business, but it's something the programmer can rely on: primitive values are kept in registers, and only at the end is a two-dimensional array written to memory.

The performance difference between these two styles looks something like this. They both scale pretty well as the number of complex numbers we're working on increases, but where they top out differs. The NumPy style, where you write a lot of arrays, tops out at 12 times faster than sequential code; this is not Python code, it's actually GPU code written in the NumPy style. The code written in the Futhark style, where you don't have all these memory accesses, becomes 350 times faster than sequential C code. That's a significant difference, and it's entirely down to the so-called memory wall: modern computation is so, so much faster than modern
memory banks that touching memory is just a killer. If you write to memory, your program is going to be slow; that's it. You do it because you kind of have to sometimes: unfortunately the user can't see the values of registers, so sometimes we do have to write to memory, or even worse to disk, or to the screen, or what have you, but you want to avoid it; it's a very, very bad idea if you want your code to be fast.

Futhark is a little bit tricky because it's a pure language. For one thing, you don't have mutable variables; you also don't have the ability to write to the screen or to a file or anywhere, so it is a little bit exotic. You have to compile a Futhark program, and that starts by writing a main function that takes some input and produces some output. In this case I just make up some complex numbers (I haven't shown you this function; it doesn't matter, it just makes an array of complex numbers), run the function I showed you before, and sum up the escape values of all these complex numbers to produce one integer. It doesn't compute anything meaningful. When you want to run this program, you have to tell Futhark: okay, run this function with these arguments and give me the result. You do that by compiling it with a compiler, and then you pass the input on standard input and get back output on standard output, which is a weird thing, but Unix people like this. When compiling a Futhark program to a standalone program (that's not for production use; it's for debugging and benchmarking), there are some useful flags. One of these is -t, which asks it to benchmark itself: in this case it says, okay, this is the result and this is the runtime in microseconds, so 611,000 microseconds. This is using the futhark-c compiler, which generates sequential C code that is then compiled with GCC. There is also the futhark-opencl compiler, which generates GPU code, and there it runs in just 7.5 milliseconds, so about 80 times faster without any changes to the program itself, and the program
itself doesn't really talk about GPUs at all; it just uses these parallel operators, map in this case, and it just magically runs faster when compiled with the OpenCL compiler. But this is not the way you want to use Futhark: a nice command-line interface for summing your Mandelbrot sets is not really what anyone needs in practice. The trick here is how OpenCL works. There are two libraries for communicating with GPUs: one is CUDA, NVIDIA's proprietary thing, and the other is OpenCL, which is an open standard, but much nastier to use for a human programmer. It's nice as a compiler target, though. The way it works is that you actually have two programs: a program running on the CPU, called the host, which uploads code and data to the GPU, which is a kind of slave that you just send commands and data to. The interesting thing here is that the host code doesn't actually compute that much. In a well-written program, where the parallelism is large, it's just bookkeeping, and it doesn't have to be fast. In particular, you can have the host code be in some high-level language that's easy to integrate with, and it just talks to the GPU for you.

The way we've done this is that we have added a code generator where the host-level code is written in Python: the compiler generates Python that internally uses the PyOpenCL library to upload code and data to the GPU. So you just use a different compiler, futhark-pyopencl, and you ask it to create a library, and it produces a Python module, mandelbrot.py, which from the outside looks like any ordinary Python module. You can start Python and import it, and then there's a slightly strange thing about how you have to use it: it defines a class that you instantiate, which sets up some GPU state, and then that object defines a method for every entry point in the original Futhark program; in this case there's only the main function. And you just
pass it ordinary Python values and you get back ordinary Python values as a result. Behind the scenes it has compiled some GPU code, and it asks the GPU to execute this function for these arguments, and you can call it again and it'll give you back a different result. For passing arrays it uses NumPy arrays, so you can pretty easily integrate it with existing Python libraries, although of course NumPy arrays are not going to live on the GPU, so there's some cost of copying back and forth. You can use that if you really want to sum up Mandelbrot sets, or we could modify the program a little bit: instead of just summing up those escape counts, we can turn them into RGB pixel values, give back a two-dimensional array of pixel values, and ask a library like Pygame to just blit it to the screen. Then we get something like this: an interactive Mandelbrot viewer with a Python front end for handling all the keyboard commands and all that stuff, but with all the computation happening on my Intel GPU in real time, much, much faster than the CPU could ever hope to do it. A pretty nice division of work between a restricted high-performance language and a very flexible, dynamic language like Python.

Okay, so the only reason you would ever want to use a restricted high-performance language is to get high performance. It's not a terrible language, I don't think, after all this time, though maybe that's my Stockholm syndrome talking, but it would be nicer if you didn't have to use it at all. So is it fast? Is it worth it? Well, it depends, because I can easily just show you some benchmarks that show it's much faster than everything else, but you shouldn't trust that, because it's very, very hard to honestly determine whether a language is fast. The method that I distrust the least is to take existing programs that are said to be written in a good and decent way, port them to my language, and say: well, this is how fast it is now. Unfortunately, most benchmarks don't
really implement algorithms that are designed to be parallel, so I can't use the normal language benchmarks game, the one that Debian hosts or whatever, because those are usually sequential programs with side effects, writing to files and the screen and whatever, and my language can't really do that. There is a benchmark suite called Rodinia, which I don't expect anyone to know unless they've done HPC in academia. It contains hand-written OpenCL code, GPU code of very varying quality; much of it is written by doctors who should really stick to human bodies rather than writing code, or by physicists, whose code is at least fast. We've ported some of these Rodinia benchmarks to Futhark, run them on an NVIDIA GPU and an AMD GPU, and these are the speedups we get compared to the original code. I don't know if you can see these small numbers, but for this one the Futhark version is four times faster than the hand-written version on an NVIDIA GPU, and on an AMD GPU it's two times faster. Of course, this mostly means that the original program is bad: there's no way a compiler should be able to generate code that is significantly better than something written by an expert who put in enough time to write good code. So these benchmarks are not very convincing, except for some of them, where we don't manage to beat the hand-written code but get close, and of course the Futhark code is vastly easier to modify, understand, and extend. That's what we're going for: not really trying to beat hand-written code written by experts, but getting close while providing a much better programming experience along the way. And in this case we're 17 times faster, because we exploit some parallelism that they didn't. It's hard to prove that your language is good when everyone writes such slow code.

So, to sum up: it's a small language, very simple to learn. It's high level, so it's not actually GPU-oriented; we could generate multicore CPU code as well. It's purely functional, which is weird but fits
this parallel model pretty well. It's data parallel, so there are all these operators with parallel meaning that the compiler understands and can optimize. Currently we have a compiler that generates good GPU code; in the future it will also be able to generate good CPU code, and maybe even cluster code, but we haven't gone there yet. We have a good idea of how to integrate it with other languages and applications, not just Python; that's what we have right now, but we could easily create an Erlang, Ruby, or C# backend, or Java, it doesn't matter; the host code is very simple. And the performance is okay: we've also tried a more challenging benchmark suite, where we don't beat them quite as much but still do pretty respectably, and of course the Futhark code is much easier to understand. It's all available online, under an ISC license and all that stuff. All right, so that's it. Thank you very much.

Thank you very much, Troels. We have time for a few questions. So, I was asked: we have a Python interface; will we also have a C interface? It's not quite as mature, because C doesn't have a well-defined standard for what a multidimensional array should look like, for example; in Python we could just use NumPy's conventions. But it should be very easy: we do generate C code that works fine as a standalone executable, and it would be easy to make that a library too. Do you have it?
Yes, yes. So, I was asked whether we have a compiler that uses a C wrapper instead of the Python wrapper. That's what this one does; it just doesn't have the --library option, so it doesn't generate fully reusable code, it doesn't generate the nice wrapper code, but that's just because, again, C doesn't have an equivalent of NumPy, with known conventions that we can just adopt, so it requires a little more thought about how we create an API for this. You can do it manually, sure.

So, I was asked what OpenCL version is necessary. There are two answers to that. The first is that I have often questioned the wisdom of doing a PhD that requires working graphics drivers on Linux; you can get them, but you all know that problem. NVIDIA doesn't like OpenCL much; they support it, but a fairly old version, and since NVIDIA hardware is so popular, we can't use features newer than OpenCL 1.2, which I think is the newest NVIDIA supports. That's also all we need. It works on all the GPUs I've tried: NVIDIA, AMD, and Intel, and also an ARM one at some point, I think. We don't use any fancy features.

So, what about non-GPUs? Yes, I tried it on a Xeon Phi. It runs correctly, it's portable OpenCL code, but it doesn't run very fast, because the compiler has some assumptions about how memory should be accessed to be fast that are valid on all modern GPUs but not on a Xeon Phi, as far as I can see. So there's a significant slowdown, but that could be fixed; it's just a matter of tweaking the compilation pipeline for Xeon Phis, I think, but I don't know. And that was an old Xeon Phi; they've made a new one that I haven't tried yet.

Thank you very much again.