Okay, good morning everyone. Our first speaker today is John Stem, who'll be speaking on making code run fast on all the things with OpenCL. Thanks.

So thank you all for showing up. It's nice to see there are so many OpenCL nerds out there as well — or at least I can hopefully make you OpenCL nerds, because the language really does need some exposure; we need to get on board as an open community. So I guess I should get into it. I'm talking about making code run fast on all the things, and I'm really lucky that I've been invited back again to talk on OpenCL, so I'm not rehashing all of my old slides. Is there anyone here who was at my talk in Perth last year? No? I could have just recycled the same content and you'd never know. Okay, I'll get into it.

All right, so this isn't last year's talk — but thanks for having me back, and thank you for showing up. Last year's talk was mainly focused on the kernel, the OpenCL compute kernel — I'll get into what that means later — and also on prototyping parallelism with OpenMP. Today we're more about saving time and targeting the host code. So what we're really focusing on today — and if I had a chance to change my slide title, this is what it would be about — is time. There will be a little bit of overlap, because I have to provide some introduction to OpenCL, but hopefully it won't all be redundant content that you've heard before.

Okay: "Yesterday is a cancelled cheque, tomorrow is a promissory note, today is the only cash you have — so spend it wisely." Time is money, and when we program with OpenCL we're trying to accelerate computation; that's why we use these accelerator devices. We're trying to map some computationally intensive portion of code onto an accelerated device in the hope of speeding up the computation — either because our users are complaining about non-responsive apps (because this works on mobile as well as desktop), or because we want to compute our scientific results quickly. But there's a caveat: you don't want to pick up OpenCL for the first time and have it take you, the developer, a month to get something running just to save a few seconds. So that's mostly what I'm focusing on today.

So this morning I'll cover — I had some feedback yesterday that people thought I was fanboying over OpenCL. It's not a perfect standard, it's not a perfect runtime, there are some problems, and I'll cover those early on, but I am very much pro-OpenCL and I'll get into the reasons why later. Then I'll get into why the host is boring — why is it so? — and then I'm going to do the boring bits for you, and hopefully there'll be some code reuse so you can hit the ground running in your OpenCL endeavours. Then I'll have a recap of some of the things that were postulated last year, all the up-and-coming features, and I'll discuss what has actually eventuated in the past year.

Okay, could I have a show of hands: who has slow code, who's ever had slow code, anything you could speed up? Yeah, okay. But who has time for slow code? No. Okay, so we're all about time, and everyone's heard of accelerated devices. By accelerated devices I mean all of them, but namely: we all know about central processing units, right?
CPUs are really fast — they have a few cores, usually running at three to four gigahertz — and they're really good at branching, but they're expensive to run and expensive to manufacture. GPUs aren't so fast per core, but they have a stupidly large number of cores, driven along by the video game industry. So if we can take hardware that's been pushed by the video game industry to render three-dimensional environments, we can use it for our own types of computation. DSPs are cheap and great at streaming applications. FPGAs are also great at streaming applications and they're really cool, but they have an expensive initial cost — the outright manufacturing price. So we're interested in taking these accelerated devices out of their normal use and targeting them for non-standard purposes using OpenCL.

Okay, so imagine a world where each device has its own language, APIs, IDE and ecosystem. It's not too hard to imagine, because we have it today, and that's why we should be interested in OpenCL: you don't want to have to rewrite the same blob of code every time the next vendor produces something awesome. To single out an example, there are a lot of people who use NVIDIA GPUs, and they're locked into that ecosystem because they develop in CUDA. When AMD puts out a rival card that has some better feature for their type of application, they have to reimplement it all over again on a new platform, and no one has time for that — we're interested in saving time. And we're heading back in that direction unless we can really push OpenCL, so we as a community need to get on board and increase its adoption.

So what is OpenCL? Well, it's mostly open: it's a standard, an open standard — not necessarily open source. There's a standards body, the Khronos Group. They get together with all of the vendors that are on board with the consortium, the vendors say what they want, and then they all agree on some middle ground and say, okay, here's your C API. Then, behind the scenes, they each implement their own, usually proprietary, vendor runtimes — but at least we can contribute our open source code against this one common API and have it run pretty well on the target device. So OpenCL is good for code reuse, because we don't want to reinvent the wheel, as I said before, and we can develop for a large range of devices in one shot if we're smart about it. I mostly spoke about that in the last talk so I won't rehash it, but if you want to check out the video from last year, I have a link at the end of the talk.

Okay, so here's OpenCL running on all the things — I'm sorry the plot didn't come out too well. This is just an arbitrary blob of code that I needed to speed up, and it has O(n³) complexity, so every time we increase the length of the signal the amount of computation grows cubically — that's a lot. I'm running it on some GPUs like an ATI Radeon, a couple of Intel CPUs as well as the Xeon Phi co-processor, a couple of NVIDIA GPUs, and also a DSP, the Texas Instruments Keystone 2, which I'll talk about a little more later. I'd have to get into the kernel to explain it fully, but my blob of code actually shows good, roughly linear speedup.
So for each core we add, performance improves by about that factor for most of these computations — anyway, it's not that interesting, you can ask me about it afterwards.

Okay, OpenCL through the ages. It's a mature standard now — it's five years old and it's constantly evolving. Version 1.0 came out in 2008 and the Khronos Group have been putting out new versions ever since, targeting the newer devices as they appear. The standard is constantly evolving, which unfortunately also means it's diverging, and that's why you need to get involved.

So it's not all fun and games. On the left is the OpenCL version — those are the rows — and the columns are just a small selection of the vendors that support OpenCL; there are more than this. Can anyone spot the problem, if you were to develop an implementation of some code in OpenCL? The newer versions usually have new, better features that make it easier to program, but you can see that certain vendors aren't supporting the new versions of the standard. They're still on the committee, they're still pushing new things into the standard, but they're just not supporting them, because it doesn't fit their ecosystem. Perhaps if we reach some kind of critical mass and get more adoption, it might force — or encourage — those vendors to keep up to date with the latest standard. If the whole idea of programming in OpenCL is to encourage code reuse and write code once, you don't want it diverging this significantly.

And there's more bad news. As well as adopting (or not) the latest versions of OpenCL, each vendor is adding more and more device-specific extensions — things like particular atomic operations, and unusual precisions like halves and doubles, but only for certain cards. And SPIR is pretty cool. But if everyone gets involved in using OpenCL, we can help determine which device extensions are actually useful; at the moment I don't think the vendors are getting feedback from the community.

Thankfully, well-written OpenCL code can be faster than OpenMP on CPUs, because it doesn't have to carry as complicated a runtime — there isn't as much overhead as OpenMP has. This plot shows absolute time, and the number of threads corresponds to the number of cores. It's a Xeon X5650, which has six physical cores and twelve hyper-threaded cores, so for each core we add the time gets better, until peak performance at twelve cores. I'm not going to go through that for every device, because you'd need to know all about the devices and that's not what this talk is about. But we can see that, because we're not having to manage an OpenMP runtime layer, OpenCL code can get more raw power out of the device.

And thankfully it's comparable to CUDA on GPUs — which, remember, is specific to NVIDIA GPUs only. This is just from a Sobel image filtering benchmark; I didn't write it and I didn't generate the slide, but there's quite a bit of literature on various benchmarks showing OpenCL and CUDA at parity. It's a bit of a myth that they diverge.

And it's the way forward for digital signal processors. This is the Keystone 2 that I mentioned earlier; it's made by Texas Instruments.
I think it got released around mid-2012, so it's still pretty new, and they're using OpenCL predominantly as the development environment. The architecture has a quad-core ARM A15 and eight DSP cores, and it's incredibly energy efficient too — I think it only draws about 15 watts at peak for the entire system-on-chip. Very cool devices. And it's all shared memory. This type of device is well suited to OpenCL computation, and I'll get into that now.

So, why the host is so boring. This device suits OpenCL because you need to have a host program running plus a kernel, and the kernel is the computationally critical region that you've already identified. It has to be written as a separate program, and the programming language for that is a subset of C — a restricted version of C99 — and it has to run on the device. The problem is that you need a host program that can figure out: oh, you want to run on this device? Okay, I need to compile this program for that device, then I need to work out what memory buffers you want to use, then I need to push the memory buffers onto the device, then I need to tell it to run — hopefully it doesn't crash — and then pull the results back. And that's pretty boring, because it's the same thing again and again: whatever device you want to leverage, you have to go through the same set of steps.

So we have a platform layer at the bottom, which isn't really part of the runtime API, but it queries what devices are available and which platforms you want to run on; then for each device we have a context, and then there's a command queue, and the command queue is like the ordered list of instructions we want to execute on this accelerated device. So in the platform layer we've picked a device, we get a context on it in the runtime layer, and then everything after that is: we allocate this buffer, we put this buffer on the device, we generate a program, and so on. So it's boring. But this Keystone 2 showcases OpenCL well, in that it has the ARM A15 CPU — that's four cores — so you could have just one core running as the host, or all four, and you've got eight DSP cores as separate compute units that we want to target; they're physical cores. If we wanted eight threads running we'd allocate a context for each core and then execute, and because it's a DSP with shared memory, the host can just keep feeding those circular buffers.

Anyway, the host code is boring, so I thought I'd do the boring bits for you. I've probably taken myself off the market with this — if anyone from a large corporation was looking at me, they're not now, because would you really employ a programmer who works this slowly? Remember, I'm trying to save on developer time. So I've timed how long it takes me to write the host code in C, C++ and Python (there's a PyOpenCL module). Basically I wanted to give some kind of comparison, because I give terrible estimates of how long it takes me to implement some code, right?
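Before I get to the timings, the platform-layer-plus-context-and-queue sequence I just described looks roughly like this with the C API. This is a minimal sketch using the OpenCL 1.x entry points, with error handling mostly omitted — not the exact code from my repository:

```c
/* Platform layer: pick a platform and device, then create a context and
 * command queue on it. Minimal sketch -- error handling mostly omitted. */
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_int err;

    /* 1. Grab the first platform and the first device on it. */
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    /* 2. Runtime layer starts here: a context on that device... */
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    /* 3. ...and a command queue, the ordered list of work for the device. */
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

    printf("context and queue ready: %d\n", err == CL_SUCCESS);

    /* Buffers, program and kernel would be set up next. */
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    return 0;
}
```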
I think C is my most comfortable language, and yet it might take me the longest. But because it's only the host side, we can use high-level languages to create the host. Do we sacrifice execution speed, even in a context like OpenCL? On the host it doesn't really matter much, because creating the host is a one-time thing, right? We can develop the host — the boring stuff — faster and get it out of the way, like ripping off a band-aid, and if that comes with a slight amount of pain, do we really care? It all depends on the application, I suppose, but for a lot of applications you create the host once and then keep continuously running the thing on your target accelerated device, so there's no additional cost in running the host after that.

Okay, so I'll look at some lines of code. I'll walk you through some code to show you how to write the OpenCL host, but hopefully you won't have to do it yourself, because you can take my code from GitHub and modify it — it is essentially the same every time. I'll look at the number of lines of code it took me to get each of these major OpenCL host stages done, and then I'll go into time, lines of code and execution speed: does leveraging a high-level language hurt execution speed? If it doesn't, that'd be perfect, right?

Okay, so here's my metric. Remember that slide on the OpenCL host-specific stuff? I include the platform layer in the command-line parsing step, which is basically: I want the user to specify from the command line which device they want to target. So they pass in two command-line arguments, the platform ID and the device ID, and if they enter no arguments I want to poll all of the platforms and all the devices and list them, so that the next time they run it they can pick which device to run on. Then I generate the signals, which is very much language specific, and then I go through the boring OpenCL host-side code: we create the context and the command queue, we load the kernel from source, we generate the buffers, and so on; then we wait for it to finish, get the result, write it to file and clean up. I also include how much time it takes me to set up the entire environment. So the light blue is all runtime-layer OpenCL host code, and everything else is, I suppose, platform layer, plus the things that are done on the host because it makes sense to — it makes sense to load in all our data on the host and then send it up to the device, so we only run the computationally intensive region on the device.

Okay, so I hope I can just switch slides — okay, cool. I'll flip through to a kernel, just to show you what an OpenCL kernel looks like. Oh, you can't see it — is the colour okay? I'll make it bigger. How's that — bigger, bigger — is that okay? Everyone can see that? Okay, so here's the kernel.
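The kernel on screen is a continuous wavelet transform, but it's the shape that matters. Here's a simplified, hypothetical kernel in the same style — the names and the maths are placeholders, not the code from the demo:

```c
/* OpenCL C (a restricted C99 dialect). Helper functions are allowed on the
 * device; the entry point is marked with the __kernel qualifier. */
float helper_scale(float x, float s)      /* callable from the kernel */
{
    return x * s;
}

__kernel void cwt_like(__global const float *signal,   /* read-only inputs */
                       __global const float *scales,
                       __global float *result,         /* written back to host */
                       const unsigned int signal_len)
{
    /* Which work-item am I within the global range the host enqueued? */
    size_t i = get_global_id(0);
    if (i >= signal_len)
        return;

    /* Placeholder for the computationally intensive nested loops. */
    float acc = 0.0f;
    for (unsigned int j = 0; j < signal_len; ++j)
        acc += helper_scale(signal[j], scales[i % signal_len]);

    result[i] = acc;   /* store the result for this work-item */
}
```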
It's not not not the cleanest code so bear with me So we can have we can have Functions that we can call on the on the accelerated device and that's supported on all accelerated devices, which is pretty cool So we can have alternative functions But we also have to have a main method and that's that's always determined with this kernel argument It's like a hint and then we have this is this is the function name So it's it's a continuous wavelet transform function and then we have our memory buffers that we want to pass So anything with global that's globally accessible from the host And I'll show you when I when I want to go to the What it looks like on the host side when we create it But if you just remember this signature, so we have global and we have it cons So it's read-only and we have some arguments and then you know, so we have a few of these read-only memory buffers and then we have a right result that we want to slurp back it when the when the computation is done and We'll just skip that. We'll just skip the pre-processed out region, but if you look at these Basically it's getting the position in In In our range of work items and it was mostly discussed Last year and I don't think I have a lot of time to get into it But basically there's if you have questions about it, let me know and I have the slides at the end so we can go through it But but basically there's we specify an array Of how much stuff we want to do and that's the global global Global array global size we want to work on and then there's a local point that we can chunk it up into small bits And then given all this work you want to do in total I'm Get global ideas. I'm I'm this thread or I'm this core and I can work within this region So so it's it already the open cell runtime already petitions. What's available to us? If that's not clear I have slides after and I can explain it better Anyway, the computationally intensive region has a few, you know a few nested loops does some computation and then stores the result back Okay, I wonder if I could view all three at once Is it gonna be big enough do you think? Okay, so on the left we have Vim I Can't see I know we'll have to go one at a time. That's a pain Yeah, that's not readable at all is it Okay, so we have the command line passing in C. Anyway, so, you know, not nothing revolutionary there same old code To use the open CL Runtime we have this get platform IDs, you know, we we we pass in a pointer and Basically if these arguments are null then you have to read all the API and that's quite time consuming If these two arguments are null then it'll just return how many platforms are available And then we do some error checking and create some memory But and we have to do see things to This is just so this is just the code for if the vendor if the user didn't provide any arguments It'll polar polar platform work out what devices exist Collect the buffer print the buffer There's a lot of I mean there's a lot of code there them that to just print Okay, so so that's that's all done in C but if we This is to do the exact same thing in C++ It's not terribly exciting It's a little bit easier to the the C++ wrapper API doesn't make it does make it a little bit easier And we have to write a little less code to extract these these device specific Strings so we can should display them the user all terribly not terribly exciting. 
So I'll skip through the rest of that. Basically, if the user provides the right arguments we have our target platform and device IDs, and that's basically it for the platform API; then we have the runtime stuff. This is doing the exact same thing in Python — and I have to fess up, I only learnt Python last month, so maybe my search times aren't a fair metric, because I had to Google everything — but that's the same thing in Python, with a lot less code. I'll get to the metrics later on.

Now we're in the runtime API, the OpenCL runtime, so we get the device we're interested in and create a context on it. Generating the input signals is application specific, so you'd change that to whatever you want to run, but here it only took four lines of code to generate the arguments to that kernel you saw before. This is just generating the data on the host, not on the device yet — I'll get to that soon, and it gets a little more interesting. So, four lines of code in Python. This is the same thing in C++, and I have a gripe about the C++ OpenCL API, which is: don't use it. We have to have nasty bits of code up here — we can't use std::vector, which is the whole point of using C++. I imagined using C++11 with the new vector facilities; you can't use any of that, so we're back to using pointers, and then there's no point in moving from C to C++. I have nothing against C, it's just... yeah. All right, and then we generate the data — this does the same thing as we saw in the Python code: generates a couple of arrays and populates the values. Doing the same thing in C is about the same length as C++, because we're just using pointers anyway.

Okay, now for the host runtime layer — we can get through this pretty quickly. We have to create a context. You'll notice there are a lot more variables to manage with the C API than with a higher-level language; the C++ wrapper keeps them on the stack internally, inside its wrapper types, so we don't have to worry about any of that. And this is doing the same thing with PyOpenCL, the OpenCL module for Python: two lines of code. So you get the idea — there's a lot more code involved in the C implementation. But at what cost is that to the vendors, and to the developers?

So we need to set up the memory. We need to get the memory from our host, where I've generated it, onto the device, and we do that with these OpenCL buffers. We create the buffers against our context on the device, and basically we pass in the host pointer and it does a copy when it creates them for us — and it can be blocking or non-blocking as well. I'll skip through that and just go to the C++ version — I'll make this text big — cool; that's doing the same thing with these buffers. With C it's a little bit longer, but not too much. If you want to know what this preprocessor macro does, I suggest you watch the talk from last year.
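Buffer creation in C looks roughly like this — a sketch of one read-only input and one result buffer, assuming the context and host array already exist; the function name is just for illustration:

```c
/* Push host data to the device as OpenCL buffers. Sketch of the idea:
 * CL_MEM_COPY_HOST_PTR copies the host array when the buffer is created. */
#include <CL/cl.h>

void create_buffers(cl_context context,
                    const float *signal, size_t signal_len,
                    cl_mem *signal_buf, cl_mem *result_buf)
{
    cl_int err;

    /* Read-only input, initialised with a copy of the host data. */
    *signal_buf = clCreateBuffer(context,
                                 CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 signal_len * sizeof(float),
                                 (void *)signal, &err);

    /* Write-only output, to be read back after the kernel finishes. */
    *result_buf = clCreateBuffer(context,
                                 CL_MEM_WRITE_ONLY,
                                 signal_len * sizeof(float),
                                 NULL, &err);
}
```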
Okay, so we have to load the kernel from file. Our host is responsible for pulling in that buffer of text we've written — the kernel, in the subset of C — that we want to run on the device. We have to compile it specifically for the device we're running on and then execute it; that's what building the program does, and then we create the kernel. This is where we create those memory buffers I was talking about before, in the Python version, and then we have to set up the kernel arguments. These kernel arguments — we're obviously providing arguments to the kernel, and the API isn't terribly nice: we give the kernel we want to run, the argument index, then the size of the argument, and then a pointer to the argument, which makes sense in C. Doing the same thing in C++ is a little bit easier, in that the wrapper keeps hold of the kernel context behind the scenes for you; like I said, it lives on the stack, so we don't have to explicitly deal with that memory ourselves — that's one of the perks of using it. But to set kernel arguments with Python, it's all done for us: one call sets the kernel arguments and executes it in one shot, similar to how you'd call a regular C function. And then we wait for it to finish — we have our command queue and we wait on it. In C we execute with this enqueue-NDRange-kernel call and the kernel arguments, then we wait for it to finish: enqueue, then wait. Okay, and then we read the results back with this enqueue-read-buffer call, write them to file, and then we have to free up our memory resources. We have four explicit things we have to free here; with the C++ we have 14, and with Python we have none. So I'm quite a Python convert at the moment — I haven't found anything I don't like yet; I should have changed sooner. I think that's it for boring blobs of code. I just wanted to give you a taste of what boring, mundane OpenCL host programming looks like — it's a lot nicer in the Python version, and I hope you concur just from glimpsing at the code.
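Strung together, those runtime-layer steps in C look roughly like this — a condensed sketch reusing the hypothetical cwt_like kernel from the earlier sketch, assuming the context, queue, buffers and kernel source string already exist, with error checks left out:

```c
/* Build the kernel source for the chosen device, run it over the signal,
 * and read the result back. Condensed sketch -- error checks omitted. */
#include <CL/cl.h>
#include <string.h>

void run_kernel(cl_context context, cl_device_id device, cl_command_queue queue,
                const char *kernel_src, cl_mem signal_buf, cl_mem scales_buf,
                cl_mem result_buf, cl_uint signal_len, float *host_result)
{
    cl_int err;
    size_t src_len = strlen(kernel_src);

    /* Compile the kernel text specifically for this device. */
    cl_program program = clCreateProgramWithSource(context, 1, &kernel_src,
                                                   &src_len, &err);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "cwt_like", &err);

    /* Kernel arguments: index, size, pointer to the value. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &signal_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &scales_buf);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &result_buf);
    clSetKernelArg(kernel, 3, sizeof(cl_uint), &signal_len);

    /* Enqueue one work-item per sample, then wait for the device. */
    size_t global_size = signal_len;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);
    clFinish(queue);

    /* Blocking read of the result buffer back into host memory. */
    clEnqueueReadBuffer(queue, result_buf, CL_TRUE, 0,
                        signal_len * sizeof(float), host_result, 0, NULL, NULL);

    /* Explicit cleanup of what we created here. */
    clReleaseKernel(kernel);
    clReleaseProgram(program);
}
```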
That's the time each version took, and there are a couple of outliers. The install time for the PyOpenCL module was the killer for me: it took 45 minutes of combined search time and install time just to sort out the different dependencies. But once you get it running, it's pretty cool. For the command-line parsing, as you saw, more lines of code basically meant more time — no surprise there. But I want to save you time: we could do the command-line parsing in 33 minutes in Python, versus 76 minutes in C++ and 123 minutes in C — although that's a bit biased, because my C and C++ times should really be closer; I implemented the C version first, so it took a little longer, and the Python one was pretty quick. And these are the lines of code for each portion; I think the results speak for themselves there. In total it took 501 lines of code in C, 255 lines in C++ and only 118 lines in Python.

But the running times? Not that different, really. This is a very specific workload — we're looking at wall-clock time — and you'd expect Python to run a little slower because of the bindings, and because I'm not using PyPy. I've also checked with different workloads — I haven't got the slides here — but that overhead is a one-shot cost: the additional 200 milliseconds or so is roughly constant, because it has to go through that extra API layer to hit the device.

So my concluding thought is that I should have learnt Python sooner. But you'll never have to start from scratch. I didn't want to go through the exact same example again, but to give you an idea: you only change the regions of code you're interested in — depending on your application you'd obviously generate different data and pass different arguments — and using the code I've provided, it took me eight minutes to write a completely new OpenCL program from scratch. It was about 20 minutes in C and about the same in C++. So I hope you can use that; that's the boring part of the code taken care of.

I'd like to do the recap now. How am I going for time? Cool — I'd better rush through it then; sorry, I didn't mean to prattle on too long and bore you. So, there's OpenMP 4.0 and the OpenACC accelerator directives, which is what I spoke about last year. Has anyone used OpenMP? Yeah — I think it's pretty cool; you don't have all of this host stuff, and I think that's the future of OpenCL. OpenMP 4.0 already provides accelerator directives. The idea of OpenMP — and you can go back to last year's talk for this — is that you take your computationally intensive for loop, you chuck a pragma around the top of it saying, hey, this is a computationally intensive region, and in regular OpenMP it maps down to a pthreads library: it does all the threading for you, a fork and then a join across all the pthreads behind the scenes. You don't have to write any of that boilerplate code we just saw — it's so boring; I've just wasted 20 minutes of your time on it and you'll never get it back. Now imagine doing that every time you wanted to write an OpenCL program; it's not practical. OpenMP 4.0 and OpenACC are two rivals — it's sort of a divergent thing, but the idea is the same: you have a computationally intensive block of code you want to run, and instead of just saying #pragma omp parallel for, you add one more directive naming an accelerator, and it will map the loop to your accelerated device and implement all of the host stuff for you — it'll work out your arguments and execute everything behind the scenes.
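In practice that looks something like this — a hedged sketch of the directive style, using an OpenMP 4.0 target region and the equivalent OpenACC loop, not code from the talk:

```c
/* Directive-based offload: annotate the hot loop and let the compiler
 * generate the host-side boilerplate. Illustrative sketch only. */
#include <stddef.h>

void scale_signal(const float *in, float *out, size_t n)
{
    /* OpenMP 4.0: ship the loop to an attached accelerator. */
    #pragma omp target map(to: in[0:n]) map(from: out[0:n])
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        out[i] = 2.0f * in[i];
}

void scale_signal_acc(const float *in, float *out, size_t n)
{
    /* OpenACC: same idea, different spelling. */
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (size_t i = 0; i < n; ++i)
        out[i] = 2.0f * in[i];
}
```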
But you still need to know OpenCL, because it'll be using OpenCL behind the scenes, and if you want the raw performance and want to do it yourself, you still can — again, it's a trade-off between developer time and execution time.

Some of the cool stuff since last year: back then there were only commercial compilers for OpenACC. There's still no compiler for OpenMP 4.0, but you should check out OpenACC, because there are now three new open source compilers, and they leverage OpenCL behind the scenes. So I think we'll have to keep writing this host-side code until the directives become tenable — and in the meantime, please use my provided code so you can do it fast.

What else has happened? We're finally getting fast Fourier transforms and basic linear algebra subroutine support. As I mentioned last year, the main reason CUDA has such high adoption is that NVIDIA sank a lot of time into developing efficient libraries for basic linear algebra and fast Fourier transforms. Now we have a couple of open source projects that I encourage you to get involved in: clMath — there's the link — and the fairly recent clBLAS; so clMath has some BLAS routines and FFTs. There's SPIR, but I might skip that since I only have a minute — feel free to ask me about it in question time.

And we're getting even more vendor support. Last year no one supported the latest standard, OpenCL 2.0; now at least we have AMD and Intel, who are heavily behind the OpenCL standard. And if we write more OpenCL code, just by sheer numbers it might encourage other vendors to get up to the latest version of the standard — and that's why I need you. We're seeing OpenCL fragmented with these vendor-specific extensions, with each vendor pushing it their own way. So I feel that if we collaborate, work on accelerating our computations, and use this common standard, we're going to save ourselves time in the future, because we won't have to reinvent the wheel every time we want to target a new device, and we won't be tied into these vendor ecosystems — we can all share the effort. So let's get behind OpenCL and get involved. Thanks.