Yeah, thank you for coming, everyone. I'm going to be talking about accelerating Haskell. My name is Joe Nash, and I'm a developer advocate at Braintree. We're one of the sponsors of the conference. Not for any particular reason; I'm not recruiting or trying to sell anything, I just really like functional programming. If you haven't heard of Braintree, just a few obligatory sponsor slides: we're a payment processor, we help you make money, and some of our clients are pretty cool. Our clients include Mojang, who make Minecraft, among others. If you're interested in the payments industry, or any of the latest topics about security and that sort of thing, I'm happy to talk about it, but this talk is going to get down to business in a second. So, this is my first time in Boulder, or anywhere in America outside of New York and San Francisco. I'm from the UK, so these mountains are spectacular. I'm used to hills that look a bit like this, with Teletubbies on them. This is a fantastic place to be, and an important thank you to all of you for welcoming me here; it's really awesome. In particular, I'm from Essex, as you may have already noticed from the fact that I speak quite fast. Essex, if you don't know, is the English equivalent of Jersey Shore; we have TOWIE. Anyway, that's it: I have to speak fast to survive. But this is a small group, so if I do speak too fast, just interrupt me. If you want me to go back over something, just let me know and I'll go back over it, probably in the exact same accent and at the same speed, but we'll try and make it work. So I'm going to do quite a light session, because as someone pointed out on Twitter yesterday, this track is hella intense, and if you've been here all day, I imagine you're starting to get quite tired. I'm also really jet lagged.
So we're not going to go into too much depth. I'm not going to whack you over the head with a 400-line Mandelbrot-set generator or anything like that. But if you do want to go into more depth, we've got the unconference session tomorrow, and I'm also around all day. Hopefully we'll sort out a workshop tomorrow too, so you can join me in that. So yeah, why are we here? We're here to talk about GPGPU, which is a ridiculous mouthful and immediately leads you to things like this. What that stands for is general-purpose computing on graphics processing units. You can simplify this; I tend to say that great programs get performance upgrades, because you write them in a very nice way and then you get to exploit some new resources that we don't usually use. And we're talking about that in the frame of Haskell. I know this is a very mixed conference, so I want to get a feel for the room; I hope your arms aren't tired. Who is a Haskeller here? I would assume it's probably all of you. Fantastic. Has anyone got any experience with general-purpose GPU programming? Wonderful, you two probably know more than I do. And to go further: has anyone done any GPGPU in Haskell already? Okay, cool, so you're my wingman. If I say something stupid, you've got to call it out, right? So: GPU, using the graphics processing unit for more than just graphics. Why would we do something like that? We would do that because Moore's law is failing, or, depending on how controversial you're feeling, has already failed. Moore's law, for anyone who doesn't know, is basically the idea that the number of transistors on a chip will double every two years. By Intel's estimates, their transistor efficiency nowadays is more than 90,000 times greater, and the cost is 60,000 times less, than what it was in 1971. So just think about that for a minute. That's a single transistor, 90,000 times more efficient than it was in 1971.
Apply that kind of improvement to a car or a plane and you get insane figures. And that's not even the whole chip; we're talking about a single transistor on a chip which has billions of these things. It's just an insane speed improvement. So the fact that this exponential gain has completely fallen over is a big issue, and it's throwing a lot of people. Currently we're down to 14 nanometres. It's hard to get much smaller, because you start to get massive heat problems, and then you've got cost. The decrease in cost is also part of Moore's law; it's not just the performance increase, it also predicts a decrease in cost, and you start to lose that economic factor. It starts to become less and less worth it to pack in more and more transistors. And think back to, say, five years ago, to when you upgraded from one device to the next. I had an HTC Hero in 2011, and you can tell it's 2011 because HTC was still relevant. With that HTC Hero, when I upgraded the next year to a Desire, that was a mind-blowing speed upgrade. It was feelable; you could notice it. Whereas nowadays, when you upgrade from one phone to another, or from one system to another, you just kind of get new features, new designs, new software, but you don't really get that noticeable speed improvement. Typically, the biggest speed improvement you can get now is when you're actually changing mediums. For example, the only really noticeable speed improvement you can get in a laptop nowadays is when you go from a hard drive to a solid-state drive. That's when you're going from a physical spinning disk of metal to transistors. From transistor to transistor, you don't tend to get the wow moment now. So we're talking about an incremental versus an exponential increase.
We're starting to get towards the incremental. And this is a big problem, because, and many jokes have been made about this, although our computers have gotten exponentially faster, they don't feel faster. A machine from 1995 probably feels the same as the one you're using now. If you're using Windows, I feel sorry for you anyway. But the fact is that users' computing is getting more advanced: their needs for computing are getting more advanced, and they're growing more dependent on it. So we're making bigger software, and that's taking up the resources we're generating. If we start generating fewer resources, that growth in their need for resources isn't going to stop, and that leaves us with only a couple of things to do. How do we get the resources we need to keep solving problems with computers? Well, the first solution is we can all go back to basics, dump all our abstractions, go back to assembly, and start rolling around like worms in the mud trying to suck every last bit of nutrients out. I'm personally not into that, and I don't think many people are drawn to it; you're a functional programming conference, you're obsessed with abstractions. The other option is to go big: to build more data centres, to start offloading local computations to the cloud and doing them on bigger machines in bigger places. This, quite frankly, is destroying the planet. Quite a lot of the big players, Google, etc., are trying to be really responsible with their data centre designs. Facebook open-sourced their efficient data centre design and is trying to push it forward as a standard: you save energy, you save water, and so on. But it's not going far enough, and if we keep going this way, we'll soon look like that. So we have another solution. Our devices now have a lot of unused resources, and one of them is specialist processors.
So: every device now has a wide variety of processors, and GPUs are one of them. We call these heterogeneous systems. These are systems that use more than one type of processing chip. The Wikipedia definition is typically wordy, but the important part is that these systems gain performance not by adding more of the same type of processor, but by exploiting the specialities of the different ones. And GPUs are optimised for a task, and that task is obviously video and graphics processing. But we can, and I think I made up a word here, I don't know if it's a real word, genericise this task. We can take what's special about that task and find it in other places; you can identify the same pattern in your general computation. So what's special about the computations that graphics processors do is that they're single instruction, multiple data. What graphics processors are optimised to do is take a large set of data and apply a single instruction to it, over and over again. This is a type of parallelism: data parallelism. We're talking about having huge data structures and being able to apply an operation to every element of that data structure, all at the same time. And that's possible because GPUs have hundreds of cores that are optimised to do this. If we talk about one of the architectures, NVIDIA's Tegra, they have 192 cores in a single chip. That is an insane amount if you compare it to CPUs, where you've got, like, eight cores. Yes, on a CPU you can do more than one operation at a time, on a map, for example, but this is an insane increase in the amount of resources we're able to use for this task. And you do it in parallel, which is one of the really important things: we're not really exploiting the number of processors we have. Increasing the number of processors is one of our only options, and we're not really using it. That's because it's hard.
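That single-instruction-multiple-data pattern is exactly the shape of an ordinary map. A minimal plain-Haskell sketch (the function name and values are mine, just for illustration): one operation applied uniformly to every element, where the GPU version would simply run each element on its own core.

```haskell
-- Data parallelism in miniature: one operation, applied uniformly to
-- every element of a large structure. On a CPU this runs sequentially;
-- a GPU with hundreds of cores can process many elements at once.
brighten :: [Int] -> [Int]
brighten = map (\pixel -> min 255 (pixel + 40))
```

Because each element is processed independently of the others, the order of evaluation doesn't matter, and that independence is what makes the pattern safe to parallelise.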
Well, frankly, think about going back to low-level abstractions, think about the 80s and 90s when we were all using C and assembly. Doing stuff with low-level memory was dangerous, so we built abstractions to stop ourselves doing it. When you talk about concurrency and parallelism, there are similar dangers in managing memory across multiple processors and across these data stores, and we don't have the abstractions to make that safe, to make it not dangerous, and it's a lot of hard work. So there are some reasons why we haven't been able to exploit this properly. But if you're sitting there thinking that this sounds doable, now that you recognise the pattern, it should, because of course this is a map. We're taking a data structure and we're doing one operation across every element of that data structure. So why aren't we all already programming this way? Outside of the actual use case of graphics, it's because there's been no support for it. Up until about 2001, it wasn't actually possible to do this at all, and after 2001 you could, but you had to reformat your code in terms of the graphics primitives, which no one really wants to do, because that's horrible. But then came CUDA and OpenCL. These are two different, kind of competing, standards: CUDA is proprietary, made by NVIDIA; OpenCL is an open standard, it has a driving committee, and so on. They expose APIs, they expose the primitive operations, and they expose the primitive data types, so that we can start exploiting GPUs for general-purpose programming. Unfortunately, they think C, C++ and Fortran are high level. I'm not even joking here: the NVIDIA documentation says, roughly, we don't do anything low level, we've got high-level languages, and then lists those three. That is absolutely terrifying.
Again, I want no part of that. So it's up to us to build abstractions, and as Haskellers especially, we love a DSL. Haskellers love a good domain-specific language; everything we do has to be its own language. And in particular, Accelerate is an embedded domain-specific language. So, onto the meat, then: Accelerate. This started off being made by a team down at the University of New South Wales. They've actually just had YOW!, a conference very similar to this one, down there. They're a really talented team, Manuel Chakravarty and colleagues; they do some incredible stuff, and Accelerate is one of theirs. Accelerate is an array language: all of the data types we use are arrays, and we work over those arrays. So, just a quick preamble. You probably all know how to go and get Haskell packages. This is no different: cabal install, and then swear for four hours at the dependencies. Though Accelerate tends not to be too bad, and I think it's also in Stackage. What Accelerate does is take Haskell and generate CUDA code. They do have other backends: there's an OpenCL backend, there's an LLVM backend, and they're working on some crazy ones. I think someone made some insane prototype Bash backend, for some reason, I don't know why. But for testing purposes, if you don't have a GPU, or you just don't want to have to offload everything to a GPU all the time, they also have an interpreter. If you don't have a GPU and you just want to experiment, say you're trying this now while we're talking, that's what you want to use. So, once you've installed it, you just fire it up and import as normal.
We do qualified imports here, because as I've already hinted at with map, some of the operations available on Accelerate arrays clash with the Prelude. As I said, all computations are over arrays, and arrays in Accelerate are a bit different from what you might expect. They have two constructors, for a start, and those constructors are a bit weird; I'll go into why in a minute. The array type is based on shapes and elements. Shapes are what we'd usually think of as dimensions. If you're used to ordinary arrays, you might think of a two-dimensional array as an array with another array inside it. Here we can't nest arrays. Instead we have this special shape constructor, which tells Accelerate how to deal with the dimensions of the array. Then there are the elements. Again, because we're working on GPUs, there are restrictions on what we can use; GPUs are only optimised for some kinds of computations. So we can have various widths of ints, floats, and tuples. One important thing to mention here: when you're using CUDA, it's fairly standardised across all the NVIDIA chips, so you're going to get most of the performance. OpenCL, on the other hand, is available on a much wider range of graphics processors; it's much more widely applicable, but because of that, the backend can't do device-by-device optimisation. Some graphics processing units work better with some data types than others. So if you're using the OpenCL backend and you're seeing that you're not getting the performance upgrades you'd expect, it's worth going and checking your data types; it can be something as simple as that. So, when we get to the constructor for an array, you can see we've got, as I said, the shape argument and the element argument. And when we actually want to construct a shape, we start to get into the witnesses. So, this is our first constructor.
So, shapes have two constructors. This is the first one, Z. It indicates an array with zero dimensions, so this is a scalar: it just has one element. When we get to the next one, we want to start building multi-dimensional arrays, and as typical Haskellers, we make up a random infix symbol. I haven't thought of a good name for this one yet; if anyone has one, please shout it out. And this is the witness I mentioned: these constructors are both type and value constructors. That can get a bit confusing when you're reading the docs, and you're like, wait a minute, why is that there? But it's also quite handy, in that you know exactly what the symbol you're using should be doing. So, Z covers the zero-dimensional scalar. If you want to build up one dimension, you take Z, the funny symbol, and an Int. So this is a one-dimensional array indexed by Int. We can actually only index by Int, so making it explicit seemed kind of redundant to me at first, but it makes more sense when you remember that this constructor is a value constructor as well, and we need that value later to signify the length of the dimension. So this one we'd call a vector. If we want more dimensions, we just add them on, indexing the next one by Int as well. The important thing to note here is that this is left-associative, and the Z part is also on the left. That becomes important later, when we talk about the order the dimensions are applied in. So this one is obviously two-dimensional; we'd call it a matrix. I imagine you're thinking right now that building complex arrays and complex structures out of these is going to be a bit painful. Luckily, some helpful type synonyms are provided.
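To make the type-and-value-constructor trick concrete, here is a self-contained reconstruction of the two shape constructors. This is a simplification of Accelerate's real Z and :. (the library's versions also hang off a Shape class); the names mirror the library's, but the helper totalSize is mine, purely for illustration.

```haskell
{-# LANGUAGE TypeOperators #-}

-- A stripped-down version of Accelerate's shape constructors.
-- Both Z and :. exist at the type level *and* the value level.
data Z = Z deriving (Show, Eq)

infixl 3 :.
data tail :. head = tail :. head deriving (Show, Eq)

-- The synonyms build dimensions up on the left, indexed by Int:
type DIM0 = Z
type DIM1 = DIM0 :. Int
type DIM2 = DIM1 :. Int

-- A 3-by-5 shape: the value constructor carries the dimension lengths.
matrixShape :: DIM2
matrixShape = Z :. 3 :. 5

-- Pattern matching on the value recovers the lengths (left-associative,
-- so this is really ((Z :. h) :. w)).
totalSize :: DIM2 -> Int
totalSize (Z :. h :. w) = h * w
```

The same symbol appearing in both the type (DIM2) and the value (Z :. 3 :. 5) is exactly the "witness" business from the talk.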
We've got the dimension synonyms here on the left, and on the right we have the scalar and vector ones. Of course, you can define your own if you need to. So, when we're playing with the interpreter, we typically aren't really building arrays with these constructors directly. Accelerate provides operations that let us make these arrays from data structures we're more familiar with, such as lists. We have this fromList operation. You do have to put the type annotation there, otherwise you get horrible errors. And again, as I mentioned, the shape constructor is a value constructor as well, so you have a copy of it here. So this is what the result will be, on the right: we have a one-dimensional array of length 10, and then obviously we have the elements in it. Now, what if we were to do something two-dimensional, if we added an extra dimension here? What would we get? We would get a grid. If we were to write, for example, Z, funny symbol, 3, and then 5, this would create a multi-dimensional array, but it builds it from the right. So in a two-dimensional array you've got your x and your y dimensions: the 5, the rightmost one, would be our x, and the inner one would be our y. That's not very intuitive; it's just how it deals with it. And what do you get out of it? In this form, the resulting data structure doesn't reflect that dimensionality; it doesn't create a nested list or anything like that.
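The "rightmost dimension is innermost" rule follows from arrays being stored as one flat buffer, with the shape only used to compute positions. A small sketch of that index arithmetic (the function name and the row-major layout shown are my illustration of the idea, not Accelerate's actual internals):

```haskell
-- For a shape like Z :. 3 :. 5 (3 rows of y, 5 columns of x, with x the
-- rightmost/innermost dimension), element (y, x) lives at position
-- y * width + x in the flat buffer.
flatIndex :: Int          -- width: the rightmost dimension (5 here)
          -> (Int, Int)   -- (y, x) index into the grid
          -> Int          -- position in the flat storage
flatIndex width (y, x) = y * width + x

-- A 3-by-5 "grid" flattened into a single list of 15 elements.
grid :: [Int]
grid = [0 .. 14]
```

So the nested structure you might expect never exists; only the shape value knows how to slice the flat data.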
What Accelerate does on the backend is use this constructor to decide how to deal with the indices and how to treat them; it doesn't actually reflect that dimensionality in the data structure. So, as I said, none of this is really taking place in the Accelerate DSL yet. To do that, and to actually start generating code, we need to use run. Run takes some arrays, or rather, it takes an Acc of arrays, a computation in the context of Accelerate, and produces another array. And if we have an array and we want to put it into the Accelerate context, we have to use use. This is a simplified type for use; we'll get to the full one in a bit. It does just what you'd expect: you have some arrays, and it puts them into the Accelerate context. So let's look at a familiar operation and how it works here, for example, map. We have some slightly unfamiliar type classes at the top. Shape, we've spoken about. Elt: this is an element. Obviously, you're going to have arrays and you're going to have elements. One of the things that Accelerate does really well is make sure that your code stays data-parallelisable, so that it stays in a flat structure where you've got multiple pieces of data that the same operation can be applied to. If you were able to nest arrays, for example, if you were able to have an element that was itself an array, you would break that property. So elements, the things that go inside the arrays, are treated slightly differently from the arrays themselves.
We then also have Exp. These are operations that act on an individual data element of the array, instead of the array itself. So, in Accelerate terms: Acc is the context in which operations run on arrays, and Exp is for operations running on individual elements. So when we use map, we have an array, which we've defined before using fromList or some other constructor; we apply use to bring it into Accelerate; we perform our map operation across it; and then we run the whole thing. That will go off to whatever backend you're using; run is actually exported by all the backends. It's exported by the interpreter, it's exported by CUDA, it's exported by OpenCL. It will go off and produce what you'd expect, and then we get our result. So, I've kind of covered Elt now. This is the more complete type for use. As you can see, this makes the use of Elt explicit: the element obviously has to be the second parameter of the array, because that's what we'd expect, but by doing that, we stop ourselves putting arrays into our elements, and stop ourselves nesting dimensions. And the same with Exp. So, as I mentioned, Acc is a computation over an array, and Exp is a computation over an element. Say I want to produce a unit, for example. And I'm going to close this door that I keep kicking. With unit, we have just a single computation: we have one element, we're applying a computation to it, and it puts the result into an Accelerate context, wrapping it in a scalar array. So that's an array of shape Z, an array of no dimensions. So far I've really only shown you how to create arrays from the constructors and then pull them in using use, but obviously that can become tiresome. We can create arrays inside the Acc directly. There are a lot of ways of doing this; I'm only going to show you the simplest one. Oh, hello.
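Putting use, the array operation, and run together: the commented lines below sketch the Accelerate spelling (which would need the accelerate package plus a backend such as the interpreter, so treat them as a sketch, not verified against a particular version), while the list functions underneath mirror each stage and actually run anywhere.

```haskell
-- Accelerate version of the pipeline (assuming the accelerate package
-- and its interpreter backend):
--
--   import qualified Data.Array.Accelerate as A
--   import Data.Array.Accelerate.Interpreter (run)
--
--   vec :: A.Vector Int
--   vec = A.fromList (A.Z A.:. 10) [0 .. 9]  -- plain Haskell-side array
--
--   answer = run (A.map (+ 1) (A.use vec))   -- use embeds, run executes
--
-- The same pipeline with plain lists, stage by stage:
xs :: [Int]
xs = [0 .. 9]            -- stands in for fromList (Z :. 10) [0 .. 9]

result :: [Int]
result = map (+ 1) xs    -- the A.map step; use/run have no list analogue
```

Note how the backend choice lives entirely in where run comes from; the expression being run stays the same.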
I just pressed the screen. I want to show you the simplest one, and that is fill. With fill, as the name says, you're filling an array up from some constructor data. We have an expression for a shape, we have an expression for an element, and the resulting array lands in the Accelerate context. How you'd typically use fill is that the first argument would be the value constructor applied to something, a Z applied to your dimensionality, or you could compute that dynamically with some other clever code. Then you give the expression that generates the elements, and the whole thing is wrapped in Acc. So, we can see some dot product. I don't actually think I have fill on here. Apparently I don't show you how to use fill, I just go straight into it, sorry about that. We'll have to go through that and the more complex ones later. So yeah, this is dot product. A very simple operation; this is one of the simplest operations you'd want to perform on a GPU. Here we're just taking two vectors and whacking them into the Accelerate context; you can see we do that with these two uses here. This is kind of using the external DSL rather than the internal one, as we did earlier. We just take our two vectors, apply use to both of them, and then we can use fold. The operations you have on your arrays are very similar to the operations you'd expect to have on lists: you have folds, you have zips, you have map. And once we're writing things like this, we can offload to the GPU. So, I think that got us to about 25 minutes. As I said, this is very light, much shorter than I tend to be, and we've barely scratched the surface of what this library does. Every time I use it, I get bitten in the face by some new functionality I didn't realise was there. It's a very deep and huge API.
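The dot product just described, with the Accelerate spelling in the comments (a sketch assuming the accelerate package; dotp is the conventional name in Accelerate's own examples) and a runnable list mirror underneath: zip the two vectors together with (*), then fold the products with (+).

```haskell
-- Accelerate version (a sketch, assuming the accelerate package):
--
--   dotp :: A.Vector Float -> A.Vector Float -> A.Acc (A.Scalar Float)
--   dotp xs ys = A.fold (+) 0 (A.zipWith (*) (A.use xs) (A.use ys))
--
-- The list mirror of the same computation:
dotp :: Num a => [a] -> [a] -> a
dotp xs ys = foldr (+) 0 (zipWith (*) xs ys)
```

In the Accelerate version, use brings both vectors into Acc, zipWith runs element-wise in parallel, and fold reduces the products; folding a one-dimensional array leaves a zero-dimensional scalar array, which is why the result is a Scalar.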
It's also very similar to Repa, which is another array computation library from the same people. It's also a DSL, and the API is basically identical, but Repa is more applicable to things you do on your CPU. Some really great resources for learning more about this: anything by Simon Marlow. Simon Marlow wrote a fantastic book, and there's a chapter on this in Parallel and Concurrent Programming in Haskell. He also did a talk on it, and there have been more recent talks too; there are a couple on YouTube from various conferences. So yeah, sorry that came in under time. If you want to come grab me and talk to me about anything, please do. I've been Joe Nash. Thank you very much.