So this is the problem: around 2004, computers basically stopped getting any faster, and this continues to be true. What manufacturers can do is keep cramming more and more cores into the same amount of space, so we started seeing dual-core laptops, quad-core laptops, GPUs, and so forth. The reason for this flattening was physics — thermodynamics. There's a plot of the power dissipation of a CPU as a function of time, in watts per square centimeter — the amount of heat a CPU dissipates — with a few common items marked along it. Right around 1998 a CPU ran about as hot as a stove; around 2004 they were reaching heat densities on the order of a nuclear reactor; and by 2008 they were approaching the heat dissipation of the surface of a rocket engine. When they hit that limit they had to stop making them any faster, because we simply don't know how to manufacture materials that can sustain those temperatures — anything we make will melt. That's why on a modern CPU, if the fan fails, the CPU has about one second to shut itself off before it damages itself. If the fan attached to a CPU isn't actually blowing, thermal fuses inside the chip try to shut it down instantly, because within less than a second the heat will cause physical damage to the circuits — they're packed so close together that they will simply melt. So this was the end of the free lunch. For many years — I started my PhD before this — we operated in the mode of "the code is too slow? Who cares, do something else, wait two years, the next generation of computers will be faster and you'll just run it again." That's gone now. Instead they're giving us multi-core chips, GPUs, clusters, clouds, and so on.
We are basically being forced to look into using parallel resources whether we like it or not, because waiting for the machines to get faster is just not going to happen, and there's no solution in sight. No one has come up with anything: these slides are from the mid-2000s — this plot only had numbers out to about 2008 — and discussions along these lines started in the early 2000s, and in the intervening eight years or so nothing of use has happened on that front.

We've already talked about the complexity of an algorithm, but before we start looking at using clusters and seeing where we can put our codes, there's one important piece of the puzzle you need to understand. So I want you to try a short exercise for a couple of minutes: derive and plot what is known as Amdahl's law. Amdahl's law is an upper bound on the speed-up you can achieve when you parallelize a code. Imagine you're writing a code, and of the total time it takes to run, a certain fraction s cannot be parallelized at all — maybe it includes reading an input file you only have access to on one machine, or opening a database connection, or computing some quantity in a way you don't know how to parallelize. The rest of the time, the fraction 1 - s, you can spread out over all the machines. Now imagine you have p processors in total and you run on them, parallelizing the parallel part perfectly. The total runtime on one processor is the sum of the two parts — call it 1, in whatever units. How much faster can you make it, as a function of p? It will take you just a couple of minutes; think about it, and if you can, write a little Python function to plot it. We'll see the solution in a minute.

[In answer to a question:] No — in practice it's an aggregate. What actually happens is a little bit of serial work, a little bit of parallel work, serial, parallel, and so on; s and 1 - s are the cumulative fractions of the serializable and parallelizable parts. Imagine the parallelizable part parallelizes perfectly, spreading to all p machines with no communication overhead — the simplest analysis you can do. Think about it for a second and try to plot it; this is a quick pencil-and-paper plus plotting exercise. Does anyone have a plot?
I'll give you one more minute, and then we'll have a look at the solution. What you want to plot is the potential speed-up as a function of the number of processors, for different values of the serial fraction, and see how they compare.

So let's look at the solution. With one processor, the time is the sum of the two parts: 1, in whatever units — one hour, it doesn't matter, because we're going to make a relative measurement. With p processors, the serial fraction still takes s, because by definition that's the part you can't parallelize, while the parallel fraction takes the remainder, 1 - s, spread over p processors, so (1 - s)/p. The total speed-up is therefore 1 / (s + (1 - s)/p), which you can show very simply is always less than 1/s. So the serial fraction is a really hard limit, and you'd be surprised at what it looks like. Here's a little bit of code to plot it — I'll put these slides up later, and a sketch of that code follows below. Let's look at the plot. If your serial fraction is one half — if you can only parallelize half of your code — look at what happens as you add processors. This is a logarithmic plot: I'm scaling the number of CPUs as powers of two, not linearly, and very quickly the curve flattens out completely; from a certain point onwards, adding processors doesn't give you anything, because you top out very rapidly. The upper curve is a serial fraction of just 5% — 95% of your code parallelizes really well — and still the maximum speed-up you can get is a factor of 20, and you only get that with 2^10 processors, which is a huge amount.

So understanding how your code parallelizes is very important, because CPUs are the resource you pay for: somewhere, in some way, throwing more CPUs at the problem is going to cost you, or your advisor, or somebody, money and resources. Is it worth paying for or not? Before you say "I'm going to run this on 10,000 nodes," stop for a second and try to understand whether it makes sense. Maybe it does — maybe your problem really is ridiculously parallelizable, each chunk of analysis takes a long time, and the startup cost is negligible, so it does make sense. But at least try to understand what situation you're in first, because this completely trivial five minutes of napkin math shows how easy it is to waste money when parallelizing code.

Kathy Yelick from EECS on campus, one of the people who has done the most thinking about parallel computing in the world, has a very nice summary of the key issues to keep in mind when you think about parallelizing code. First is finding enough parallelism in the first place, because of what we just saw. Second is understanding the granularity of your work: what are the pieces you chunk up and send out to the parallel resources? None of these are prescriptions — they're guidelines and things to keep in mind.
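Here is roughly the plotting code referred to a moment ago — a minimal sketch, assuming only NumPy and matplotlib, of the speed-up S(p) = 1 / (s + (1 - s)/p) for a few serial fractions, with processor counts taken as powers of two:

```python
import numpy as np
import matplotlib.pyplot as plt

def amdahl_speedup(p, s):
    """Upper bound on speed-up with p processors and serial fraction s."""
    return 1.0 / (s + (1.0 - s) / p)

p = 2 ** np.arange(11)                  # 1, 2, 4, ..., 1024 processors
for s in (0.5, 0.25, 0.1, 0.05):        # serial fractions to compare
    plt.semilogx(p, amdahl_speedup(p, s), "o-", label="s = %.2f" % s)

plt.xlabel("number of processors")
plt.ylabel("speed-up")
plt.legend(loc="upper left")
plt.show()
```

Even the best curve here (s = 0.05) saturates near 1/s = 20, which is exactly the hard ceiling described above.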
Granularity in particular is much more art than science: you want small chunks so you can keep many resources busy, but they have to be big enough to hide the overhead of setup and communication, so it's always a balancing act. Locality means that you typically have memory, and access to data, that is fast but limited in size, and larger pools of data that hold more but are slower. The classic example is inside your own computer: your CPU has a cache, a little bit of memory built right next to the processor that you can't even access explicitly — extraordinarily fast, but very small, and managed by the processor itself. Then you have the main memory of the machine, maybe two or four gigabytes on a modern laptop, which you can access directly; and then your hard drive, maybe a few hundred gigabytes, but much, much slower. You always have to think about this when you do parallel computing: where is the data? You need to bring the computation closer to the data, because moving data is very expensive. In the cloud exactly the same thing happens — you always have to worry about putting the computational resources close to where the data is, because data transfer times will typically dominate anything you try to do. You also have to find ways of balancing the load: if you're spreading out computations and they don't all take the same time, you have to worry about load balancing so you're not paying for a thousand nodes while one of them is working and the other 999 are doing absolutely nothing — guess what, you're still paying for all thousand of them. Then coordination and communication: often communication is the most expensive part of solving a problem, but obviously getting the wrong answer very fast is not good enough, so you have to coordinate correctly. And very importantly, whenever you start optimizing parallel code, you have to measure. Your intuition is really never going to be very useful in parallel computing. Optimizing serial code by eyeballing where the slow parts are is already a bad idea; in a parallel scenario, forget it — it's not even worth trying, because the realities of how communication and computation overlap are too hard to intuit in anything but the most trivial cases. Measure before you do anything.

For those of you interested, there's a very nice article — I'm not going to spend time on it — called "The Landscape of Parallel Computing Research: A View from Berkeley." It's a white paper written a few years ago at Berkeley that analyzes the main mathematical ideas and classes of algorithms across scientific computing that are parallelizable. If you're interested in this topic, you should read it.

So what about Python?
Python has multiple implementations of its virtual machine, and Python has threads, but we're going to focus on CPython, the one you're using. The take-home message when working in Python is that even though threads are supported, for you they are of very limited use, because CPython has something called the global interpreter lock — the GIL — which means that only one thread at a time can modify Python data structures. You can have ten threads, or say four threads on your CPU, running, but only one of them at a time is actually able to modify anything visible in Python.

So you might ask: what the hell are threads in there for, if only one can run at a time? It seems kind of silly to have them. Can somebody think of a good reason they're useful, even though only one of them can modify Python-visible data structures at a given time? [An audience answer.] That's perfectly correct — and the catch is that they have to keep their own data outside of Python, because only one of them is allowed to execute any Python at all. So in what scenario are they useful? When you're waiting for data coming from somewhere else. If you start threads that are waiting for data from the network, for example, they work very well, because those libraries are written so that the buffer where the network data lands isn't visible to Python until some later operation, and in the meantime the threads can keep receiving data. You could imagine writing a web-crawling process that spawns, say, ten threads with connections to ten different servers and waits for the data to come back; because there's so little CPU work involved, Python switches between the threads, and as the data comes in it gets aggregated, while in the background all of them keep buffering data — I'll show a small sketch of that pattern in a moment. That's about the only scenario where threads are useful in Python.

The reason the GIL is there is simplicity of implementation. People have tried to remove it from CPython, and they have succeeded in removing it, but at the price of a major slowdown of all serial operations. Removing the GIL is extremely complicated. Java doesn't have one, and Jython — the implementation of Python running on the Java virtual machine — doesn't have the GIL either, but that's because Sun invested probably billions of dollars paying people to find ways of locking, in an extraordinarily complicated way, all throughout the Java virtual machine: instead of one big lock there are locks for every data structure everywhere, acquired and released over time as threads are switched, protecting different pieces of the interpreter without a single lock protecting the whole thing. That is a very, very complicated thing to do; it took Java years and hundreds of engineers to get there. Nobody has had the resources to do that for Python, and the people who have tried with limited resources have been able to make it work, but with such catastrophic performance in the normal serial case that the community has simply said: we're not willing to pay that.

So what are the solutions? You can do in-process parallelism in Python, but as we said, you have to mind the GIL. You can use libraries that are threaded themselves and do data parallelism internally.
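Here is that sketch — a minimal, hypothetical example (the URLs are placeholders, not anything from the lecture) of the one pattern where threads do help in CPython: each thread spends nearly all of its time blocked on the network, where the GIL is released, so the downloads overlap even though only one thread at a time runs Python code.

```python
import threading
import urllib.request   # Python 3; the 2012-era equivalent was urllib2

results = {}
lock = threading.Lock()

def fetch(url):
    # The read blocks on the socket with the GIL released, so many of
    # these calls can be in flight at once.
    data = urllib.request.urlopen(url).read()
    with lock:                       # touch Python objects only briefly
        results[url] = len(data)

urls = ["http://example.com/page/%d" % i for i in range(10)]  # placeholders
threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```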
For example, NumPy and SciPy can be compiled against the threaded version of ATLAS. There's a library I'll mention briefly, numexpr, which is like a little VM for NumPy expressions — a very elegant hack just for NumPy expressions that can actually work very well. There's a library called Theano that I'm not going to go into, but if you have a CS background you may want to look into it; it's a very interesting project out of a machine-learning lab in Canada. There are GPU solutions that Paul is going to mention later. And finally you can write your own threaded code — you can do what you were suggesting, which is basically write a bunch of threads, do the work over there, and then come back to Python — but then you have to write all of that code in pure C by hand, with Python not seeing it, so it gets really cumbersome and complex.

Or you work with multiple processes. Now you're using multiple Python interpreters, so it doesn't matter that each one of them has a GIL, and you send messages and data back and forth. Python has two relevant modules: multiprocessing, which is built into Python itself since 2.6, and a newer, somewhat higher-level one in Python 3.2 called futures — I mention it so you know about it, but it won't matter for you for now. There are other external libraries, and we're going to see what IPython offers; here I'm obviously biased, we wrote it, but we think it does a good job.

Multiprocessing is built into the language, and its calling syntax follows very closely that of the threading module, but it uses processes instead of threads — the idea being that if you already knew how to do something with threads, it should be easy to switch to multiprocessing. It exposes a number of useful things. It exposes the concept of a Process, which is simply something you give a function as a target; you call start, it goes off and runs, and at the end you call join to wait for it to finish, so you can start them in a loop. You can also have a pool of these things and pass them functions to execute — there's a short sketch of both right after this. The documentation for it is fairly good, so I just want to point you to it, and in the homework you'll have a chance to play with it a little bit.

With NumPy, it's important to ask: what do you get out of the box? Not a whole lot. You can link NumPy and SciPy against the multi-threaded ATLAS or against the Intel Math Kernel Library (MKL), which is itself multi-threaded. Those of you running Enthought's EPD are using that, because EPD ships with versions of NumPy and SciPy compiled against Intel MKL, so at least a lot of the linear algebra will be done in parallel if you have a dual-core or quad-core laptop. But that's about it. Beyond that you can write manual OpenMP C and Fortran threaded code; it can be done, but it really is a pain.
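Here is the sketch mentioned above — a minimal, made-up example (not the slide's code) of the two multiprocessing pieces just described: a single Process with a target function, and a Pool that maps work across several processes, each with its own interpreter and its own GIL.

```python
from multiprocessing import Process, Pool

def work(x):
    # Stand-in for a CPU-bound computation.
    return x * x

if __name__ == "__main__":
    # A single process: give it a target, start it, join to wait for it.
    p = Process(target=work, args=(10,))
    p.start()
    p.join()

    # A pool of worker processes: map a function over a list of inputs.
    pool = Pool(4)
    print(pool.map(work, range(8)))   # -> [0, 1, 4, 9, 16, 25, 36, 49]
    pool.close()
    pool.join()
```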
Numexpr is a nice expression compiler for NumPy: it takes NumPy expressions, analyzes very carefully how to chunk them up, and then sends the chunks to multiple threads. It uses fewer temporary data structures and it can be very cache-friendly. Using it is fairly simple: it's just a Python library, you import it, and then you call evaluate, giving it the expression as a string. Evaluating "a + 1" this way has the same effect as writing a + 1, but it does the work behind the scenes using its own machinery. So if you find yourself with a code whose bottleneck is an expression that's just a bunch of NumPy algebra, this may be worth looking into, because it's so easy to use: you change your code from writing the expression directly to calling evaluate with the expression in quotes, and that's it — otherwise the code stays the same. You can also control how many threads it uses. In this example the original expression took 36 milliseconds, and with four threads on this machine it took about 4 milliseconds — a speed-up of roughly a factor of eight. The reason the speed-up is greater than the number of cores is that numexpr is not only using all the cores, it's also using them more efficiently: it isn't creating the internal intermediate temporaries that NumPy creates. When NumPy evaluates an expression like 2*a*b + b**2 + a**2, it does a times b and stores that, multiplies that by two and stores that, does b squared and stores that somewhere, adds it to the other term and stores the result, does a squared and stores that, and then adds the last two to give you the final answer. So something like six temporary arrays get created when NumPy evaluates this, and numexpr avoids them.

[Question: why isn't this just how NumPy works?] Because you're forced to write your code inside strings. In Python — Python is not Lisp — we don't have access to enough of the compiler machinery to transform how large expressions like this are evaluated, so a library can't do that transparently inside NumPy. It can be done in C++ with expression-template libraries such as Blitz++; it can't be done in Python by default.
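In code, the switch being described looks roughly like this — a sketch with made-up array sizes, using the numexpr calls mentioned here (evaluate and the thread-count control):

```python
import numpy as np
import numexpr as ne

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Plain NumPy: each intermediate (a*b, 2*a*b, b**2, ...) is a full temporary array.
plain = 2*a*b + b**2 + a**2

# numexpr: the same expression written as a string; it is chunked up,
# evaluated cache-friendly, and spread over several threads.
ne.set_num_threads(4)
fast = ne.evaluate("2*a*b + b**2 + a**2")

print(np.allclose(plain, fast))   # True: same result, without the temporaries
```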
Having to write the expression inside a string is the main reason this hasn't simply gone into NumPy, but people are beginning to play with the idea of extending NumPy so that you could write the expression normally and have it return not an array but effectively a delayed version of it. Say you call the result c: at some later point, when you need it, you would call something like c.compute() or c.evaluate() and that would trigger the actual computation. That way the library could store these expression trees and understand them in ways it can't in the normal evaluation model. People are looking into that, and I think there's a good chance we'll see more work in this direction in the next few years.

With IPython, what we try to do is make it possible to parallelize your code while retaining the interactive feel of IPython: being able to type some code, send data to your engines, make them do some operation, and get data back from them. We think the APIs are succeeding in that regard: we want the easy things to be very straightforward while still giving you enough rope to do complicated things. Importantly, we want the process of doing parallel computing to be possible in a collaborative manner, so that you can work with other people on large-scale problems — we'll see a little later how we achieve that; it's basically an interplay of these APIs with the notebook. We have a very dynamic model for load balancing that makes it fairly easy to get effectively automatic load balancing in many codes. IPython by its nature integrates with other tools, and importantly it integrates with existing threaded and MPI libraries, so it's possible to run an IPython cluster that handles part of the coordination of your code while the engines themselves run MPI code, because you may have a piece of your code that is done with MPI.

The basic architecture is that instead of running a single IPython that you type code into, you run N IPython engines that all sit behind what's called an IPython controller. This group — the controller plus the engines — is called an IPython cluster, and you connect to the controller with clients. More than one client can connect at the same time, which means you can collaborate with someone else: two people can join the same cluster and look at its results. You can connect and disconnect, and as long as you haven't shut the cluster down, it stays there. So you could imagine starting your cluster and doing some computations from the office, going home, reconnecting, querying some data for analysis, letting it run for some more time, and so on.

So how do you use it? We'll see more of this when we move to the hands-on work, but I wanted to show it at least for reference on a multi-core machine. Whoops — I need to update this slide before I upload it.
It's missing the word "start" right there: you say ipcluster start, tell it how many engines you want, and it will start a local cluster — you basically don't have to set anything up. You can also start it with MPI, or with other queuing systems. In fact, what we have with StarCluster does something like this: it starts the engines through Sun Grid Engine, the queuing system StarCluster sets up on the Amazon nodes, and StarCluster configures everything for this to happen behind the scenes, so with one command you can get a preconfigured cluster on Amazon.

Your laptops are probably at most dual-core — a handful of you might have quad-core laptops, but that's probably the most you have. Eventually you might be in a situation where you want more than that, say 32 or 64 cores. Because of Amdahl's law, don't think about running codes on thousands or tens of thousands of cores quite yet: chances are you don't have a problem that will tolerate running efficiently in that regime. But it's very possible that you have a problem where 32 or 64 cores are completely worth it, and that's the regime where Amazon can be useful. You can start up to 20 instances; if you start the largest ones, the eight-core instances, I think you're limited to eight of them at a time unless you ask Amazon to raise your cap — but still, eight eight-core instances is 64 cores, a decent amount of CPU power that you can muster with basically one command. And the beauty of it is that you can start it, have it running for the time you need, and then stop it when you're done, and that's it: you didn't have to buy anything, you didn't have to set anything up, you didn't have to talk to a sysadmin. You initiate it, run your analysis, shut it down, and you're done — and for something like that you may end up paying just a few dollars.
Running an analysis on eight eight-core machines — I think they cost on the order of a couple of dollars an hour each, so eight of them would be maybe sixteen dollars an hour — and say your analysis completes in four to five hours: that's roughly the cost of having a poster printed. That's a reasonable expense for a conference: you run a big analysis beforehand, it costs you 60 or 80 dollars, you print your poster, and you're done.

What StarCluster does is spare you the Amazon web interface you saw earlier — which is fine and offers a lot, but makes every process very click-heavy and annoying. StarCluster is a library written at MIT that lets you manage all of that from your laptop with a configuration file: once you've set things up, you write a config file, and from your laptop you can start clusters, stop them, see which ones are there, SSH into them, attach persistent storage volumes to them, and so on.

The two main pieces of Amazon we're going to use for basic computing are EC2, which stands for Elastic Compute Cloud and is basically the starting and stopping of virtual computers on their network, and the storage, which is called EBS, for Elastic Block Store. EC2 instances are virtual computers; EBS volumes are virtual hard disks — things that behave like hard disks, which you can attach to an instance and which appear as a directory on it. You need them because by default the virtual hard drives of the instances themselves are fairly small, and furthermore, if you want to keep a machine image private you can, but then you pay for it the whole time it exists; if you don't want to pay for the virtual machines when they're not in use, you make the image public. So what most people do is put only the software they're going to use on the machine image and make that public — in case anybody else wants to turn it on, paying the cost themselves — while keeping their data on a separate virtual disk mounted on the instance. Amazon charges you separately for the storage and for the compute, and StarCluster makes it easy to manage both.

There's another kind of storage called S3 that we're not going to go into now. S3 doesn't look like a hard disk, whereas EBS volumes do: they appear as a partition, you format them like a hard disk, you put files on them — they're just like a hard disk.
They're just like a hard disk S3 storage is more like a database kind of thing They have water called buckets of storage in which you put data into the buckets and you retrieve data from the buckets The advantage of s3 is that it's cheaper So for certain applications, it's worth learning how to use it and a lot of web database kind things can be done with s3 but For starters you should this is I think the model that is worth learning for at the beginning because it basically is Exactly the same thing that you have in a real computer, which is not computers and hard disks And just happens to be all virtualized into the cloud okay, so here are a few a few resources and now we're going to start with Working a little bit with with Amazon and star cluster. So is everyone more or less set up? Okay So now I'm going to switch to emacs Can can those in the back see my screen? Is that big enough is the font the font large enough for you guys to see in the back? Oh, actually, I just realized that they're filming these on video, but they're using a really low resolution 480 so I'm going to actually make the font a little bit larger For the sake of anyone who might be watching this later on and this might still not be large enough But I don't want to make it absurdly big So ah So here's what here's what your is everyone did everyone Successfully do both setting up Amazon and setting up star cluster because we are going to use both Okay, so I'm assuming you already have your AWS info filled up as in here and You have your Amazon SSH key also filled in so this is what this is what your config file Would would look like once you're done I've removed from here so that they're not they ended so that my private Amazon key doesn't end up on on a webcast on YouTube So I've basically made a copy of my file of my actual config file with with this part removed So you should have these three feel these are the only three feel these are the well These are the main three fields that have secret information and then you have to also give the path to your SSH key And once you have done that Then you define your clusters so the way EC2 works is It has a notion of a cluster template Which is basically a description of how you want to start your cluster and I'm going to go through the fields of this Cluster of this configuration and then templates can each template has a name and templates can extend other templates and replace fields in them And so let's have a quick look at the default cluster template that I have in here is called small cluster Each template has a name. It's this field right here You have to tell it which key to use you may want to authenticate with different keys for some reason to Amazon So each cluster can have a different key you ask how many EC2 nodes do you want to start in the cluster? You you ask to create a specific user in the cluster this it's a good idea to leave leave this alone because Sungrid engine actually does certain things with this user By default to bring all the machines up. So I wouldn't modify that you can specify the shell if you want This is a very important field the Amazon image that you want to use an AMI AMI thing stands for Amazon machine image and it's basically a copy of the virtual Hard drive of the computer that is going to boot and there is a listing of Amazon images that you can go and see if we go to the console If we go to the EC2 console. 
Oh, I have to sign in. So this is the Amazon console — their fonts are such that you can't actually read anything, but so be it. Here's the EC2 tab. It gives me a summary of what I have running: how many elastic IPs I have, how many volumes are active, and so on. Elastic IPs are fixed numeric IP addresses that you can attach to your Amazon instances. If I click on "running instances" it shows me which instances I have: I started a cluster with one master node and four worker nodes, all of them micro instances. If I click on one of them, Amazon tells me where it's running — the URL where it runs, which is always a wonky-looking URL ending in amazonaws.com. But you can ask them for a permanent IP address, and you can even configure DNS to associate a name with it. This is how many companies actually run their web servers: they don't own servers, they run them on Amazon, associate an elastic IP address, register a company name pointing at that address, and they're done — they have a virtual data center in the cloud. So if you want to give people a more persistent address than these funky auto-generated ones, you can request what's called an elastic IP address. I haven't done that.

Here under Images you can see a listing of public images. In this case it's showing me which AMIs I own — I don't own any, I've been using publicly created ones — but if I ask for all images... it takes a while... there we go: this is a long listing of all the images available on Amazon that people have made public. You can search this list, or somebody can give you the name of a specific AMI. Once you have one of these names, that's what you put into the config, and it basically says "boot that computer." Each image was created with some set of libraries pre-installed, and the author of StarCluster bundles Amazon images that have StarCluster pre-installed along with all the standard Python libraries for scientific computing baked in, so you don't have to do that yourself: you can use one of Justin's AMIs and it boots up with everything already there. So that's what the node image field means — I think this one is the default StarCluster image; yes, he has them in 32-bit and 64-bit, and these are the defaults: the 899 one is the 32-bit image and the 999 one is the 64-bit image.

And here is where you choose the instance type. Amazon has a table explaining what these are — this one is roughly equivalent to a computer with this much RAM, this much disk, this much CPU, and so forth. The t1.micro is the one you get for free: for one year after opening your account you get an allotment of 750 hours a month of t1.micro, which is roughly equivalent to a fairly small laptop — about 600 MB of RAM and not much CPU.
So it's kind of like a netbook, but you get it for free: you can run it 24/7, up to 750 hours per month, and they won't charge you.

[Audience questions about billing.] Yes, those you pay for. If you've been running the other instance type since today, you might end up with a bill — probably a dollar, maybe. So, t1.micro — I'm sorry about that, guys, I should have announced it; that's a good question, and we can suggest that the defaults actually start with the t1.micro. And no, it doesn't matter: they charge from the moment the instance starts up. Amazon doesn't charge for CPU — they don't measure CPU utilization, they measure the time the instance is up.

I was going to mention this later, but might as well now. Stopping an instance allows you to restart it quicker, but they charge you for the time a stopped instance sits there — less than for a running one, but a small amount for keeping the instance nearby — whereas terminating it means they delete those temporary files, so the next time you start it may take a few minutes longer. It's the difference between taking maybe five or ten minutes to start versus one or two. So stopping is what you use if you basically want to reboot the computer; terminate is what you use if you're not going to use it for a while. Even if you're only stepping away for 24 hours you might want to terminate, because the extra overhead is minutes — it's not as if a terminated cluster takes half an hour to restart. So I do apologize: I didn't realize the default in this file wasn't micro; I edited mine so long ago that I'd forgotten. Hopefully nobody will end up with an Amazon bill of more than a few cents, maybe up to a dollar, so it shouldn't be a huge deal.

We can actually look at their pricing — I've seen it, I just don't remember the EC2 numbers exactly. On-demand instances: which one was it he had, the small? Right, the small default is 8 cents per hour, so you've been paying 8 cents per hour of usage — yes, billed in one-hour blocks, per node, and by default you can only start 20. In that case, yes, it would be 16 cents per hour. No, it doesn't track that —
I think Amazon has a console for that, but I don't think StarCluster tracks that information.

You can also choose what's called the availability zone. You really only need to do this if you're worried about data transfer to and from a specific location; for normal use, the default is fine. AMIs live within a region — they're only available in one region — and the US East Coast region has more AMIs available. But if, for example, you're running a business or you're going to upload large amounts of data, you may want to say "I want to start in the Northern California zone, because my data is on a server on campus and I want the lowest network latency to the Amazon data center" — you want the data center that's geographically closest to you. You don't control the exact machine it runs on, but you're targeting a part of the world.

[Question: is the limit on the number of instances, even though some of those instances are multi-core?] The multi-core ones are these: the medium, extra-large, high-CPU, and cluster-compute types — actually, I think those are the only ones considered multi-core. So for example, if you start the extra-large ones, those are eight-core machines: you pay for one instance, but StarCluster will start eight IPython engines on it, because it knows it's an eight-core machine. The non-trivial example we're going to go over now was actually run using four of those, which gave us 32 engines for the computation, and we were paying $2.40 per hour, times four. Exactly. What's that?
Yes, you can also choose your EBS volumes: once you've configured your volumes you give them names — these are your virtual hard disks — and here you tell the cluster which volumes you want attached. And finally, importantly for us, plugins: here we define which plugins we want to use, and in a moment I'll show you how the plugins are configured. My default template didn't define any plugins, but here I have — ah yes, I do: in plugins I said I want the ipcluster plugin. The ipcluster plugin is what starts IPython, with all the parallel machinery, on the cluster.

Let me first give you a sense of what ipcluster does locally. If you run ipcluster --help it gives you some information on how to start it and so on (our disk is slow). ipcluster is a command that starts the IPython cluster, and it has subcommands such as start and stop. This is how you start it with four engines, for example: ipcluster start --n=4. That started a cluster with four engines on my local machine, so you can do this locally. What the plugin on StarCluster does is configure all of that for you — it configures the ports and the HTTP traffic and so on — so that not only do all of your nodes come up as a cluster, they also come up pre-configured to be accessed from the outside. It simplifies things quite a bit. But you can do this locally, which means you can debug your codes with ipcluster on your own computer, and only run on Amazon when you want to scale beyond what your machine can do.

Yes — I would suggest that you add the ipcluster plugin, and we're going to see it in a second. When you define a cluster, you can base its configuration on another cluster's configuration: you can say, for example, "I want a cluster just like this other one, but with a different AMI, or a different number of nodes" — you change only the parameters you wish to change. Here I had been toying with something — I don't remember what that AMI was for; I think Justin and I were playing with something — and in that case I wanted a cluster size of one: I just wanted to start the node to test the connections, and didn't even want any more engines.

In particular, there's an interesting one I want to show you. I want to show you a non-trivial example now — hopefully the demo will work on micro instances, we'll see — a non-trivial example of using the APIs, to motivate that you can do interesting science with this stuff, and then later we'll come back and see the basic building blocks. My experience is that if we demo just the bottom-level API — "this is how each function works" — people get a little bored, so hopefully a scientifically interesting example first will be more engaging, and then we can go back and look at the basic pieces. I will upload all the notebooks that have the piece-by-piece tutorial material.
I'll upload them for you — and sure, I'll upload my config file as well.

Oh, let me show you one more thing first: what a session looks like, because it can take some time. I started the cluster in advance from my office and copied the shell output here, because I didn't know how long it would take — sometimes, when these haven't been started recently, it takes five or ten minutes while Amazon wakes up, and I didn't want to wait for all of that live. So this is a copy-paste from my terminal from about two hours ago. You say starcluster start and give it the template you want to use — in this case the "chime" template. Let me show you again what that chime template is: it says take my smallcluster configuration but change the AMI; I want five nodes in this case, one master and four engines; I want the micro instance; and I want the ipcluster plugin configured.

Oh, and one more thing I forgot: this is where you configure the plugin. When you configure the plugin you should put at least these three lines: which setup class to use; whether you want the notebook to come up by default or not — I'd say it's a good idea to have it come up, it makes life a lot simpler — and a password for the notebook (a sketch of a whole config file along these lines appears below). I'll give you the URL in a minute and you can log in to mine as well if you want; this is how you can collaborate with a colleague — you start the cluster and then two people can log in if they have the password. The password is case-sensitive: ipython-demo-2012.

So once you start it: this name here, "chime," is the template name — each cluster template has a name, so you're saying "start with the template chime" — and then you give this particular instance of the cluster its own name. The reason you name it is that you can start the same configuration more than once: you might want the exact same cluster machinery two or three times in a row, for example to test one thing with your algorithm while the part you know works runs in production on another data set. So each time you start one, you give it a name. Once you give it that information, StarCluster goes to work. It gives you lots of output: it launches things, creates security groups, configures SSH, configures NFS, and so on — it does a fair amount of work for you. Then it says "oh, you wanted the IPython cluster," so it starts ipcluster; it detects that you wanted four engines, so it starts them; and since you also wanted the web notebook, it creates an SSL certificate for you. It really does a lot.
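Putting the fields just described together, the relevant parts of a StarCluster config file look roughly like this. This is a sketch from memory of the StarCluster documentation, not my actual file — the key names, AMI ID, and password are placeholders, and you should check the current docs for the exact option names:

```ini
[aws info]
AWS_ACCESS_KEY_ID = <your access key>
AWS_SECRET_ACCESS_KEY = <your secret key>
AWS_USER_ID = <your user id>

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[plugin ipcluster]
SETUP_CLASS = starcluster.plugins.ipcluster.IPCluster
ENABLE_NOTEBOOK = True
NOTEBOOK_PASSWD = <notebook password>

[cluster smallcluster]
KEYNAME = mykey
CLUSTER_SIZE = 5
CLUSTER_USER = sgeadmin
# one of the public StarCluster AMIs
NODE_IMAGE_ID = ami-xxxxxxxx
NODE_INSTANCE_TYPE = t1.micro
PLUGINS = ipcluster

[cluster chime]
# extends smallcluster, overriding only what changes
EXTENDS = smallcluster
NODE_INSTANCE_TYPE = t1.micro
CLUSTER_SIZE = 5
```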
Basically, half of the Amazon tutorials are bundled into that one command, and half of our own IPython tutorials are bundled into it too: with three lines of config, StarCluster does everything for you. Once it finishes, it tells you where things are running. The notebook URL is this one, so if you go to it with Chrome or a newish Firefox you should get a login screen, and you can log in with the password I just gave you, ipython-demo-2012. Yes, go ahead — I started that one, and we'll destroy it when we're done. I'm going to leave that up for a second; let me know whether you can log in or not. You'll get a warning about the certificates — these are auto-generated, self-signed SSL certificates, so every browser will warn you; just tell it to go ahead. I should have gotten an elastic IP for this.

Is anyone having problems logging in? It worked — okay, good. Which browser are you using? Ah, you know what, you're right: Safari has a WebSocket problem that I don't know how to work around, so use Chrome or Firefox. It used to work, but a recent Safari update broke something in how it handles WebSockets, and it just doesn't work. Sorry — so either Chrome or Firefox.

Once it finishes, it tells you that in this case the full startup took almost four minutes and the cluster is ready to use. I can SSH into the cluster now — I can actually type that and SSH in — I can restart it with the restart command, stop it with the stop command, or terminate it with the terminate command. As we already discussed, stop allows quicker restarts but you pay a little for keeping the node storage around in frozen mode, versus terminate, which destroys that and takes a few more minutes to start again. It's your choice, but the total startup time is on the order of five minutes, so using terminate is perfectly reasonable unless you know you're going to restart within the next few minutes or half hour. I normally always use terminate, because I'd rather have to wait than forget that I left something stopped and come back two months later to realize it's been sitting there, charging me, for no good reason.
[Question: does StarCluster have something built in to terminate forgotten clusters automatically?] Not that I know of — there may be, but not that I know of. You could do it with StarCluster, though: you could set up a small cron job, because StarCluster does give you a fair amount of information. Let me show you — I'm going to stop the local one on my laptop now. StarCluster has a listclusters command; what he's asking for is to terminate clusters that aren't even running, just sitting stopped, once they've been stopped for a while. When you type starcluster listclusters, it knows when each cluster was launched, which instances it has, and so on. So I imagine you could write a little cron job using the StarCluster APIs that runs once a day and, if it detects that the stop time was more than, say, a week ago, issues the terminate command. StarCluster is itself a Python package, not just a command you call, so I'm sure you can make the low-level API calls to get that timestamp as a datetime object, take the difference from today, and if the difference is greater than seven days, issue the terminate. It would probably be half an hour's worth of StarCluster exercise to write that and put it in your crontab on a machine that's always running.

Okay, so this is the information you get from starcluster listclusters: it tells you where the clusters are running, including their URLs, so if I didn't know that URL I could always get it from here. By default the notebook always runs on port 8888 — you can change that, but that's the default. So everybody should have this visible, right?

I want to walk you through this as our first hands-on demo — we'll see if it works — and it comes with an interesting little story from last week. Last week I went to Colorado to give a couple of talks and to attend a workshop the NIH was organizing on genomics in the cloud. Who here is in the bioinformatics, genomics, or biology field? Okay, so we have one person in the room who actually understands this — not me, I'm a physicist by training. Min Ragan-Kelley, one of the authors of IPython, and I were invited to participate in this workshop, and one of the things we were scheduled to do, on Tuesday afternoon, was give the audience a demo of how to use IPython with StarCluster to work in the cloud. I told the organizer of the workshop: look, I can put together an example — we have some tutorial examples that are toy problems with matrix multiplications and things like that. He said it would be very cool if we had a biologically relevant example for this audience, and I said yes, that would be very cool, but I have no idea how to do that.
However, I said, if you help me, maybe we can come up with something. So the Friday before, we had a brief discussion — him and a former postdoc of his who's now at Northern Arizona, in Flagstaff — and they came up with an idea and sketched it out to me, and I said: if what you're describing is really what the code needs to do, I think we can do that. And that was the end of it; we just talked about it on Friday. Then Monday night we sat down in the hotel lobby after the first day of the workshop, went over the basic ideas, and started checking that the AMI was working — Justin Riley, the author of StarCluster, was at MIT helping us over IRC, so he started getting the AMI ready — and we sketched out the pseudocode, and then we went to bed. That's it: no actual coding, just an email with a little bit of pseudocode.

Then on Tuesday morning at 9 a.m. we actually started coding; the demo was meant to be presented on Tuesday at 4:30. We had two people, Min and I, who knew IPython very well and had no clue what the biology code did; we had three biologists who had written the genomics libraries but had never used IPython or StarCluster — had never heard of them; and we had the StarCluster author at MIT helping us with the AMIs. And we all started coding together, because once you're logged into the notebook, everybody can edit the same notebook. We don't have Google-Docs-style real-time synchronization yet — we're working on implementing that — but multiple people can edit the same notebook; the only thing is that if one person is going to make a lot of changes, they have to say "I'm editing." So what we did was make a few copies of the notebook — you can just ask it to make a copy — and each person would edit different cells, say "this part is ready," and then one person would copy-paste that into the main one. We were effectively merging manually, but it allowed us to test in the real cloud environment and start building the parallel code.

Around 2 p.m. I suddenly saw Rob Knight, the organizer of the workshop, and Greg, the Northern Arizona fellow, looking at a plot and getting very excited. I asked what was going on, why they were so happy, and they said: we think we need to write this up as a paper, because this is actually a biologically interesting result — the analysis led to a very interesting finding. So they called the editor of the ISME Journal, one of the Nature-family journals focused on microbiology, and asked whether they could submit a short three-page commentary for review as an outcome of the workshop, and the editor, after hearing the story, was thrilled. We basically finished the paper yesterday; now we're just polishing the formatting, and it should be submitted by the end of today. I had never seen anything like it: you start writing code at 9 a.m. and by mid-afternoon you're basically ready to write up the paper.
I'm not saying this is always going to happen — I wish it did, but I know better; these things happen once in a blue moon. What is true, though, is that the tools enabled three different teams with basically no overlap in skill set to work together on parallelizing a non-trivial analysis and running it in the cloud. What we had by that point would run in about 10 minutes on 32 engines. For the real production run for the paper we reran the analysis, and that rerun took about 24 hours, because we scanned a larger part of the parameter space to make sure the results were robust: it was restarted on Wednesday and finished around Thursday. But that was about a month's worth of CPU time — we measured how much total CPU was used, and it was an analysis that would have taken a month on one machine. So with these tools we started from scratch on Tuesday, and by today we're basically finished with the paper. If they had had that idea and started the run on a single machine on Tuesday, they would still be three weeks from finishing it.

The total cost will probably amount to a few hundred dollars. I think the run used five or six of the $2.40-per-hour instances for about 24 hours, so on the order of ten dollars an hour for 24 hours — a few hundred dollars. That's not a trivial amount of money; in this case we had money from the grant organizing the workshop, up to five hundred dollars to spend on CPU time. But I don't think a couple of hundred dollars is an outrageous price for a valuable result you can get a paper out of, considering that once we shut these machines down, that's it: we don't have to administer anything, we don't have to worry about the machines. Yes, you can buy hardware for a couple of hundred dollars, but not that much, and administering a 64-node cluster begins to be a real concern: it's a lot of money to buy, and power dissipation and management of a cluster that size take real work. So this is a sensible alternative.

Now I'm going to show you what the actual execution looks like. In this case we have four engines available. This is what the API looks like: you say from IPython import parallel, and you make what we typically call rc, a remote Client object. (If you're using the StarCluster-launched cluster you have to use the pickle packer — a little technical detail covered in the StarCluster documentation.) A Client is something that has the IDs of the engines, but the client itself doesn't directly let you send code for execution, because it doesn't know how you want to execute it: on one engine, or on all of them?
Broadly speaking, there are two classes of policy for code execution that you want to have in parallel computing. Either you have a chunk of code that needs to go to everybody, or at least to a given subset of your engines; or you have a piece of code that just needs executing and you don't care where it runs, you only want the answer to come back. Those are, broadly speaking, the two classes of execution policies, and so what we do is have the client return objects called views, where a view is an object with a specific policy for execution.

When we ask for a load-balanced view, the view object knows how to execute code, but in a load-balanced fashion. What that means is that the view in this case has four engines, and when you give it a job it hands it to the first engine it finds free; if you give it ten jobs, it begins sending them to the four engines and, as they complete, it keeps them busy, in a load-balanced manner. The other type of view is called the direct view, which is meant for sending code to everybody. We offer both because in the same session you often want both. Sometimes you want to say: I need everybody to load this data set, I need all my engines to initialize by loading these libraries; and once that has been done, then in a load-balanced fashion run over this data set in little chunks, I don't care who does which chunk, I just want all of these jobs done; and when they're done, I want to collect the following variable from all of them to do aggregate statistics. Because it's common to need both, IPython offers both objects, a direct view and a load-balanced view.

In this case, when you slice the client you get a view object which is a direct view, so this is, in one line, getting a direct view and writing to it as a dictionary. It turns out that direct view objects can be manipulated as a dictionary: if I make a direct view over all of my engines, I can index into it and send data to it. In this case I sent a variable called tutorial, whose value was 10, to all of the engines; and in the next line I retrieved the variable tutorial from all of the engines, and because I have four engines I get a list of four numbers, in this case four copies of the number 10. This is useful when you need to send a flag to all of your engines, or when each engine has computed something with different values and you want to pull them back to summarize them and do aggregate statistics on your side. So it makes it very simple to push and pull data to and from your engines.
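Sketched out with the same API (the variable names are just illustrative), the two kinds of view and the dictionary-style push and pull look like this:

    dview = rc[:]                      # slicing the client gives a DirectView
    lbview = rc.load_balanced_view()   # a LoadBalancedView hands work to free engines

    dview['tutorial'] = 10             # push the variable to every engine
    dview['tutorial']                  # pull it back: [10, 10, 10, 10] with four engines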
In our documentation we have full-length tutorials, but here I want to show you this example, which is a non-trivial one, and just explain it bit by bit. So this first thing is simply a little utility used to monitor, to wait on, a result that has been sent for execution. You will see later that when we send things for execution, you can do it in one of two ways, what's called blocking mode or non-blocking mode, and this is an important concept in doing parallel work in general. What is the difference between blocking (synchronous) mode and non-blocking (asynchronous) mode? When you do something in blocking mode, what you say is: execute this, and I'm going to sit here and wait until you're done, and when you're done I want the answer back; I will stop executing my own code until you're finished, because I want the answer to be the actual computed quantity. You can also say: do this and go on, I will go on doing my own thing, and when I need the answer I will try to get it. IPython lets you do both synchronous and asynchronous execution. Synchronous execution is nicer when working interactively, because it lets you do something, wait for the answer, and get it back, so it feels more like the interactive workflow. But if you're going for efficiency, you're often better off doing asynchronous execution, because it lets you send large amounts of work to the machines and continue doing something in the meantime.

So this is a utility which, given an asynchronous result, tries to show you a little bit of progress. Asynchronous results have a ready method which tells you whether they're ready or not, and you can wait on them; so this utility simply waits for a second at a time while they're not ready and prints a little output. We'll see later that it's useful for giving us interactive feedback on how things are progressing as we execute them.

And by the way, I'm not exactly sure how well this is going to go; I'm actually going to remove a few more entries from this data set, because I'm not sure it's worth running the full-length one, and I think that one is going to be a little too long.
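A sketch of what such a progress-watching utility might look like (the function name and polling interval are made up; ready() and wait() are the AsyncResult methods just described):

    import sys

    def wait_on(ar, dt=1):
        # Poll an AsyncResult, printing a tick until it has finished.
        while not ar.ready():       # has the result arrived yet?
            ar.wait(dt)             # block for at most dt seconds
            sys.stdout.write('.')
            sys.stdout.flush()
        print('done')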
OK, so the analysis that we're going to perform, if my understanding of the genomics these guys explained to me is correct, is this. The problem in question is that we have a bunch of genome sequences: sequences consisting of the letters A, G, C, T, possibly with dashes, and all of these can in principle be aligned to each other, with gaps between them. If I use the entire sequence to try to reconstruct the phylogeny, where this came from in an evolutionary sense, I will get one specific tree from the entire sequence. But I can also try to reconstruct that information with partial data, using only a subset of the sequence, and the question is: which parts of the sequence? What is the better strategy if you're going to reconstruct the tree with partial data? Is it better to use as much partial data as you can, or can you get away with smaller amounts? The obvious intuition was: well, if you're going to throw away data, the less you throw away the better, so longer subsequences should be better. And the interesting finding was that you can actually get better reconstruction with partial sequences if they are appropriately chosen; it turns out there are certain regions of the sequence that they have identified as having very high variability, and some of those regions turn out to be the most informative ones. So the analysis was to find the effect of subsampling on the tree reconstruction.

What these entries tell us are exactly the boundaries of those regions, where to start and where to stop reading: for each of them, start here and stop here. These are the different start and stop points, and they have names. I removed the one that said full sequence, so the example I'm going to show you, when we run it (and it may not finish), is probably not very meaningful biologically, because I removed the comparison against the full length. But the point is to make it run quicker, because I didn't want to turn on the large instances and pay for them, and I don't want to run it on the real, large instance where this is actually running, because the other guys are finishing up the paper and I don't want to mess up the results while they're writing it up.

So this is how you start parallelizing a script. This was the part of the code that had to load the data; in each of your problems the code will look a little different, but we've all had code like this: look for a file, see if it exists, if it's already there skip it, otherwise call some library that knows how to read my data, figure out where the data is, do something to it, and return the name of the data that was loaded. The specifics don't matter a whole lot. The point is, in order to do this in parallel, we take this typical kind of loading script, and all you have to do is put it into a function. That's it: put it in a function, because the basic API for sending things to execute on your engines and for defining tasks is functions.

This is the line of code that actually does the execution: it calls the load-sub-alignment function right here and maps it over the sequence files and the region boundaries. It maps it in non-blocking mode, and the syntax of our map functions is exactly the same as the syntax of the built-in map function: map a function over a set of sequences. Map is a built-in of the language, so if you know how to write something using map, you know how to parallelize it with IPython; if you have a function and arguments, you do view.map(function, arguments), and that's it. By default, whether map is synchronous or asynchronous depends on a flag called block, true or false, and you can change that state; but if you want to be explicit you can use map_sync or map_async, which are always explicitly synchronous or asynchronous regardless of the value of the flag. So once we've done this we have an object called amr, and now I'm running the wait utility on amr, which basically tells me what has finished.
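Roughly, that step looks like the sketch below; the function body and the names (load_sub_alignment, seq_files, regions) just follow the description above and aren't the real demo code:

    def load_sub_alignment(seq_file, region):
        # Check for the file, read it with the genomics library, slice out
        # the given region, and return the name of what was written.
        out_name = '%s-%s' % (seq_file, region)   # placeholder for the real work
        return out_name

    # Non-blocking map over the two argument lists; returns an AsyncMapResult.
    amr = lbview.map_async(load_sub_alignment, seq_files, regions)

    # Explicit variants:
    #   view.map_sync(f, args)   blocks until every task is done
    #   view.map_async(f, args)  returns immediately
    #   view.map(f, args)        obeys the view's block flag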
It was a good idea to start it and see; it's actually going to take a while, so we'll see how long the tasks take to finish. I probably should have started this even earlier. But we can keep working here, because we're operating on the master node, which is where the notebook is running, while the engines are over there busy doing their thing. This is why I was able to run this code right here: right now the kernel for the notebook is busy because it's giving me this progress information, but I can always stop it, either with the stop button or with the menu option Kernel, Interrupt, and then I can continue working. And hopefully they'll finish soonish; it is making progress, but it's slower than I thought. If it doesn't complete, it's not a huge deal, because even if I can't show you what the completed results look like, what I really want to show you is the code, and that this is really executing in parallel and letting you continue your analysis while the engines (in this case the poor four engines) are busy doing their thing.

So here's another little utility to print some statistics, given an asynchronous result, to summarize timing information. Why is it busy? I can't call the one that prints parallel statistics yet, because this probably still hasn't finished. Let's have a look and see where it is: still 21 out of 56. OK, maybe I should have cut even more of these out. Ah, shoot, I did leave all three percentage ranges in. I did leave all of these boundaries, but for each boundary it's evaluating at a certain percentage of matching, and I have three percentages of matching, so the total task list is the cross product of all of these boundaries with those percentages. That's a parameter of this algorithm; I don't know exactly what it measures, it's a level of match in the trees, but it's checking all of these different pieces of the sequences. So I defined seven regions, and these seven regions are being done for all three percentages. Yes, it's a list between 76 and 98, which means it's all of those values: 56 tasks. That's what I forgot to change; I cut a lot of regions out of the analysis, but I forgot to make this part significantly shorter.

Let's see where it is... and we found a bug. I was going to cancel the tasks and restart them, and I get "AsyncMapResult has no attribute client". OK, so this is a bug that I need to file: the abort call is failing. We need a test for that and a bug report for that. Paul, can you report that one? Thanks. Min will have it fixed by the end of the talk. Well, actually, I know how to do it now. No, but you'll be able to re-log in; we'll all have to re-log in. What's that?
Here, let me show you the traceback: calling abort on an async result results in an AttributeError, around line 198. Put it in a gist so you can copy and paste the traceback: 2347182. What's that?

So let's see, it's almost ready now; it's setting up passwordless SSH. So now you see it's not that bad to restart it, it's actually not that bad, because once it gets going these parts are reasonably quick. What I've noticed is that it's the very beginning that sometimes takes a while, just to start at all, and that's when the instance image hasn't been moved over to readily accessible disks: when it was terminated long ago and flushed out, Amazon has to copy it from the long-term S3 storage into the usable storage, and that's the step that sometimes takes quite a while, where you're waiting for five minutes and nothing happens. But once it gets going, this part is typically a couple of minutes; it's that initial part that can take five or ten minutes. There we go: the total was 2.6 minutes. And did we get the same URL or not? 50.17.172.70... yes, 50.17.172.70; nice, we even got the same URL. OK, but I do have to re-log in, and so will you; the cluster here is the ipython-demo-2012 one, so log in and open this guy.

So I'm going to run this, but because they're running fairly slowly I'm going to remove those base percentages and make it a lower value, so now we have a total of six tasks instead of 56. It shouldn't be too bad; I know they weren't exactly screaming, but six will finish much, much quicker than 56. We should have been able to abort it, but obviously we just found a bug in IPython, so we need to fix that. And rebooting was kind of a brutal way to do it, obviously; I could have gone and stopped the processes manually, because StarCluster lets you SSH into the node yourself, so you can go in, type ipcluster stop, and restart the engines. If I had done that I wouldn't have had to completely reboot; I could have just stopped the engine processes and restarted them again by typing the IPython cluster commands myself. Because if you do "starcluster sshmaster" and the name of the cluster, there you go, you've SSH'd into the cluster. So you can run top and see who's doing what, and you can SSH into one of the instances and see who's spending time. So now you can SSH into node001, for example... oh, not as root, I have to SSH in as the user... Why is that not working from here?
Anyway, I thought I could SSH into all the nodes that way, but at least I can SSH into the master node and kill the commands from there. I'll have to ask Justin why that's not working, because I thought that once you'd SSH'd into the master node it was set up so that you could SSH into any of the instance nodes directly from the master, and I don't know why that's not working.

OK, so it's done. In this case it took a little over a minute, and it finished all six tasks in that time, and this was the slowest part. So now this is the output with the partial alignments, and here we can print: this little utility gives you some statistics of the time. We can see that the time was dominated by one task that took the longest, but in total we basically waited just 66 seconds, and we can see that the longest task was 66 seconds. So one engine ended up stuck with that, but at least all the others got done in the meantime, and the other engines took care of flushing the rest of the tasks, so we only had to wait as long as the longest task took. And this is why you want to use load balancing: when the distribution of times between your jobs is uneven, you want somebody to keep feeding jobs to the engines as they become available.

Now, the next step in the process is actually a Python script. Obviously, on the cluster we could have used it as a library and used the Python API for that library, but we're actually calling it as a subprocess, because it's very common to do this kind of thing: it's very common to have an analysis pipeline where some intermediate step is a command-line call that you have to make to another process. So here we're writing, as a function, something that takes a file name, calls the next step of the process with certain parameters (what they are doesn't matter a whole lot), uses subprocess to call it, and then returns the name of the output. In this case we called it with map_sync: this step takes almost no time at all, so I called it with map_sync, it took a total of 1.6 seconds, and it was finished. I actually called it with %time, so IPython gives us timing statistics. So all we have to do is define this part of the script as a function, call map_sync, and give it the names of the sub-alignments, the outputs we got from the previous step.

And this is how, when you have one of these asynchronous results, which as I said is a promise of something that will be computed and may finish later, you get at the data: when you want the actual result, you call .get() on it, and this waits and gives you the actual data. The amr object itself isn't the thing you asked to be computed; it's a proxy that has some information about a computation being done somewhere else, and with .get() it waits for completion and gives you the actual value. And you can call .get() with a timeout in case you don't want to wait forever: .get() with a timeout of one second waits for at most a second, whereas if you simply say .get() it waits until the actual answer comes back, and then you have the value. So we got the output of the previous step, and we feed it into the next step.
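As a sketch of that pattern (the tool name and flags here are placeholders, not the real pipeline command):

    import subprocess

    def run_next_step(fname):
        # Wrap one command-line step of the pipeline as a function.
        out_name = fname + '.out'
        subprocess.check_call(['pipeline_tool', '-i', fname, '-o', out_name])
        return out_name

    # Blocking map: wait for all the calls to finish before moving on.
    next_outputs = lbview.map_sync(run_next_step, sub_aligns)

    # For a non-blocking result, .get() waits and returns the real values:
    #   amr.get()           wait as long as it takes
    #   amr.get(timeout=1)  give up after one second (raises a TimeoutError)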
Now we have to build all these trees, so this is again another function, one that actually will take some time. In this case it's another function using these libraries that aligns all of these different trees: it takes some files as arguments and calls routines; specifically what it's doing doesn't matter too much. The point is that you're seeing how to structure a long analysis pipeline: break it into chunks, make those chunks into functions, and once you've made them into functions, where the outputs of one can be used to feed the next, the actual execution in parallel is pretty straightforward, because this is all it takes: calling various versions of map_sync and map_async, with either a direct execution view or a load-balanced execution view, and with the little utilities we built you can print statistics.

So let me execute, and note that you can queue things up. IPython shows, with this star right here, that this cell is executing, so the kernel is busy waiting because it's executing this cell. This is right now on the node running the notebook: the kernel for this notebook is the one that's busy, in this case simply busy waiting. But you can queue up multiple cells. I hit Shift-Enter on this one too, so now this cell is also queued for execution, and once the previous one finishes, this one will run as well; and I can keep queuing cells for execution, and they will just get completed.

Now, in this case we're building a matrix of distances: we're getting a bunch of trees and we need to build all the pairwise distances. So imagine you have to build a bunch of pairwise matrices and you want to do that in parallel. One way to do it is to have each node hold information about part of the matrix and compute its entries into a matrix that has zeros everywhere else, and as long as the parts don't overlap, you can then add all of those matrices together at the end, and that's a very quick step. OK, so this finished, and we had to wait a total of roughly 135 seconds, which is about how long the longest task took. In fact, there's a slight discrepancy in the clocks, where it thinks the longest task took 135 seconds but the client only waited 134; I'm not exactly sure why there should be any discrepancy in those clocks. It may just be a little bit of rounding, because there is no way we can finish in less time than it took for the longest task to actually complete.
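A sketch of that partial-matrix idea (tree_distance is a hypothetical helper standing in for the real comparison routine):

    import numpy as np

    def pairwise_block(trees, rows, cols, n):
        # Each task fills only its own entries of an n x n distance matrix,
        # leaving everything else as zeros.
        d = np.zeros((n, n))
        for i in rows:
            for j in cols:
                d[i, j] = tree_distance(trees[i], trees[j])  # hypothetical helper
        return d

    # Because the non-zero entries of different tasks never overlap,
    # the full matrix is simply the sum of all the partial ones.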
Now, once we define this function, which initializes the matrix at zeros and then calls the comparison function, the way we can compute the full matrix is this, and it's a very elegant trick. If you look here, amr is the async result which was the output of running the comparison function over all the trees, and we've written a simple function that does a summation of two matrices and, in this case, actually prints progress as it does it. Python has a function called reduce: it takes a function and a sequence of arguments, and begins calling that function with the first two, then with the output of that and the next one, and the output of that and the next, and so on. So if you call it with a function that does a plus b on a list, it sums the whole list, because it adds the first two, then the output of that with the next, and so on. Well, it turns out that asynchronous result objects can be looped over like a list, and they can be looped over while the results are being computed. So by writing this code right here, we're calling the sum function (which prints as it goes) and letting it add these matrices as they come back from the nodes. We're actually looping over the output of the parallel computation and aggregating it as the nodes produce it, stacking these matrices up to get the combined matrix. And now we can actually look at this combined matrix; in this case, because we had so few tasks, the results are probably not terribly interesting.

The final step was to make a couple of command-line calls for the visualization, for plotting the final result. Since I don't really know the biology, and that's not what matters to us here, I'm not going to dwell on these last two steps; they don't take much time, and this one generates a file that's supposed to be opened with a Java viewer, which doesn't actually run on my machine, so I don't know exactly what went wrong there. The point is that at the end this was just a local step: IPython lets you make command-line calls in the notebook, because if you start a line with an exclamation mark, anything you put after it goes to the command line, which means that if you need to call system utilities on your cluster, you don't have to SSH in through a separate terminal; you can do it all within the same notebook environment.

So this is actually the complete analysis that we ran, and as I said, the full-blown version, which we started on, I think, thirty-some nodes, ran in 24-plus hours; an analysis that took about a month's worth of CPU time, and this was the real code. It really is not very hard.
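Going back to that reduce trick for a moment, a minimal sketch of it would look like this (add_progress is an illustrative name; amr is the AsyncMapResult from mapping the comparison function over the trees):

    import sys

    def add_progress(a, b):
        # Sum two partial matrices, printing a tick so we can watch progress.
        sys.stdout.write('.')
        return a + b

    # An AsyncMapResult can be iterated over as results arrive, so reduce()
    # folds the partial matrices together while the engines are still working.
    full_matrix = reduce(add_progress, amr)   # functools.reduce on Python 3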
So hopefully seeing a real example in action will make the tutorial materials a little more understandable, because now I want to switch over to Paul to talk about GPUs. If you go to ipython.org we have extensive documentation, and we have videos; the third one (we have a bunch of them, but the third one) is a long, three-hour tutorial on all of IPython: the first hour is the general part of IPython, the second hour is the notebook, and the third hour is the parallel machinery in low-level detail. So rather than going over that here, I wanted to show you an integrated example; that material you can view there, and our documentation has the tutorial notebooks and an explanation of these APIs. I will put the notebooks for that tutorial on bSpace, so that you can get at the low-level parts, and needless to say, ask us.

But as you hopefully saw here, this is running a real-world example, and it isn't that hard. The take-home messages: one, if you can break your problem up into functions, you can parallelize it very easily with IPython. Two, IPython is not good at transferring large amounts of data between your nodes; that is an important point to keep in mind. When you use that bracket syntax to send and retrieve things, it doesn't matter what you put in there, but if you start putting gigabytes of data in there you will kill the performance of the system. The data transfer facilities are meant for transmitting small amounts of information: values of parameters, names of files, things like that. If you need to move a large amount of data between your nodes, you're better off using a shared file system or something like MPI, which is optimized for high-speed transfer of low-level objects. We have optimized the transfer of pure NumPy arrays to be as fast as we can make it, but in general the flow you want to keep in mind is: you break up your problem, you use IPython to do the orchestration and the transfer of parameters, and then you let the engines do the heavy lifting. Don't try to use it to distribute large amounts of data between engines, because that will be very slow. Any other questions?

Yes, they already do all see the same files; with StarCluster it's not only easy, it's trivial. So let's have a quick exercise before I shut down to check exactly that. How would we do that here? I'm going to write a function that does ls. If I say import os, then os.listdir gives me a listing of the local files; if I sort that and look at, say, the first four, those are the first four files in this directory. Now I want to run this on all my nodes to see whether they all see the same thing, so I make this a function, and that's it; now I make a direct view, a view for direct execution, and I say direct view dot map, and I'll do it synchronously, because I don't care... ls... What did I do wrong? Oh, yes.
No, because I don't need to; there's a decorator for that. I'm sorry, the other one was a load-balanced view; in this case I'm just going to use a direct view, because I'm going to send it to all the engines. So here I'm calling this ls function over a range, which is a slightly hackish way to do it; give me a second, and while Paul sets up I'll look up a slightly cleaner solution. I'm calling ls over a range of the length of all my engines, so that one call goes to each engine, and you can see that all of them report the exact same files in ls: they're all seeing the exact same file system.

And that's courtesy of StarCluster. Part of the startup we saw here was StarCluster configuring NFS for us and making sure that all of the engines and the master node were configured as part of the same NFS environment, so that they all see the same file system. And if you had EBS volumes mounted, they would also be mounted on the master and exported over NFS to the clients. So all you have to say is: this is my EBS volume with my data set, make it available, and StarCluster does all the configuration for you. It's a fair amount of Unix sysadmin rolled up in one single command. Any other questions? OK, so while Paul sets up I'm going to look up the syntax for a slightly cleaner solution to that last one. I'll just pull this out so it's not in your way. Do you need the pointer? Sure, you can have it if you want; thank you, and the mic. Yes, good catch. It's for posterity, right.

All right, hey guys, give me a sec to set up; this doesn't always work the way I want it to the first time. Sorry, yes: the method I was looking for was not map, it was apply. I forgot there's also an apply method, just like there's a map; since all I wanted to do was call the ls function on all of the engines, if I'd simply said apply ls, it would have done it.

All right, so, hey guys, I'm going to talk a little bit about GPU programming. This will be a high-level talk. I gave a talk on this the last time Josh taught this class, which was fall of 2010, and it was presented on November 1st, so it has a little bit of a Halloween-y theme. Last time I had an hour and a half, and I actually thought I did a bad job on some of the early, high-level things, so I expanded those and tried to shorten the end; but I don't think I shortened it enough, so we may not get through all of it, but the high-level things will be there. I'm mostly just a user of this technology; this isn't something I'm doing active research on, and I want to make this a folksy overview, so that you can walk away with an idea in your head of what GPU programming is, and then once you dive into the details you'll learn the details. I do say "GPU programming from Python", and there's literally just one slide with Python code on it, and we might not even get to that.
So the "from Python" portion is sort of parenthetical. OK, so as I said, it'll have a Halloween-y theme. By way of analogy for GPUs solving some computational problem, we're going to talk about having a candy-eating contest. So who do you think would win an eating contest: Fat Bastard, or a toddler? (This happens to be my daughter on Halloween.) Of course, one toddler versus Fat Bastard isn't really a fair contest; Fat Bastard would just eat everything way quicker than a toddler. But to be fair to the youth of tomorrow, what if we had a bunch of kids? Here I'm showing a handful of kids, but really, and this will translate to the GPU world, you can think of this as Fat Bastard having to compete with a whole school of kids.

So things get a little more interesting: who would win then? Any takers? The kids? Well, actually it's a bit of a trick question: it depends on just how much candy there is to eat. If it's just one bucket full of candy, the kids will spend their time trying to divvy it up, and it'll take them an hour before they settle on how much is fair for each kid; you have a communication problem. Whereas Fat Bastard would just load it all in and munch away. But if it's a full truckload of candy, then Fat Bastard doesn't stand a chance, because as long as the kids are organized in their approach, they distribute all the work: maybe a crate full of candy goes to this group of kids and another crate to another group, and they're all munching away. Any given kid is not eating faster than Fat Bastard; you just can't compete with Fat Bastard. (Yes, Fat Bastard could eat the kids; I didn't stretch the analogy that far. These aren't babies we're talking about, these are toddlers, and he was talking about babies getting in his belly, so I think the toddlers are safe. And the toddlers are pretty fierce.)

So this analogy, as I said, applies: Fat Bastard is the CPU, in a way, and you can think of the GPU as a whole school of toddlers ready to eat your candy. But remember, they're not full-grown adults: they're a little less sophisticated, a little more naive, and they're not as fast, since GPU clock speeds aren't as high as CPU clock speeds. Their power is in numbers. They do need supervision, they need to be directed, and they really work best in unison, when they're all doing the same thing: the teacher says "now we chew, chew, chew" and they can do that. Anything outside of that is where you start to lose your performance; if every kid is doing their own activity, they're not actually going to get things done faster than Fat Bastard.
Well, GPUs can take over the parallelizable portions of your code and do them efficiently. You don't get to beat Amdahl's law, as Fernando talked about earlier, but they help, provided you have enough parallelism in your code that can be taken advantage of. What we're seeing here is a plot, for the two graphics-card vendors, of the number of floating-point operations (32-bit precision) they're able to do, over time. And here, the slide is a little outdated, but here's the CPU world: even though each individual CPU is running at a much higher clock speed than the GPU, because there are so many GPU compute units working together, their total throughput is much greater than CPU throughput, and this trend continues. There's further multi-core-ization of the CPUs too: you have quad cores, eight cores, sixteen cores on the CPU side as well. So this continues to grow; or rather, it continues to be the case that in order to get more work done you just have to do more in parallel, because that's the only place left to get speedups, given the rocket-nozzle heat issues that Fernando talked about. Let me watch my time here.

OK, so as an overview of GPU computing: CPUs make a single program run very fast. There's a lot of architecture in place on the CPU to bridge the gap between the super-fast CPU doing the actual computation, the memory (several orders of magnitude slower) where the things your program works on reside, and the hard disk (several orders of magnitude slower still) where your data may live. Bridging that whole gap is where a lot of the architecture, a lot of the iron, on the CPU is dedicated: to making that particular set of problems run faster. On the GPU side, what matters is throughput: not a single thread, but thousands of threads running in parallel, in lockstep, each working on a slightly different subset of the problem you're interested in. That's the strength of GPU computing. It's about how much time it takes to do all of the work, not how much time it takes to do a single unit of work.

So CPU-style cores are laid out something like this, and remember, CPUs are like Fat Bastard: you might not think of these adjectives as applied to Fat Bastard, but they're sophisticated, they're complex. You have fancy branch prediction, you have lots of the core dedicated to caching things so they're there when you need to look them up again, or to prefetching things you might use in the future: a lot of guesswork, a lot of magic that happens that the programmer doesn't end up needing to care about. But the kids are more simple-minded, so we're going to strip all of that away. Oh, and I should mention: you're seeing slide credits here because this talk is a Frankenstein of a bunch of talks from different people, mostly from Andreas Klöckner, who was kind enough to give me his slides. He is the lead developer of PyCUDA and PyOpenCL, both Python wrappers, for CUDA and OpenCL respectively. So, slimming down one of these cores:
we're going to get rid of a lot of what I've called the sophisticated components and just make one thing run fast, and then we're going to replicate it: make two of them, then four, and keep doubling until we have sixteen independent instruction streams (though in reality they're not really independent). Furthermore, once I show the next slide, you can think of one of these units as being one classroom, and each ALU as a kid in the classroom. So there are eight kids to a class here, which is an awesome student-teacher ratio, and each kid has his own little context, which is his desk: the place where he's going to sit and the things he's going to do. And since they're all in the same classroom, they also have some shared context: there are things available to all of them, they can look around the room, they can hear the teacher broadcast things to them, and they can talk amongst themselves, synchronize, and get things done together. The little acronym here is SIMD: single instruction, multiple data. The idea is that the teacher gives all the kids one instruction, the same instruction, but because each kid has his own datum that he's working with, lots of work happens in parallel. So we have multiple ALUs, but only one unit that issues the instructions.

Zooming back out to our view of the full school: we have sixteen classrooms here, and each classroom has eight kids, so really we have 128 kids at the school, 128 things happening in parallel, in sixteen independent groups (the classrooms), each with eight synchronized streams (the kids). Which is great: these kids can do some serious eating, so long as they're all doing the same thing.

So what happens on the GPU when they don't do the same thing? Well, remember, the kids work best in unison, so if they're not doing the same thing, we have divergent streams, and we still only have one teacher telling everybody what to do. Let me show that example... that's not coming out very well here. All right, well, this is some code that's not showing up, but there's an if statement, and it says: if x is less than zero, do one thing; else, do the other thing. So suppose (my apologies, this is Berkeley and I'm briefly going to perpetuate the gender binary) the teacher wants to do something based on the presence of the X chromosome. If x is greater than zero, maybe that's the girls: there are three girls in the class and they're going to do something, and otherwise the boys are going to wait; they're not going to do anything while the girls do whatever they were instructed to do. And then when the girls are done, the teacher switches to the boys and says: OK, boys, now it's your turn to do something else, and the boys do their thing. So this is how branching works on the GPU, because there's so much up-front investment in everybody doing the same thing; when they're not doing the same thing, that's what happens.
Yikes, wow, there's some heavy bass intruding into my talk here... frightening... OK, it stopped, good. All right, so then the boys went and did their part. But we still have some problems, right? All the things that we've gotten rid of, the things that exist on the CPU side but not on the GPU side, were there for a reason. They were there to bridge the latency gap between the ultra-fast ALUs and the pretty slow random memory access; that's what caches are for. There's also the fancy branch prediction: the CPU tries to be clairvoyant, a precog, and guess the outcome of some computation, and then it continues, starting to do the things it would have done had that outcome happened; and it has the ability to drop all that work if the thing it was pivoting on didn't end up happening, or, if it did happen, then it's already halfway done with all the following work. And out-of-order execution is the same kind of thing: the CPU can reorder independent steps while it's waiting for memory or waiting for other things to happen.

So how can we get around this problem on the GPU? Well, we can stretch our kids analogy even further; the idea is that even more parallelism, plus some extra memory, gets us a solution. As we described it, our classroom had just one set of desks. But suppose we had several sets of desks within a classroom, and the set of desks defines what sort of activity we're going to do: maybe set one is the desks where we eat our candy, set two is where we do some painting, and set three is where we go when we sing. Each of these sets of desks is a context for a particular kind of activity. And what's going to happen is this: because we have gotten rid of the prefetch units and a lot of the cache (although I should say that the newest-generation GPU cards are becoming more CPU-like, they are getting some cache, but in the older generations and in the general case they don't have cache), here for task one we're at desk set number one, eating our candy. And that's all fine and well, but once we finish our candy we're stalled: we ate all the candy we had, and we've called in the next crate of candy, but it's going to take a while before it comes over from the truck, it's going to take a while before it ships in. So we're stalled. And what are we going to do? We're going to keep ourselves busy: while we wait for the candy to come in, we switch to our second set of desks and to the second task that we have running concurrently, and we do some painting. And at some point we finish a painting and we have to wait for it to dry, or we have to wait for more paint to come in because we've run out, and so
now task number one is stalled, waiting on candy; we just completed some painting, but maybe we ran out of paint, so we've asked the office to bring us more, and we stall here too, and we keep going. Eventually, as long as we have enough activities to rotate through (this is time going downwards), we'll get back to a point where we can resume our candy eating: eventually a crate comes in, we're ready to eat again, and we switch back to that activity. So provided we have enough of these activities to interleave, we can always stay busy; our kids can always be doing something, and because there are so many of them, we're going to get a lot more work done. And while waiting for our drawings to dry, maybe we'll do some singing, jump around, get some energy out. Eventually we will finish everything. The total time it takes to complete any one task may look long, but we compensate by doing many, many things at the same time, and we only wait in between while we're waiting for memory to arrive.

So the core ideas in the GPU architecture are these: many slimmed-down cores, so there's lots of parallelism, with more ALUs and fewer control units; and the way we avoid memory stalls is by interleaving the execution of these single-instruction-multiple-data groups. It's the teacher telling everyone what to do, and when the teacher senses that we're out of candy, we switch activities and do something else. But, importantly, we need to have exposed enough tasks that we want the kids to do in the first place; as long as we have enough to do, we can keep the kids busy in this manner.

OK, so here's another way of visualizing what's going on. On the CPU side, this is the portion of the die that's doing the arithmetic for us, and you see the control unit is fairly sophisticated and there's a lot of die dedicated to cache. On the GPU side, you have a lot more of the compute units (the ALUs, in green), and the control units and the cache, if any, are very small. And here's a real picture, also somewhat dated: this particular CPU can do four single-precision operations at a time, whereas this AMD GPU, because it has all these SIMD cores, can do 800 single-precision floating-point operations at a time. Again, a lot of the CPU die is dedicated to cache (data cache, instruction cache, a level-two cache), and only this part is where the floating-point and SIMD portion of the die is, whereas the bulk of the die on the GPU is dedicated to compute.

So there are some benefits and some disadvantages to using GPUs. The benefits are the memory bandwidth: a lot of parallel access to memory can be made efficient, because remember that when the teacher calls in a particular set of memory, that memory becomes available to all the kids, so they can all share it amongst each other.
Um, a lot of the memory and the the the way that the cards were Manufactured it was in a way to to make memory access efficiently. Um That they can do compute The sort of the throughput the compute throughput is greater, but there are some losses, you know, you know You don't get anything for free. So um in particular, uh one thing to note is that there's no performance portability What that means to me is that uh, what's optimal on your card may not be optimal on my card So really a lot of these optimizations that you end up doing they may be sort of card specific The general trends will be there, but but for your for a specific configuration of a card and um It'll it'll be sort of less efficient than then another, um Maybe a next generation card will do Something that used to be an efficient will be way more efficient something that used to be the most efficient way May not be the most efficient anymore Um, then the data size also can affect sort of the algorithm design and and vice versa There's a straight off because you're really you end up, uh, one of the things that we gave up That in a few earlier slides is that we end up coding sort of at the low level We end up synchronizing our Are the the threads that within our classroom together we end up writing exactly what memory gets fetched and when And where that gets stored and we also have to keep track of how much memory how much totally memory is available locally So you end up when you do gpu programming you can get sort of an order of uh, sort of uh, like a 10 times Sort of trivially For for many applications just by doing you know doing the same thing in parallel. Um But to really gain the the 50 to 100 to 200 times speed up You end up having to know a lot about the arc the specific architecture that you're targeting and how the memory is laid out And what its limitations are And i'm saying this not to sort of not to sort of scare you but to Make you aware of what where it is that you end up having to spend a lot of the time But there are solutions and the solutions is because these things end up getting so complex you end up sort of Trying everything you end up trying to write your um write your algorithm and maybe a specific block size That's not hard coded But that that you're going to vary as a parameter and a specific sort of memory layout that's not hard coded But that will also be sort of Plugable and that's that's the power that sort of as um as we get to the tail end to the python side of it That's where that will come in um Okay So um open cl is this open computing language, uh, there's uh, sort of the predecessor to it is nvidia's kuda um, uh, opens, uh, nvidia's kuda is an nvidia specific, uh A general purpose gpu programming language, whereas open cl is a consortium of of many, um Companies including nvidia and so um open cl is sort of the the the the new kid on the block But it is the the more widely supported kid because nvidia in all of their cards supports open cl But open cl is also supported by the amd cards and there's also um cpu backends for open cl And what it is open cl defines the uh programming interface a library that you're going to use to write your code And the device side programming language that's gonna that's gonna sort of implement the specifics that the library is going to expose And it comes with What i started to allude to this real-time code generation RTCG aspect where you're going to be able to Define your problem in such a way that you because you don't know exactly which Architecture 
you're targeting, or because you want to squeeze the most performance out of a specific architecture, you try lots of different things, benchmark them, and see which one comes out the fastest for you, for the particular card you're on. And this is just an OpenCL slide to convince you that a lot of the big names are behind this; it isn't just a thing that's going to go away. It's put together by the Khronos Group, who are the organizers and keepers of the OpenCL standard; they're also the ones that maintain the OpenGL standard and the WebGL standard, among others.

So the execution model for OpenCL is this: work items are the things that get taken care of by an individual ALU, by an individual student. The work group is the thing that gets taken care of by an individual classroom. And multiple work groups are laid out in a grid, for when you have lots of data: you can think of the grid as the whole truckload of candy you have to eat, and it gets mapped out to different classrooms. The CUDA words for this are similar: instead of an item it would be a thread, running within some thread block (which is the work group), and the word for grid is the same. So the grid is all the work we have to do for a particular activity. And this is just going back to what we've done here: we have our ALUs, they have some context, some of which is shared and some of which is private (their own desks), and while we're waiting for memory we work on something else.

So when we have multiple classrooms, do we actually end up caring about how many cores we have, how many classrooms we have? It turns out that when we define what it means to work on our problem, all we need to do is define how a single classroom is going to solve it; then however many classrooms we have available, that's how many will work on it. So who cares how many cores we have: we program as if we had infinitely many cores, as if there were infinitely many ALUs, because if we have four classrooms they'll do the same thing as forty, the forty will just finish faster. And think about which is easier: taking a parallel program and running it on sequential hardware, which is sort of a bummer because you don't get to squeeze any parallelism out of it, or taking a sequential program and running it on parallel hardware? The latter just does not compute; that's a very, very hard problem, whereas the former is just a slow way of getting the solution.

So, back to the software representation side of things. This is a grid of work, and how does it map to our classrooms, to our hardware? In GPU talk we say we have a kernel, which is
something that we're going to do to the entire grid: we're going to perform some function, take some data, and crunch some numbers on it. Specifically, we take each of these work groups, each block of the grid, and map it onto one of our classrooms: that one gets mapped there, these four get mapped there, and so on. And so you can see that if your GPU card has, say, sixteen classroom-equivalents and mine has fifty, mine will just get done faster, but all we've had to do is teach it how to do one of these blocks, what it means to do the computation on one block. Because the only synchronization you have is within a block: you don't have an explicit, immediate way of getting all the classrooms working together. You'd have to run a kernel on the data and wait for the kernel to finish on the whole grid before you can do the next computation on it. Maybe the hardware can only work a few blocks at a time, and it just works through them.

So really, a group provides a pool of parallelism to draw on, and the order within a group matters: within a group, it can matter that this item finished before that item, because maybe this item depends on things before it, and we can ensure that's the case because we have classroom-level control. But when we map a problem onto the full grid, we don't have a way of ensuring that this block finishes before that block. If something in this block depends on that block, this programming model doesn't expose that; we would have to first run this block, then that block, and if that were the case for all the blocks, maybe you're not going to get any kind of speedup over the CPU. If it's some iterative thing where there aren't enough computations to do in parallel to begin with, you're not going to get a speedup. So order does not matter among groups: it could be that this block actually got started before that block; you only know when they've all finished, and that's the sort of thing you're guaranteed, that's what you have access to.

OK, so here's some OpenCL code. OpenCL is effectively like CUDA: it's an extension to the C language, it's a library, and it requires some boilerplate code; this happens to be four slides' worth of boilerplate. I'm putting it up just so you see what it looks like, and also to show you what PyOpenCL gets you, because it takes you from these four slides down to one. So this is setting up some memory, and the actual business of this whole kernel is just right here, on the first two lines, where it gets some data, a pointer to float a, and, based on its global ID, it multiplies whatever is there by two. You can think of a global ID as a way of defining a unique spot in that grid, computed from the work group (which classroom you're in) and from where you sit within the classroom. And then you just multiply whatever is there already by two. That's it for this compute kernel; it's a times-two
It's a times-two compute kernel; it goes through all the data. This is how we define it, and when we actually run it, it gets mapped across a large array of data. Then we check for errors, make sure we release the context, and so on. All this boilerplate is fine, it's just a bit tedious, and you're going to ask: do I really have to start using makefiles and get bogged down in C and its compile cycle? No, because who are you going to call? You're going to call Andreas Klöckner, who ain't afraid of no code, because Andreas Klöckner wrote PyCUDA and PyOpenCL so you wouldn't have to do this.

Those four pages of boilerplate, which aren't even that bad here, they're actually more tedious in the predecessor, in CUDA, end up being this one slide, where some of it is whitespace and some of it is NumPy. But effectively, here is that same kernel I highlighted, our multiply-by-two.

And like numexpr, the code that you pass to OpenCL, to PyOpenCL rather, ends up being a string, and that's actually a strength: it gives you another entry point into runtime code generation. Because it's a string, we have lots of ways of manipulating it in Python. There are lots of templating engines out there that the web folks built to generate web pages really fast, and now we can use them to generate whole classes of compute kernels, trying different ideas in different orders and to varying degrees, as templates. And we don't have to worry about the makefile, because all we do is call build here, and then the kernel is available to us as an object we can call. That's what PyOpenCL gets us.

So let me run through these slides. We're going to do scripting for GPUs. Scripting languages like Python aren't very fast, but it turns out the CPU just takes a back seat in these GPU computations; it isn't doing a lot of the work anyway. And it's a lot easier to expose things as Python objects, as you all know; the Python API lets you provide nice ways of manipulating them, and then when it comes to the compute, we let the graphics card do its thing. I have just about two minutes left. PyOpenCL, as I said, is an OpenCL wrapper in Python, and it's mature.
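Here is roughly what that one-slide PyOpenCL version looks like (a minimal sketch, not the exact slide; the array size and kernel name are arbitrary, and it assumes a working OpenCL platform is installed):

```python
import numpy as np
import pyopencl as cl

a = np.random.rand(50000).astype(np.float32)

ctx = cl.create_some_context()              # pick an OpenCL device
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)

# The kernel is just a string of OpenCL C; build() compiles it for the device.
prg = cl.Program(ctx, """
    __kernel void twice(__global float *a)
    {
        int gid = get_global_id(0);
        a[gid] = 2.0f * a[gid];
    }
""").build()

prg.twice(queue, a.shape, None, a_buf)      # global size = a.shape, local size left to the runtime

result = np.empty_like(a)
cl.enqueue_copy(queue, result, a_buf)       # copy the device buffer back to the host
assert np.allclose(result, 2 * a)
```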
It comes with a lot already built, and this is getting to the practical side of things: you don't have to write your own kernels for everything. Things like element-wise operations are already built for you, random number generators are there, and some of the more traditional computational objects from the GPU-programming world are there too. But it does expose this really nice interface where you pass in just the code that does the kernel and all the boilerplate is taken care of for you, and it integrates with NumPy.

As I said, it gets you out of that loop: you don't have to worry about compiling and linking, you just edit and run your code. And some of the editing, because of this templating-engine idea, can happen within the program itself: you can generate many, many different instances of a compute kernel and then just run them and try them all. That's what PyCUDA provides for you, and it also happens to be what PyOpenCL provides for you.

There are many slides in here about this idea of runtime code generation; I'm just going to run through it quickly. I've described it at a high level: you are in the loop writing the Python code, and all the GPU code and the rest of that machinery isn't interesting to you; you'd rather stay out of it and let PyCUDA and PyOpenCL handle it for you.

And this auto-tuning idea isn't new. The Automatically Tuned Linear Algebra Software, ATLAS, already does this, and FFTW does it too: they generate candidate implementations and try all the different combinations and splits to figure out which one will be optimal for your CPU. If you've ever had to build ATLAS on your computer, you know this takes forever, precisely because it tries all of those variants. It's the same idea here; it's just that the facilities are there up front, for the taking, if you do GPU computing with either PyCUDA or PyOpenCL. So you can try things very quickly. I'm not a particularly sophisticated programmer, and I was able to get on the order of 50 to 100 times speed-up on a particular problem I was interested in, where I had been stuck in the make-and-compile cycle before PyOpenCL and PyCUDA came around. So that's a testimonial.

And this is what a template might look like, where the twice kernel now becomes a blocked thing: you have parameters that get passed in, a block size and a thread-block size, and some conditional unrolling. This is all Jinja-specific templating code, but when the template gets rendered and you pass it the parameters, this loop gets unrolled, and depending on what the block size was you'll get that many statements, each with the statement-specific values substituted in. So you can render a template many, many times, trying out many different configurations, and trying out the templates is now just a for loop, or a set of nested for loops, inside your Python code. Then the resulting source module is available and you can use it and run it.
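As a rough sketch of that templating idea (assuming Jinja2; the kernel, the parameter name block_size, and the candidate values are all just illustrative), trying the variants really is just a for loop:

```python
from jinja2 import Template
import pyopencl as cl

# A "twice" kernel whose inner loop is unrolled block_size times at render time;
# each work-item then handles block_size consecutive elements.
kernel_tpl = Template("""
__kernel void twice(__global float *a)
{
    int base = get_global_id(0) * {{ block_size }};
    {% for i in range(block_size) %}
    a[base + {{ i }}] = 2.0f * a[base + {{ i }}];
    {% endfor %}
}
""")

ctx = cl.create_some_context()

programs = {}
for block_size in (1, 2, 4, 8):
    src = kernel_tpl.render(block_size=block_size)       # generate one concrete variant
    programs[block_size] = cl.Program(ctx, src).build()  # compile it for the device
# ... launch and time each programs[block_size] on real data
# (global size = n // block_size for this variant) and keep the fastest one.
```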
So that's the last slide. This idea of code generation, by the way, is most typically done in the context of GPUs, but you can do the same thing with CPUs. Cython has a function called inline that lets you compile and run a snippet of C-like code, calling into C or C++, right within the current function, and I've done that in the past with C++: the same idea of generating code that is matched to the parameters of my problem and running it at the creation time of my object, so that I get maximum performance while still working in Python. (There's a short illustrative sketch of this at the very end.) So in general, for performance, keep in the back of your head that the compiler, whether it's a C compiler or a GPU compiler, is another runtime tool you can use to generate code adapted to the specifics of your problem.

So thanks, guys. I'm just having trouble with the acknowledgement slide; will you send it to Chris? Yes.
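To make that closing point concrete, here is a minimal sketch of runtime code generation on the CPU with cython.inline (it assumes Cython and a C compiler are installed; make_scaler and the toy computation are purely illustrative):

```python
import cython

def make_scaler(factor):
    # Bake a problem parameter (the scale factor) directly into the generated
    # source; cython.inline compiles it on first use and caches the compiled
    # module, so only the first call pays the compile cost.
    code = "return [%r * x for x in a]" % float(factor)
    def scale(a):
        return cython.inline(code, a=a)
    return scale

twice = make_scaler(2.0)
print(twice([1.0, 2.0, 3.0]))   # -> [2.0, 4.0, 6.0]
```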