Okay, so again, we're going to do some runs just to show you how to compile and how to run through our queuing system, and then I'm going to talk about some more advanced things: what parallel computing is, things like that. I'll start out by introducing myself again. I'm Dave Turner, the application scientist here. Kyle Hutson and Adam Tygart are our system administrators. Dan Andresen is not here; he's the director of the Beocat group. So I'm going to back up and start by showing you an overview of what Beocat is. I know Kyle talked about some of this on Monday, but I want to reiterate it because it gives you a big picture of what Beocat is and what its capabilities are. We're running Gentoo Linux — that's just the flavor of Linux we run — and a Sun Grid Engine queuing system, so I'll refer to it as SGE or the queuing system. You log into one of the head nodes, Selene or Eos, then you use the SGE queuing system to actually run your jobs on the compute nodes. So SGE is your link to doing the actual computations. We currently have 3252 Intel Xeon compute cores in 145 nodes. 54 of them are the newest hero nodes. These have the Haswell processor; each has 24 cores and somewhere between 128 and 512 gigabytes of memory. They have 40 gigabit per second Ethernet, but we're running what's called RoCE, low-latency RDMA over Converged Ethernet, on top of that, so you get 40 gigabits per second with a latency of about one and a half microseconds. Latency means that even if you send a very small packet, it takes a certain minimum amount of time to send that information. So for small packets, small messages, your communication time depends entirely on the latency, the overhead to handshake. For large messages, your transmission time depends entirely on the bandwidth, how much data you can push over that communication link. We have 85 elves. These are a few years old, but still very, very good machines. They have either 16 or 20 cores, and a lot of them have 64 gigabytes of memory; we have a few that get as large as 384 gigabytes, though. The networking on these is QDR InfiniBand. InfiniBand is a custom network made for cluster computing, and again it's very fast, about 30 gigabits per second, with the same low latency of about one and a half microseconds. We have six mages that are older but still very useful, especially for large runs: 80 cores per mage and 1000 gigabytes — that's one terabyte — of memory. So if you need to run really large jobs, the mages are ideal. They're quite a bit slower, but you have all the memory there that you need. If you need speed and large memory, then the heroes are a good place to run, but there are fewer of those that have 512 gigabytes. We are also resurrecting the old paladins; we're going to be putting some NVIDIA GPU cards in. I'm not going to cover programming with GPU cards at all, but just so you know they are there. We're also going to have Intel Phi cards in some of the heroes — Intel Phi coprocessor cards that plug into the PCI bus. These again are more difficult to program, so I won't say a lot about them either. All these machines are connected to our file server, which has about two petabytes, via either 10 gigabit per second or 40 gigabit per second Ethernet, so a fast interconnect. So that's just giving you an overview of Beocat. So hopefully you've all logged in.
If any of you have trouble getting in, raise your hand, and Adam and Kyle can help you with that. Again, make sure you copy all the slides from the Beocat workshop directory in my home directory to the tilde — the tilde means it copies into your home directory; that's what it means. Then if you cd into that directory, that'll get you to where we're going to do all our work. I've also put a copy of these slides in a subdirectory called Slides — I think I called them Beocat Slides — in a couple of forms; I'm sure at least PDF. So if I go too fast today and you're still trying to do some of the demonstrations while I'm continuing to talk, you can pop up that PDF and see the directions. We'll be submitting other jobs, so let's skip past this. Okay, so Kyle covered using kstat, and I just wanted to reiterate some of this because it's the best place to get a good, quick picture of what's going on on the nodes. The SGE tools have qstat and qacct to see what's currently running and what has finished; kstat puts that all in one place and presents it in a little bit better form. It colorizes things to tell you if there are warnings: it starts by warning you in yellow and then slowly progresses to red, background red, flashing red. If you see something on your job that's flashing red, that's something that needs your attention, and if you don't understand what's wrong, you need to communicate with us. Every day I'll look at this, and I'll email people if I see anything wrong. But please feel free to play around with it. You can start by doing kstat -help, and that'll give you a lot more information than what I wrote here. The bottom-most one is also very important — it's probably the one you'll use most often. kstat -z gives you just a synopsis of what your jobs are doing, so it gets rid of all the other jobs on the nodes you're running on, and even in the queuing system it'll just show your jobs in the queue. It won't show you where you're at in the queue; if you want to know that, you have to look at the full queue, with kstat -q for example. Other options people don't use much: kstat -c is a good way of seeing what everyone is doing. If you do a kstat -c, it gives you a summary of who's using the cluster. Again, the color coding in this case just means that if you're over 10% of the cluster you get yellow, and if you're over 20% you get red. It doesn't mean anything's wrong. It just means this person, K, is using 27% of the machine. What is he doing? It just means he's relentless about getting nodes. So this is a good way of getting an overview: you can see how many nodes you're running on and how many you have queued compared to other people. I'll let you guys go ahead and play with kstat — you can do that while I'm giving the talk — and if you see any node that looks odd, ask me now or after the talk and I'd be happy to tell you what's going on there. I will tell you that the core count is inaccurate at this point. We made a little change to the setup of the queuing system, so you'll see some negative core counts in there, but the load level and everything else should be accurate.
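To recap the kstat options mentioned above (this is just a summary of what was said in the talk, not the full option list — the help flag shows everything):

    kstat -help    # much more information than I've written here
    kstat -z       # synopsis of just your own jobs, running and queued
    kstat -q       # the full queue, so you can see where your jobs sit in line
    kstat -c       # per-user summary of who is using how much of the cluster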
Ganglia is another thing that Kyle mentioned. I don't use it as much as I did when I first started here, because I use kstat, but if you want to see what has happened over a span of time, Ganglia is a good place to go. You can get charts that tell you what happened over the last day on a given machine. So if you're curious about how your job ran — how many threads it's running on, whether it's running on all the cores it should, whether it's doing disk I/O — Ganglia is a good place to look. Another thing you can do is look at a job with monitor node. To use monitor node, you have to actually have a job running. So if you want to play around with that, one thing you can do is start a job up. If you look in the qs.workshop file that I wrote up, there are a few simple examples; one is called hello, for example. You can set it up to run that and then do a sleep afterwards. The sleep just tells the core to wait a certain number of seconds, so if you do a sleep of 600, that'll wait 600 seconds, or 10 minutes, and then you can try out monitor node. What you do with monitor node is give it the host name that your job is on. So if you use kstat -z and see that your job's running on hero35, for example, then you can do a monitor node of hero35, and you can get in and poke around a little bit, just as if you had SSHed in. What's typically useful is to run htop. I'll give you an example here by just SSHing in. Let's pick out an interesting node and look at elf8. I'm going to SSH into elf8 because I don't have a job there — but only the three of us can SSH into nodes; you can't. You can only look at nodes where you have a job running, and you get in with monitor node. Once you're in, though, this is basically the same thing you would see: just as if you logged in, you can do an ls, and you can run htop, which shows you the processes that are running. Now, this is kind of messy, so the most common thing I do is hit the U key and go down to a certain user — we'll pick on Kohal today — and see what his job is doing. If you go down and look, these are just some of the setup scripts. Here's the bash script; here it's running a Perl script, and you can actually see what it's running: a Perl script called aarex.pl with 12 CPUs, and if you count them up, there are 12 of those subprocesses being used. So if you're ever wondering whether your job is actually running, this is a good way to check: use monitor node, get in, and look at what processes are running. Also look at the CPU bars. If your CPU bars are not at 100%, that can be a sign of a problem. If you ever see a status of D, that means the process is in a disk wait — you're doing lots of I/O — and we see some cases where, if you're doing I/O to the file server (because our file server currently has some issues), you can see it getting into a disk wait state. That's an indication that you'd be better off copying your files to local disk and running from there, and that's something I'll cover later. But actually getting into the nodes and seeing what the CPU bars look like and what processes are running is very useful, and I don't think a lot of our users are doing this right now. It can give you a lot of information. Usually when someone sends me an email saying "I'm not sure my job is running," one of the first things I do is SSH in.
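As a rough recipe for trying this yourself (I'm spelling the command monitor_node here, which may not be the exact name on Beocat — it's whatever the talk calls "monitor node" — and hero35 is just an example node):

    kstat -z              # find which node your job is running on, say hero35
    monitor_node hero35   # hop onto that node; only works while your job is there
    htop                  # press 'u', pick your username, and check the CPU bars

And to have a job around long enough to look at, you can add a sleep to the end of your qsub script — for example, './hello' followed by 'sleep 600' keeps the job alive for another ten minutes.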
Okay, so now I want to start talking about parallel computing, and I'll go through it kind of historically. I'm going to start out with scalar computers, and then at each stage I'm going to add one level of complexity and describe what happened historically. That way I don't dive straight into what our current parallel computers look like; by building it up one layer at a time, I think it's a little easier to understand what each layer is about. So in order to do that, I need to go back to the good old days when we dealt with scalar computers. A scalar computer, or a serial computer, is simply a computer that has one processor, or one processor core, and memory with the program and data sitting in that memory. So conceptually it's very easy to understand, and even from a programming point of view it's pretty easy to understand what's going on. I'm going to take the example of a vector addition, where you have a vector x plus a vector y equals a vector z, and I've drawn that out here where you have these arrays. With a scalar or serial computer, you just think of this as doing one thing at a time. In order to do this vector addition, you start by simply loading x0, then you load y0, pull them both up into the processor, then you do the add, and then you do a store into the location for z0. Then you repeat for each element: you take x1 and y1, add them, and store the result back down into z1. So this is pretty simple and easy to program, and it's actually how most people think about computers today, even though the way computers operate today is very different. Computers today do many things at the same time, and they can do many things in many different ways at the same time, but still most people who program think of codes in a scalar sense. So part of what I want to get across today is that, yes, you can start by thinking of it this way, but you also need to start thinking about things in a parallel nature if you're going to use high performance computers and understand at least a little bit about how to optimize and use them effectively. I don't expect you to understand everything about programming these systems, because most of you probably won't need to do the programming, but even if you're just using them, it helps to be exposed to these concepts. So I'm going to start by showing you a scalar program to do the vector addition. This is in the C programming language, and if you don't know C, that's all right — I don't expect you to understand all of it. I'll just point out what the program does conceptually, because as we look at different types of parallel computers, I'm going to take the same vector add and show you how it changes in the different environments. We're going to choose a vector length of a million. The first thing we do in the main program is a print statement saying I'm starting my vector add. Then we allocate space — one million times the size of an element, which is a double precision number — for each of the arrays x, y, and z, so we're giving them space. We initialize each of the vectors: x of i is just going to be i, and y of i is going to be i squared. Then this is the entire place where we do the work, where we do the vector addition. For those of you who haven't done programming, this is just a general loop. In the C language, this is your initialization — i equals zero at the beginning of the loop — and this is your test.
We're going to go as long as i is less than n, where n is the million, and at each step through the loop we increment the index i by one — that's what the plus plus is. So we start with i equals zero; we take x zero plus y zero, put it in z zero, hit the end bracket, come back around, and increment by one. We test: is one less than a million? Yes, so we do x one plus y one into z one, and we keep going until the test is no longer true — one million iterations. At the very end I have another loop. This one goes from i equals zero to 100, but the increment is by 10, so it does i equals zero, then 10, 20, et cetera. This just prints out the final result, what z is, so that we have some output. So the first thing I want you to do is actually compile this. If you're in the beocat_workshop subdirectory, I'm going to start you out with the Intel compiler. This is the Intel C compiler: icc, and then you give it the name of the code, vec_add.c. The .c means it's code in the C language. The -o is followed by the name of your executable; you can name that anything you want, and that's the name you'll use when you run the binary, the executable. So the compiler takes the source code and converts it into the binary code that the machine uses to do the actual running. On Beocat we have both the Intel compilers and the GNU compilers, which in this case would be gcc. In general I recommend the Intel compilers, just because we're running on Intel processors, but both do a good job. So for now, do the compilation, and then in order to run it, you do a dot slash and then the executable name. The dot slash just means to look in your current working directory — dot always means the current working directory. So if you do ./vec_add_icc, that should run the program. Yes, the slash is what chooses your directories or subdirectories. If you look at the screen, I'm currently in my home directory. If I want to do an ls of the Beocat workshop directory, I would type the directory name and then a slash — well, sorry, that doesn't work, I guess. Inside the Beocat workshop directory there's a subdirectory called slides, so this command looks in the Beocat workshop subdirectory and then the slides subdirectory inside it. The slash allows you to move between folders. So let me step through it here. I don't actually need the dot slash in my case, because I put dot in my path, so the answer does depend on how you set up your environment. And I've got a lot of stuff in there — I don't see offhand where the dot is. Yes, I've set up my path environment so I have dot in my path, so I don't actually need to put the dot slash, but most people will. I have a file called bashrc.mine that gets sourced from your .bashrc file, but you could put this in directly. This line here adds dot to your path as well as bin: export PATH equals dot, colon, and then I also add tilde slash bin, colon, onto your path. So you'd want to put that in your .bashrc file; it can go in .bash_profile also. Okay, so I went ahead and compiled vec_add_icc. Now if I run it, it does the vector addition: it starts by printing that out, it does the vector addition, and it prints out every 10th element up to 100, even though we're doing a million — I only wanted to print out 10 of them. Okay, so that's compiling with the ICC compiler.
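A minimal sketch of what vec_add.c does, reconstructed from the description above — the copy in the workshop directory may differ in details like the exact print statements:

    #include <stdio.h>
    #include <stdlib.h>

    #define VL 1000000                     /* vector length of one million */

    int main()
    {
       int i;
       double *x, *y, *z;

       printf("Starting the vector add\n");

       x = malloc( VL * sizeof(double) );  /* allocate space for each array */
       y = malloc( VL * sizeof(double) );
       z = malloc( VL * sizeof(double) );

       for( i = 0; i < VL; i++ ) {         /* initialize the vectors */
          x[i] = i;
          y[i] = (double) i * i;
       }

       for( i = 0; i < VL; i++ ) {         /* this loop does all the work */
          z[i] = x[i] + y[i];
       }

       for( i = 0; i < 100; i += 10 ) {    /* print every 10th element up to 100 */
          printf("z[%d] = %lf\n", i, z[i]);
       }

       free(x);  free(y);  free(z);
       return 0;
    }

Compile it and run it as described:

    icc vec_add.c -o vec_add_icc
    ./vec_add_icc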
The GCC compiler is the same. Again, this is the GNU C compiler; you use the same source code. I'm going to name the executable something different — you don't have to, but I'm just putting _gcc on the end. This time I'm putting time in front of it, just so I can see how long it takes. The time command can go in front of any executable or any command, and it gives you the real, user, and system times; in general, just look at the real time. This took about 42 milliseconds to run. I just wanted to introduce you to that. It's useful to use it in your qsub scripts so that you know where the time is being spent. Other people put a date command in front and behind; the date command just tells you the date, and then you can subtract between them. But just putting time in front of your executable is an easy way of seeing how much time is spent in it, and if you're copying files in or out, you can put a time in front of that as well. The other thing you can do is edit the qs.workshop script. This is a qsub script to run on one core and then run this executable, so you can run it on one of the nodes. Let's look at that real fast. Again, this is mostly stuff Kyle went over on Monday. Here we're choosing an environment that's a single node, one core on that node. What I have in there is a hostname and an echo of hello world, but we actually want to run our two programs, so I'm going to take these two lines. I'm editing this in vi; you can use whatever editor you want — Kyle showed you a little bit about nano. Everything above the exit line gets run, so it's going to print out the hostname, then time the vec_add_icc, then time the vec_add_gcc, and I echo out some comments beforehand. Now I'm going to qsub it, and I don't have to put any arguments on the command line because I've put them all in the qs.workshop file. Now we wait. So now you see it's running on elf81, it looks like. Where does it say that? Say that again — where did it say the host name? The hostname. So there, it's complete. Now if we look at the directory, there are two new files: workshop.o followed by the job number, and workshop.po; the .po one is always empty. So this is our output. It dumps out the hostname, elf81, then it says it's running the ICC version — that's actually printed by the script itself, from the echo command I put in there. Then this is printed by vec_add_icc, and this is the time it took, 16 milliseconds. Then again, this is from the echo command in the qsub script, this is from the executable itself, and this is the time, 15 milliseconds in that case. So this is the output from the run. Now, they're the same speed, but this is not an accurate enough timing to judge — it's doing a million things, but a million things that are not much work. To get a more accurate time we would have to give it more work, and on simple tasks like this, both compilers should give you pretty optimal code. The Intel compiler — again, I encourage you to use it more often, partly because it also has better optimized libraries. The Intel Math Kernel Library has its own optimized versions of the BLAS library, LAPACK, ScaLAPACK, FFTs — all those libraries wrapped in one — so it can be easier to use from that point of view too. We're currently on version 15; 16 is the state of the art, and we're working on getting it upgraded.
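For reference, the edited qs.workshop script looks roughly like this — a sketch only, since the actual resource requests and parallel-environment name in the workshop file may differ (the "-pe single 1" line here is just a placeholder for however the single-core environment is spelled on Beocat):

    #!/bin/bash
    #$ -N workshop          # job name, so output goes to workshop.o<jobid>
    #$ -cwd                 # run from the current working directory
    #$ -pe single 1         # one core on a single node (placeholder PE name)

    hostname

    echo "Running the icc version"
    time ./vec_add_icc

    echo "Running the gcc version"
    time ./vec_add_gcc

    exit 0

Then submit it with 'qsub qs.workshop' and look for the workshop.o<jobid> output file when it finishes.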
Okay, so this was just showing you what a scalar computer is like, and I wanted to get you some hands-on experience with basic compiling and submitting a few scripts. Later on I'll show you how to do some installations of real software. It's important that you understand how to compile even a single piece of code before you work with makefiles that compile large numbers of smaller C files or source files. Okay, so after scalar computers, Seymour Cray came out with a new technology called vector computing. Vector computers, instead of using silicon chips, used gallium arsenide chips — a whole new set of technology — and the biggest thing conceptually is that a vector computer, instead of working on one thing at a time, in this case worked on 64 elements at the same time. So we're going to go over the same vector add, but instead of doing things one at a time, it's 64 at a time. This is a picture with a larger amount of memory, and I drew the memory bus up to the processor as being much wider, because instead of pulling one array element up at a time, you're pulling things up 64 at a time. You pull 64 elements of x up, you pull 64 elements of y up, then you add 64 things at the same time, and then you put 64 elements of z back down into memory. That's the general concept behind vector computing. Then in the second step you go to the next 64, and then the next 64. So it's not a big jump conceptually from a scalar computer; it's still fairly easy to understand. What the compilers do when they vectorize is look at the innermost loop of your code. In our vector add we only have one loop, so it's just looking at that one loop, but if you have a double or triple loop, it analyzes the innermost part of the loops and tries to vectorize that, to see if it can do the same thing 64 at a time. There are cases where it can't — cases where some elements depend on a calculation from a previous iteration of the loop, and it just can't vectorize. But potentially you get a 64 times speedup with these machines. These are some pictures of the Cray vector machines. They were sometimes referred to as love seats, because some of the computer is up here and some of it is down here, and these are actually padded seats that you can sit on, kind of like a love seat. So, real expensive — tens of millions of dollars — but these were the state of the art then. Your phone is faster than these now; your phone will do a lot more computations than what tens of millions of dollars worth of hardware could do before. Now, when you're doing things 64 at a time, it's important that every loop is vectorized. This is true with parallel computing in general. Let me take an example application. Suppose the application has three loops, the first one taking 30 seconds on a scalar machine, the second taking 20 seconds, and the third taking 50 seconds. Let's say the first and second loops vectorize but the third loop doesn't. That means the time taken for the first loop is going to be 30 seconds divided by 64 — we get a 64 times speedup — and the second loop is going to be 20 divided by 64, but the third loop still has that 50 seconds in it. So the total time goes from 100 seconds down to about 50 seconds. Well, that's doubling the performance of your code, but we would like to see closer to a 64 times increase.
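To put numbers on that: 30/64 + 20/64 + 50 is roughly 0.5 + 0.3 + 50, or about 50.8 seconds, so the overall speedup is 100/50.8 — just under 2x — even though two of the three loops got the full 64x boost. If all three loops had vectorized, you'd be at 100/64, about 1.6 seconds.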
So what I'm trying to get across here is that in order to get your code to vectorize or parallelize and get a good speedup, every part of your code needs to be vectorized. It's that last little bit that has trouble that will cost you in the long run, and that's one of the things that makes vectorization, or parallelization, difficult. Again, if all of the loops are vectorized, then you get a 64 times speedup — or close to it; there are other reasons why it may not be a full 64, but maybe you get a 60 times speedup, and that would be pretty good in most cases. At least you're getting the majority of your potential speedup. Okay, so the main reasons loops don't vectorize: if you have a print statement in there, it won't vectorize. Conditional statements make it hard to vectorize, because you're doing 64 of the same thing each time; if you have a conditional in your loop, you're really saying if this happens, go through this instruction set, otherwise go through this other instruction set, but you really need to be doing the same thing at each step, so the only conditionals it can handle are the simplest ones. Conditional branches are tough. And in some cases, each time you go through the loop you're building on what happened in the previous iteration, and that won't vectorize at all, because each iteration depends on the previous one; you can't do them all at the same time, you have to do them sequentially. Okay, so gallium arsenide technology was developed entirely for vector computers — very expensive — and eventually people figured out there's a better way. We're building all these small computers, we have the economy of scale, we know everything about silicon, so let's keep pushing that technology forward; but what can we do to work on these big problems? In the late 80s and early 90s, what we started doing was building what are called cluster computers, and this is where we start getting to something similar to what Beocat is. A cluster computer is many computers networked together. If we wanted to, we could take all your laptops and network them together — we have Wi-Fi in here; we could do this. Adam, ten minutes, you want to wire these all up? We'd need things like a common file system — a little difficult since he doesn't know your passwords. You can crack all their passwords in a few minutes, can't you? I have overconfidence in your abilities. So, a common file system, and then you need software that lets the computers communicate, and that's commonly the MPI, or Message Passing Interface, software. But we could actually make a cluster computer out of your laptops here. I've taught a couple of workshops, one down in Mexico City, where we actually did that. It was a two-week-long workshop, 40 people, not with laptops but with wired-in desktop machines, and we set everything up on the spot to be a large cluster for them. So lots of desktop workstations, anything like that, you can make into a cluster computer. These days it's easier to just go out and buy a cluster computer, rack mounted and everything. So again: many computers networked together. The computers are cheaper because we're basically using the same processors in our Beocat nodes as you can get in your laptops — just a little higher end, a little bigger memory, more cores, things like that. So this is a very cost-effective way of getting more computing power. There are catches, though. The programmer must distribute the data across the compute nodes, so whoever the programmer is has to sit down with their application and decide how it's going to be laid out across all these computers.
With parallel computing, you're typically running the same program on each computer. I've diagrammed here a two-compute-node cluster: if you run a parallel program on this, you run one program here and a second one here, and it's almost always a copy of the first one. So you're running the same program, and each side just operates on a different part of the data. The computers usually must exchange data in order to perform most calculations, and that's why you want a fast network in between. So again, let's take the simple case of our vector add. Now, I've been doing this for a long time, so I've decided to put all the red, even array elements on one compute node, computer 0, and all the green, odd ones on computer 1. By doing that, x0, y0, and z0 are all in the same place, so that computation can be handled by computer 0 without any need for communication at all. Basically computer 0 is doing half the computations and computer 1 is doing the other half, and you don't have to communicate. That's a good thing. If I wanted to, I could have given the first half — everything above this line — to one computer and everything below it to the other; same difference. There's no need for communication in this simple algorithm because I divided the memory up between the computers correctly. Now, if I had put x on this computer and y on that computer and wanted to do all the z calculations here, I would have to do some communication. That wouldn't be the better way to do it, because communication costs you time, and then you wouldn't get a 2x speedup; you'd get something less than a doubling of your performance from using two compute nodes. This is too simple an example, but I want to show you that distributing the memory across the computers is the first part of determining what your algorithm does. Another algorithm that's simple to do on a single processor is a matrix multiplication. That's all it is right here — a matrix multiply, written out like this. To calculate z00, where i and j are both 0, I need four multiplications summed together: x00 times y00, then x01 times y10, then x02 times y20 — going across this row of x and down this column of y — and then x03 times y30. Now I've divided this up amongst four processors, four computers: red on computer 0, then green, blue, and lavender, I guess I'll call that one. This is very simple if you're doing it on one computer where all the data is in one node. You can write it as basically a loop over i and j and then an inner loop over k — so a triple loop — and it's this much coding. When you try to parallelize it, it gets very difficult fast, and this is where there's been a lot of experimentation with many different algorithms to minimize the amount of communication. One thing you might think about doing initially is saying, well, I'm going to do all the z calculations on my home node, but I'll start by sending all the x elements to every other node, so every node has a copy of the entire x array, and do the same for y, so every node has a copy of all the y elements. Then I'm done with my communication up front and I can do all the computations, and at least they're divided up. Well, you're doing a lot of communication up front.
If you go on more than 4 nodes — if you start scaling up to thousands of nodes — you're yelling at all the other nodes up front and you get a lot of contention for the communication channels. The other thing you run into: these are 4x4 matrices, but what happens when they're 10,000 by 10,000? You don't want to redundantly store those matrices on each node; you want to divide them amongst all the nodes and then just pass pieces around a little bit, so you don't redundantly store the whole matrices on each node and eventually exceed memory, which you would in most cases. So this is an algorithm that doesn't redundantly store the matrix sub-blocks, and it does the communication while the computations are being done. If you're doing communication at the same time as computation, you can hide that communication time behind the computations: while each computer is doing the computations, the network can be doing the communication to get ready for the next set of computations. If you do that, and do it well, then you may not see any effect from the communication at all; you might see a 4x speedup in this case, and you might be able to scale up to thousands of nodes without seeing any degradation due to the communication. I'll go over very briefly what the algorithm actually is, just to show you that it can get complex. To really understand it you'd have to get a pencil and paper out, diagram it, and work your way through it, so I don't expect you to understand all of this — just to understand a little bit about the complexity of what some of these algorithms do. So we start with the data as shown here. The first step in this algorithm is to broadcast the diagonal blocks of X down each row. What are the diagonal blocks? This sub-block in red is on the diagonal, this one is on the diagonal, and the others are off-diagonal sub-blocks. So this sub-block gets sent to all the nodes in its row — that's just one, so we're sending this data to computer 1 — and this data gets sent to all the nodes in its row, which means it's just sent over to here. So now we're replicating one sub-block only, not the entire X matrix. The next step is to multiply the X blocks with whichever of each node's Y blocks you have available. Currently on this node we have our own X sub-block and the X sub-block that came in, and we go through our Z values and see which terms we have the data in place for. Again, for z00 we can do x00 times y00 — we can do that — and x01 times y10 — we can do that — but we also need x02 times y20, and we can't do that one yet. Still, there are going to be some other terms on this node that we can do. So let's go to the next step. Now we shift the Y blocks up to the node above. This Y block gets shifted above it — well, since there's no node above, it wraps around and cycles down to the bottom. This Y block gets shifted up here, and the same over here, and then we see if there are more contributions to our Z elements that we can do. By this point the Y blocks we still need for z00 have been shifted into place, so now we can do the last two terms of z00. Everyone got that straight? No one confused? My whole point in doing this is to show you that it gets complicated very quickly.
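For contrast, here's roughly what the single-node version is — the triple loop mentioned above, with nothing to worry about except the indices (a sketch assuming square n-by-n matrices):

    /* z = x * y for n-by-n matrices, all sitting in one node's memory */
    void matmul( int n, double x[n][n], double y[n][n], double z[n][n] )
    {
       int i, j, k;

       for( i = 0; i < n; i++ ) {
          for( j = 0; j < n; j++ ) {
             z[i][j] = 0.0;
             for( k = 0; k < n; k++ ) {   /* z[i][j] = sum over k of x[i][k] * y[k][j] */
                z[i][j] += x[i][k] * y[k][j];
             }
          }
       }
    }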
The algorithm I just walked through is the efficient way of doing a matrix multiply across nodes — something that's very simple to do on one compute node, as above, but very difficult once you start doing message passing. You have to be thinking in parallel: where is my data, how do I get it to the right node, how do I get the data I need in order to do the calculations at the right time. Actually doing the programming takes a lot of practice. Again, I'm not trying to teach you how to program; I just want you to understand a little bit about the complexity. I also want to show you some of the actual commands that do the message passing. For message passing we have a library called MPI, the Message Passing Interface. Everything highlighted in yellow here is an MPI command, and this is how you manually communicate from a program on one node to a program on another node. When you start up a message-passing job, there's a script called mpirun. mpirun is just a script that launches all the copies of the job on all the computers you tell it to, but then you have programs running out on the various computers, and the very first thing they need to do is learn about each other. MPI_Init is the way they handshake and say, hi, I'm over here on node such-and-such. The next two commands are how each of the programs gets the number of processors, which goes in this variable, and which processor number it is. So if you're running on 16 machines, nprocs will be 16 and myproc will be a number from 0 to 15. Each of the programs running the same job has a different processor number. I use the variable name myproc; a lot of MPI documentation uses the word rank — which rank are you. These three functions are at the beginning of every MPI program, so if you look at code and see anything starting with MPI and an underscore, it means that code is doing message passing, and when you compile it you'll have to compile with mpicc or mpifort, for a C or Fortran compiler, and you'll have to run it with mpirun. There's also an mpiexec command that does the same thing. Other than that, this is a C program, and it does a lot of the same things as before. What this program is designed to do: we're going to run it on two computers, send a message from computer 0 to computer 1, and then from computer 1 back to computer 0, and that's it. So this is our message: I'm going to send the value in this token variable, which starts at 0. Now again, as the programmer, it's my job to divide up the workload, and the entire workload and data distribution is always divided based on the processor number, because that's the only thing different in the code on the different computers — the code is the same; this one variable is how you differentiate the runs, to decide which part of the data gets operated on or what you do differently. In this case, if myproc equals 0 I do these commands, and if myproc is 1, the second processor, I do these other commands instead; otherwise both jobs run through everything before this block the same way and everything after it the same way. The first one does a send: this is processor 0, the destination is 1, and it sends this variable, one integer, to the destination. Then it sits there and waits for a response. This is a blocking receive — when you hit it, you just wait forever until a message comes back.
So the second computer skips all that and goes down here: it starts by blocking on a receive, waiting for a message coming in from the first node, source 0. It waits for one integer from source 0, so it just sits there until the first computer, computer 0, sends the message out. Then it receives it, increments that variable by one, and sends it back. Then computer 0 can finish its receive, and they both print out the final value of the variable. So again, over here are our two compute nodes, and the numbers show the order of everything; anything in red is what computer 0 does. Computer 0 starts with the token, the variable, set to 0. It sends it to processor 1, then it hits the receive, so it has to wait there — it can't do that yet. The third thing that happens is the receive on process 1 of the message — the message is just the variable — and then it increments it by one, so now the variable is 1, and it sends that back to process 0, which is on this computer. That finishes up its receive, and then they can both print things out. So the message gets sent this way, and then the second one comes back here, and it's done. Questions? Okay, let's actually compile that and run it. Again, when you compile an MPI code, instead of calling the Intel or GCC compilers directly, you compile with the MPI wrapper around the Intel or GNU compilers. These still call the Intel or GCC compilers, but they also set up the include paths and the libraries that you need for any MPI program. So for a C program you use mpicc, give it the source code, and again we name our executable. You can run this to start out with just on the Beocat head node. So here it compiles, and now we have an executable called token_pass. Now, we can't just run it — well, okay, we can, but then it's only running on one process. What we can do is tell mpirun to run it on 2. So now it says hello from 0 of 2, hello from 1 of 2 — those are just some print statements I put in there — and then at the end it says process 0 has token equal to 1, so the variable token is set to 1, and process 1 has the token equal to 1 as well. On computer 0 we never incremented the token; we incremented it on computer 1. The way it got set to 1 on computer 0 is that we sent it to computer 1, which incremented it and sent it back. So that shows that, yes, we did get it back correctly from the second node. A lot of you, when you get to running actual applications, will need to be able to use mpirun and things like that, so again, I just want to expose you to these things. You don't have to understand all of the send and receive commands; I just want to make sure that when you see something like this, you know it's doing message passing.
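Here is a minimal sketch of a token-pass program along the lines of what was just described — a reconstruction rather than the exact file from the workshop directory, but the MPI calls are the same ones highlighted on the slide:

    #include <stdio.h>
    #include <mpi.h>

    int main( int argc, char **argv )
    {
       int nprocs, myproc, token = 0;

       MPI_Init( &argc, &argv );                  /* handshake with the other copies */
       MPI_Comm_size( MPI_COMM_WORLD, &nprocs );  /* how many processes in this job */
       MPI_Comm_rank( MPI_COMM_WORLD, &myproc );  /* which one am I, 0 .. nprocs-1 */

       printf("Hello from %d of %d\n", myproc, nprocs);

       if( myproc == 0 ) {
          MPI_Send( &token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD );  /* send token to proc 1 */
          MPI_Recv( &token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,    /* block until it returns */
                    MPI_STATUS_IGNORE );
       } else if( myproc == 1 ) {
          MPI_Recv( &token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,    /* block until proc 0 sends */
                    MPI_STATUS_IGNORE );
          token++;                                               /* increment the token */
          MPI_Send( &token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD );  /* send it back */
       }

       printf("Process %d has token = %d\n", myproc, token);

       MPI_Finalize();
       return 0;
    }

Compile and run it with the MPI wrappers, for example:

    mpicc token_pass.c -o token_pass
    mpirun -np 2 ./token_pass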
Okay, I've shown you two- and four-node clusters; it can get a lot more complicated than that, and you can get different network topologies. A lot of network switches will get you up to around 40 ports, so if you have 40 computers you can plug them all into the same switch and you're fine — you don't have any network topology to worry about; all of them can talk to each other at the same time without any contention for communication channels. You can also make more sophisticated networks that are still non-blocking out of smaller switches, but let's skip that for now. The other thing you'll see if you move up to larger supercomputers is actual network topologies like this, where each of these is a compute node and this is a 3D grid network. Each node has six communication channels in a 3D mesh, so it talks to the nodes above and below, in front and in back, and to either side. A lot of these also have a wrap-around feature, so if a node is on an edge, it wraps around and talks to the one on the far side, and that's called a 3D torus instead of a 3D grid. This is a picture of Blue Waters. Blue Waters is a computer at NCSA in Urbana-Champaign, a $50 million computer; I think it's got around 100,000 nodes. A lot of what we've talked about is dividing work up among just a few nodes. If you have to do data layout on a very large system like this, you have to be careful that you're not talking from a node way on one end to one way on the other end, because there just aren't enough communication pathways to handle all of them shouting at each other. So if you did something like a 3D FFT, where you have to transpose the data, it just would not scale well on a system like this. It only really works well if you localize your communications: if you're only talking to your neighboring nodes, they can all do a shift simultaneously, as long as it's just to the neighbors. A matrix multiply does that — the algorithm I briefly showed you does shifts to the neighboring nodes; it cycles through them, but each shift in itself is just to the neighbors. So that's an algorithm that would actually map well onto these larger systems. Okay, so let's continue on. What I've talked about so far is running on multiple nodes. Around 2003, computing changed again: instead of building clusters of single-processor machines, we started seeing dual-processor or dual-core machines come out; around 2005 it was quad core, and now we're buying machines with 16 cores for the elves and 24 cores for the heroes. So you get a whole lot of computing cores in each box, and each of these boxes is essentially a computer in itself. In this case you're sharing memory, though. If you run a multi-processor job in this environment, you're still usually using the same code on each of the processing cores; it's again just operating on a different part of the data, and since you can do shared memory within a multi-core computer, all those processing elements can be attacking the same area of memory at the same time, rather than having to pass messages around. Now, you can still use MPI here, where each process has its own area of memory, and in that case, instead of sending messages over a network, you're just doing memory copies between the memory spaces of the different processes. So it's fairly efficient, and it's convenient in that you can run between nodes and then use that same code within a multi-core computer, whereas shared memory programming works only within one multi-core node — it doesn't work when you try to go beyond one. OpenMP is the main way that you program multi-core systems if you're not going to use MPI. It's a lot easier to program, since you're not manually passing around messages, and it's fairly efficient, since the data is not being moved around, but the limitation is that you can only use it within one compute node. With OpenMP you start one job, but it can then spread over many threads, with each thread running on one processing core. Typically, you start a code, and then when you hit a loop where there's a lot of computation to be done, it spawns off many threads.
So if you're running on 16 cores, you'd typically run 16 threads. Then when it's done with that loop, it collapses back down to the master thread and does the scalar sections; the next time it hits a loop with a lot of work, it expands again, puts one thread on each of the processing cores, each of those works on a different part of the loop, and then it collapses back down. So that's what happens in OpenMP, and it's really fairly easy to do this type of programming. This is the vector add code again, and I'll show you the things I changed. One thing I did here is set the number of threads with an omp_set_num_threads of four, so we're setting our default number of threads to four. Then we print out how many threads I have. When we run this, even though we set the number of threads to four, this is going to tell you that there's only one thread running, because we're not in a parallel loop yet — we haven't hit the part where we want to spawn off additional threads. The allocation and initialization are the same. Here is the one main difference: this is where we tell it, with a pragma, to OpenMP-parallelize this for loop. The loop is the same; we're just telling the compiler that if you're doing OpenMP, spawn this loop off onto as many threads as you're given, which in this case is four. I also put this in here so that the first time through, it prints out how many threads we're using within that loop, and then the end is the same. So let's go ahead and compile this. When you compile it, use either icc or gcc — it doesn't matter which — but you have to pull in the appropriate library: with icc you put -openmp, with gcc it's -fopenmp. I'm going to use the icc version, and when I run it, you see that the first thing it says is that it's only using one thread, even though we set the number of threads to four, since we're not in one of the loops where we gave it the pragma. But when it does the actual parallelization in that loop, it is using all four threads. The other thing you can do: I manually set it within the code to use four threads, but if you comment that line out with a double slash, then it won't set any number there, and if you run it, it'll just default to eight threads on the head nodes. You can also set it with an environment variable. So if you comment that line out, recompile, and run it again, it will use eight threads, but then you can set the OMP_NUM_THREADS environment variable to two and it will use two threads. If you're running multi-threaded stuff, you generally don't want to set the number of threads inside your program; your qsub file will automatically set the number of threads to the number of slots, the number of compute cores, that you ask for. So this is one thing that will come up if you're running OpenMP jobs.
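The OpenMP changes to the vector add are small. Here's a sketch of what was described — again a reconstruction, not the exact workshop file:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define VL 1000000

    int main()
    {
       int i;
       double *x = malloc( VL * sizeof(double) );
       double *y = malloc( VL * sizeof(double) );
       double *z = malloc( VL * sizeof(double) );

       omp_set_num_threads( 4 );    /* comment this out to control it with OMP_NUM_THREADS */

       printf("Outside any parallel loop I have %d thread(s)\n", omp_get_num_threads());

       for( i = 0; i < VL; i++ ) { x[i] = i;  y[i] = (double) i * i; }

       #pragma omp parallel for     /* spread this loop across the threads */
       for( i = 0; i < VL; i++ ) {
          if( i == 0 ) printf("Inside the loop I have %d threads\n", omp_get_num_threads());
          z[i] = x[i] + y[i];
       }

       printf("z[10] = %lf\n", z[10]);
       free(x);  free(y);  free(z);
       return 0;
    }

Compile it with 'icc -openmp' or 'gcc -fopenmp' as mentioned above, and if the omp_set_num_threads call is commented out, set the thread count at runtime with 'export OMP_NUM_THREADS=2'.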
Okay, I see we've already lost Adam. All right, so we're getting closer to what Beocat actually is. We've gone from scalar, we've talked briefly about vector, then parallel computing between nodes, then multi-core within a node. The next step is to combine the last two and do multi-core across multiple nodes. To do this, if you want, you can use MPI between nodes and also MPI between the cores of a node. That's maybe the easiest approach, because you only use one level of parallel programming, and it's done a lot, more for historical reasons: you'll see a lot of codes that have been around for two decades that were programmed in MPI because OpenMP wasn't available. They run fairly efficiently even multi-core, so people just don't put the effort in to do OpenMP on top of MPI. This is a better picture of what MPI within a node looks like: if you ran on these eight cores with an MPI job, each process on each core would have its own area of memory, and when you do message passing, you just copy data between those areas of memory. If you want, though, you can do MPI between compute nodes and OpenMP within a compute node. This is the most efficient; it's called hybrid computing, because you're doing two levels of parallelism. It's more work, but it's probably optimal for efficiency. One more layer of complexity — so things are getting more complicated, but we're getting close to what Beocat actually is — brings us back to vectorization. All that stuff I said about vectorization before is now starting to show back up. Our elf nodes can do a vector length of two doubles: two double precision computations at the same time. The Haswells, which are our hero nodes, can do four doubles at the same time. So if your code can vectorize — meaning either it vectorizes automatically or you get in there and make it vectorize — it's doing four of the same thing at the same time, and you get a factor of four speedup. The Intel Phi processors that I mentioned have 512-bit units, and 512 bits with 64-bit doubles gives you eight doubles at the same time. The Intel processors that will be out in about a year are going to have these same 512-bit registers, so within a year your code needs to vectorize or you're losing a factor of eight. That's enormous. The big problem is that right now almost no codes are vectorizing well. Of all the scientific codes that we run on Beocat, if anything were automatically vectorizing, we should see a factor of two difference between the runtime on the elves compared to the heroes, and we're just not seeing that — they're getting about the same time. So almost nothing is automatically vectorizing. In the next round of processors the hardware will change, which should make it easier for things to automatically vectorize. I've gone through and tried to manually vectorize a few codes. You basically have to make sure that all your big arrays are aligned to 64 bytes — in C there's some help for doing your mallocs with 64-byte alignment — and then before each loop you have to tell the compiler that those arrays are properly aligned; you may also have to give it some extra help if it has trouble. I've tried working with this, and I've found the Intel vectorizing compilers to be buggy, so I haven't had much luck with it. I'm hopeful that when we upgrade to version 16 of the Intel compilers they'll be better. I've also got access to an expert at Intel, so I can report bugs to them, and if they have any advice for us, they can pass it back. If you want to actually try this on a simple code, try doing it with the vector add. It will generate a report — I included that report somewhere — and this is what you get as far as help from the compiler when it gives you a vectorization report. Quickly: it's actually doing something useful. It says it fused the loops at lines 19 and 29 — I have a setup loop, then the loop that actually does the vector add — and it fused those together automatically.
Now, this is an example where I didn't actually want it to do that — I should just put a print statement in there to keep it from fusing — but that's fine. Whether each array is aligned or unaligned, it does actually say the fused loop was vectorized. Okay, let's ignore that. It can give you information about striding, and here's where it gives you an estimate of the speedup: the scalar loop cost is 23, whatever 23 units is, and it estimates the vector loop at 9.5, so it estimates a speedup of 2.4. I'm not sure how it gets that, because I didn't really tell it whether I'm running on an elf or a Haswell — I assume it's just using the head node information, but the head node is not a Haswell, so it should be a factor of two max. I don't know how it's getting that estimate, but this gives you some idea of the information it provides. Most of you will look at that and say, I don't want to ever see that again. If you want to vectorize your code, the proper answer is to email me, and I may tell you, well, it's not ready yet. The best approach is that there are a lot of codes where the developers are putting in the effort to vectorize them. I haven't heard of a lot of successes yet, but the classical molecular dynamics code GROMACS is one that's receiving extra attention from the developers and from Intel, as are a few others that we use around here. There's a biology code, MaSuRCA or something like that, that's also getting a lot of attention. So there are about 20 codes that are getting a lot of attention, but Intel is a little behind on this, and everyone needs to be vectorizing as of a year from now. Okay, so anyway, I just want to expose you to this stuff so that you know the current status of some of the issues you're going to need to at least know about. When it comes to optimization, I really won't spend much time on this; I put it in more for information — this is more the sort of thing that I need to deal with. It just shows you that where before I was showing you a processor and memory, there are actually many layers of cache that the system automatically manages in between the processor and main memory. The closer the memory is to the processor, the faster it is, but also the more expensive and the smaller it is. You can see the speeds here: only about 128 kilobytes of L1 cache, but around 700 gigabytes per second to pull data from it, while clear down in main memory you can get close to a terabyte of memory on the Haswells, but only about 17 gigabytes per second to get the data up. So a lot of optimization is trying to keep the data that you're using up high in the cache hierarchy and reuse it as heavily as possible before you have to shift it out. Okay, and just briefly, I'll reiterate the goal of all this parallel computing, whether it's vectorization or multi-node: it's to try to get an N-times speedup on N processes. That doesn't happen a lot. Inefficiencies quite often come from communications. Even with multi-core stuff, if you're running on all 16 cores you're sharing the main memory bus, so even fairly efficient code has that shared resource cutting into its efficiency. If you're doing multi-node stuff, the communication between nodes is going to be your primary source of inefficiency, and an unbalanced load can get in there too. But ultimately the goal is ideal scaling. I think we can skip that too; that's just some ultimate performance numbers that I talked about earlier.
So one of the common questions that we get is: how many cores should I run on? When you start a new job — a new application that you haven't used before — one of the first things you should probably do is a scaling study: measure the runtime of the application on 1, 2, 4, 8, and 16 cores, and then you can make a graph like this. In each run you just put time in front of your executable. This is just a sample: let's say on one core it took 10 hours, and on two cores it took 5 hours. That's awesome — a 2x speedup on two cores. If you run on 4 cores and it takes 2.8 hours, that's a 3.6 times speedup, less than ideal scaling — we'd like to see 4 — but still pretty close, very reasonable. Then you go up to 8 cores and you're down to 1.5 hours, a 6.7 times speedup on 8 cores; that's still fine. If you go up to 16 cores, 1.1 hours, you're down to a 9.1 times speedup. Now you're only at about half efficiency, and I would probably recommend not using the full 16 cores. Your runtime will be faster on 16 cores, but it will take longer to schedule and it's more wasteful of the resources, and we have to share the resources with others. So we'd like to have you above 50% efficiency — closer to 70% is good. The other thing you do when you start running a code is figure out how much memory it takes. You may not know, and about all you can do is run it once and look with kstat. kstat will tell you the actual amount of memory being used, and if it's more than what you requested, kill the job, ask for more, and restart. When I write a code I usually build in an estimate of how much memory it uses, but almost no one else does that, so you really have to do it by trial and error. Another thing — yes, why does the efficiency fall off? That's a very good question. Mostly the reason is — and I'll go back to the slide that I skipped; these are very small graphs, I realize — this is a very typical graph of the communication performance between two nodes. Down here, for small messages, you have the latency: no matter what your message size is down there, it's going to take 1.5 microseconds, so the size of the message doesn't matter much. Up here the size matters totally, because it's the bandwidth that's limiting you for large messages; the latency no longer makes a difference. In between you get a little effect from both. When you're running on one or two nodes, you're in this region where you're bandwidth bound, so if you go to twice as many nodes, you cut the computations in half, but you also cut the communication time by a factor of two, because your messages are half as large — that's going from here to here. As you keep dividing, though, you start falling off this slope and the latency starts getting in there. Now, as you double the number of nodes, your computations are still getting cut in half, but your communications are no longer getting cut completely in half, because you're hitting the small-message latency. That's usually what happens to the parallel efficiency at larger numbers of processors. With OpenMP you get the same effect for different reasons: you have 16 or 24 threads all accessing the same memory, and if it's the same memory page, only one of those threads at a time can access that same 4-kilobyte memory page. So it depends on how the application is written, but you're also sharing the memory bus.
Another good question. Almost two hours in and you guys are still listening; that's a good sign. Only two more hours to go... I am kidding. I didn't want to see too many worried looks, like we have to listen for two more hours today.

Okay, so this is one thing I do too: if you come to me and say, why is my code running slow, and it's an important code that I think I can do something to help with, I will put in my own profiling. I have my own profiling code; it's just one C library. You put its include file in there, then around the sections that I want to time I put a time start with the section name and a time stop, and at the end it prints out this report. The code does all the calculations to accumulate the time being spent in each section I put that around. It prints the most heavily used ones at the top; in this case this is the bioinformatics code called ABySS. It tells you how many times each section was called, the name of the function, and the amount of time. This can really help me determine where time is being spent, so that I can concentrate my efforts there; if it's in an area of communications, I can track that too. So this is something I can put into codes to help me, and it's done in a non-intrusive manner. There are external applications like gprof that will do this, but quite often they're intrusive, and by that I mean: if you take an application and run it under gprof, it will do all of this automatically, but the gprof run might take seven times as long as the actual run, which was the case with this code. That means the data is useless if it's adding seven times the runtime.

Okay, let me talk just briefly about optimizing input and output. As Kyle went over on Monday, we still do have some issues with our file servers. Some of it is interactive use; some of it is also that if you're operating on large data files, input and output files, you can at times get the system into that disk wait state I mentioned when I was talking about htop. If that happens, then what we can offer is to have you move your large files to the local disk on the compute node at the beginning of your run, and that's fairly easy to do. Again, this is all stuff you've seen before: in your qsub script, all you have to do is copy your input file or files to $TMP. $TMP is a directory on the local disk of the compute node that your job is running on, and this will only work for jobs that are on a single compute node. So if you copy your input files to $TMP, then when you run your application, however you list your input files, you need to list them as being in $TMP followed by the name of the input file; some applications will take a directory, etc. You also want to do the same with any large output files, so you write those to the local disk too. At the end, and this is very important, at the end of the run you want to make sure to copy those output files back to ".", which is your current working directory. If you do not copy things off the local disk when your job ends, they get deleted; we immediately purge that space. So again: copy stuff in, run your application, copy stuff out. When the qsub script returns, that space gets cleaned out automatically. Even with our file server problems, we've really never seen the copy in or the copy out here be slow, and the run in between gets the fast local disk.
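Here is a minimal sketch of that copy-in, run, copy-out pattern in a qsub script. The application name, option names, and file names are placeholders, and I'm writing the local-disk directory as $TMP the way it was described in the talk; double-check the exact variable name in the Baocat documentation before relying on it.

    #!/bin/bash
    # (your usual #$ resource requests go here; this pattern works only for single-node jobs)
    cp big_input.dat $TMP/                                   # stage the large input onto the node-local disk
    ./myapp -i $TMP/big_input.dat -o $TMP/big_output.dat     # hypothetical option names for illustration
    cp $TMP/big_output.dat .                                 # copy results back before the script ends;
                                                             # $TMP is purged as soon as the job finishes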
There's about 600 gigabytes of disk space on most every node, so we've only really had one user where that's been a limitation.

We've also just set up another thing called scratch; a lot of systems have this. You do a make directory of /scratch and then your username, and that will be scratch space that you have access to. You do this once and it'll stay there. Things that you put into scratch may be purged every 30 days, or maybe purged in some manner still to be determined; we have policies already in place. Okay, so then, just from the Baocat command line, you can do whatever you want: this is just another directory that you have control over. You can copy your input files there, you can cd over there, you can manipulate your files in place there; it's much easier to work with. Then when you run your script, you don't have to copy things in and out each time. You just run your application, and whenever you want to access those files, you tell it to get them from /scratch/ followed by your username, DaveTurner in my case, and then the file name, so you're giving it the full path and it's sitting in scratch. This is much more convenient, and it does allow MPI jobs to work as well, because all nodes have access to it. Any questions about that? I will give you one warning: if you do copy stuff in there, make sure not to do a copy with -p. A -p preserves the modification date, so if you copy things in with -p to preserve the modification date and the last time you modified that file was more than 30 days ago, the scripts these guys set up are going to delete it. Okay, never mind, they don't tell me anything. So that'll make it a lot easier.
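As a sketch of that scratch workflow, assuming the layout is exactly as described (/scratch plus your username) and keeping in mind that the purge policy may change, it could look something like this; the file and option names are placeholders.

    mkdir /scratch/$USER                # one-time setup of your own scratch directory
    cp inputs.dat /scratch/$USER/       # plain cp, no -p, so the copy gets a fresh timestamp
    # then in your qsub script, refer to the files by their full path:
    ./myapp -i /scratch/$USER/inputs.dat -o /scratch/$USER/outputs.dat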
Okay, the last thing to cover is software installation in general. First of all, we do have some scientific applications installed, and Kyle told you how to get a list of those, but in general you're responsible for installing your own applications in your home directory. This is for several reasons. We don't have the resources or the people to install every code for everyone and keep it all up to date. It's also because you know and follow the science involved with that package better than we do, and for a lot of these things you have to know the science in order to know how to compile it. There's a code called VASP that you can compile in different ways depending on the science, to run with the gamma point only, or the standard version, etc. We are always willing to help, and we'd like to have you start and try it. If you run into a wall, don't hesitate to send a message to Baocat Help, and we will sit down and either advise you on how to get past your problem or, in many cases, I end up actually doing the installs, because they do get complicated. I can also provide optimization advice; sometimes the configuration may ask you to choose a compiler and some of the optimized libraries, and I can advise on that.

So, general instructions for software installation: download the package, decompress it, and, here's one that people don't always do, actually read the documentation. Mostly packages come with a file called README or INSTALL, all in capital letters, and these will be very useful; mostly they have very short descriptions of how to do the install. The basic way of doing it, in most cases, is a configure script: normally you do configure, make, make install. But that would install system-wide, which needs root access, and you do not have root access on Baocat. If you just do configure, make, make install without setting a prefix, it will try to install under root; it will do what's called a sudo, an email will be sent to Adam and Kyle, and they will make fun of you, not in front of you, but behind you, and say, what are these guys doing? This is absolutely true. So this is the one thing that you need to do that's probably not in a lot of the directions: you need to set the prefix to where you want it installed.

Many packages are more difficult than this to install, and often things just don't work, even if they are professionally maintained packages. It can be that our system is different, our libraries are different, we have a slightly older version of a library or a slightly newer version of a compiler, or the people that manage the software are just plain bad; that's quite often the case. We have some quantum chemists back here who can attest to that. So this is an example, called mothur; it's a bioinformatics code, and in this case all we really had to do was download it, because they actually have some pre-compiled binaries now. Even when you see there's a long list of things you can download, it's not always clear to most people whether those pre-compiled binaries will work. In this case there is one called Mothur.cen_64.zip, and when I see that, what I see is: CentOS, 64-bit, so it'll probably work on our system. So we didn't actually have to compile this; we just uncompressed it and the binaries were already there. That's always the easiest way, if there are pre-compiled binaries. There's almost always the source code too, and that will be in a file with the name, the version number, and .tar.gz: the .tar is the archiving, archiving being the way to put many source files into one file, an archive, and the .gz is the compression algorithm that's used. With mothur there was no good documentation, no README or INSTALL files, so I had to do a Google search for "mothur install" and found some directions; they just have you edit the makefile and leave you at that. So you have to know how to edit a makefile. What's a makefile? Well, I showed you how to compile a single file at a time; a makefile takes a larger code that's made up of a lot of smaller source files and header files, and it provides all the rules needed to compile them and combine them into the one executable you'll end up running. So at the end, after you type make, you can leave it alone for maybe an hour while it compiles, and then you'll have your executable, which is called mothur.

Another example is this ABySS code. This was much more straightforward: configure, where I gave it the place to do the install, plus a couple of other options I gave it to make it run more efficiently for memory reasons, and it also depends on another library called sparsehash that I had to compile first and tell it where it was. After that I did a make and a make install, and that was about it. So I showed you that the basic way of doing this is three commands; it's almost always harder than that. We do want you to try it, but don't hesitate to come and ask for advice or help. Our offices are up here on the second floor, in 2219, tucked a little further back, so feel free to send us an email. Just so you know, they want me to intercept everyone before they get back to them. I fooled them, though: I'm never here in the morning, so they don't have that barrier in the mornings.
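To pull the general recipe above together, here is a hedged sketch of a from-source install into your home directory. The package name, version, and install path are placeholders, and real packages, like the ABySS example, often need extra configure options or dependencies on top of this.

    tar -xzf somepackage-1.0.tar.gz                     # un-archive and decompress the source
    cd somepackage-1.0
    less README INSTALL                                 # actually read the documentation first
    ./configure --prefix=$HOME/software/somepackage     # install under your home directory, not under root
    make
    make install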
Okay, additional information. First of all, we have a lot of information in the Baocat documentation. In the workshop directory that I had you copy, I put a UNIX introduction PDF, so if you're not very familiar with UNIX you can go through that. The University of Oklahoma also puts out Supercomputing in Plain English; you can look through that if you're more interested in MPI or OpenMP, and it will give you some overviews. It starts out fairly general and gets very complex, so you can stop where you want. And I will stop. Final questions? Was this useful? I'm trying to expose you to a lot of areas of high performance computing so that when you deal with these as a user, you'll at least have a little bit of a framework, a little bit of background, about what some of these things are. I certainly don't expect you to be able to program in any of these languages or anything like that. No comments? You only have words that you think will be too hurtful for me? Okay, well, we're adjourned. Thank you for coming.