So, following up on Dave's summary, my opening remark is: welcome, and thank you for coming. For those of you who haven't met me, I'm Dan Andresen, and I'm the one who approved your account when you applied for it. It's been fun to see everybody applying from lots of different domains — sciences, research, education — and the areas in which Beocat is being applied have just been expanding like crazy, which is great. So: have fun, ask questions. If we go too fast, slow us down, and if we aren't going fast enough, tell us and we'll make sure we put more content in next time. Give us some feedback and let us know how we're doing, and feel free to email me or beocat@cs.ksu.edu if you've got questions, comments, or suggestions — what are we doing well, what are we doing poorly, how can we help — and we'll go from there. So thanks for coming.

Okay, so I'm Dave Turner. I'm the application scientist for the group. To introduce some of the other people involved: we have Adam Tygart and Kyle Hutson here, who are the sysadmins for Beocat, and we have Brandon Dunn here, who is going to talk in a little bit about using a Git repository as a revision control system.

Our agenda for today is to take the first hour and do some more advanced talks: a little about modules, and then about the Slurm scheduler. I'll also talk a bit more about using Ganglia to look at your jobs as they run, which lets you look at the code in different ways than kstat allows. Then we'll have an overview of Git. About an hour in, as Dan said, we have to give up half of our room, so we'll take a few-minute break and condense down to one half of the room. In the second hour I'll be talking about parallel computing — an overview of what parallel computing is, just to give you a flavor, so that when you look at a program and see an MPI statement in there, you know what it is. I'll lead that into some performance material: even if you're not programming multi-processor codes yourself, you may be using them, so you may need to know about performance-related issues. And at the end I'll lead into software installation — if you have to install your own software, how do you go about doing that?

So we're going to start by talking about modules a little more, just to get everyone on the same page; I'm going to cover a little of what I did on Monday. With our conversion over to CentOS and the Slurm scheduler, we're also using modules. If you're going to use the Intel compilers, for example, you need to actually load those modules. One of the common toolchains is iomkl. I actually have a module load of it in my .bashrc file, so that each time I log in I get this toolchain. If you do a module load of iomkl, that loads the Intel compilers, icc and ifort; it also loads Open MPI compiled with the Intel compilers; and it loads the Intel Math Kernel Library, which includes the BLAS libraries, fast Fourier transforms, and LAPACK routines. So this is an entire toolchain you can use to build your codes. The other main one here is foss, the free open-source software toolchain.
foss is the same idea, but with the GNU compilers. Along with the GNU compilers, instead of the Math Kernel Library you get the free equivalents: OpenBLAS for your BLAS and LAPACK, FFTW, and ScaLAPACK, which has the multi-processor LAPACK routines in it. This also includes a version of Open MPI, but compiled with the GNU compilers.

Typically, if you're going to do a build, you'll want either an Intel build or a GNU build, so you'd load one of these toolchains or the other. If you want access to both at the same time, you can pick and choose pieces. Since I have to do Intel builds and GNU builds in the same day, what I can do is load the whole iomkl toolchain — but then I can't just go and load foss on top of it, because the Open MPI builds would conflict: there's an Open MPI in both. If I did a module load of foss anyway, it would tell me it's putting the GNU version of Open MPI in there instead of the Intel version. So what I do instead is pick and choose some of the GNU pieces, like a module load of just GCC to get the compilers. You can set things up so everything is available on both sides except where there's a conflict, like the Open MPI.

One of the other nice things about modules is that you can pick and choose the version. If you do a module avail and grep for icc, there are many different versions of icc, so if your code for whatever reason doesn't build with the newest version, you can go back and look at older ones.

[Live demo; some fumbling with the display here.] So I have iomkl loaded, and if I just do a module load of foss, it loads — but it tells me it found a conflict, so it went back and replaced the icc build of Open MPI with the GNU build. Again, if I wanted both, I could load iomkl and just add GCC, and get both sets of compilers while choosing the Open MPI I want. Here I'm doing a module avail and grepping for icc, and you can see that if you need older versions of the Intel compiler you can go back: 2018.0 is the current one, and you can go back to 2017.1, so about a year. The same with GCC — you can go back to previous versions, and some codes are picky about that. The defaults are the newest versions, but that's the nice thing about modules: you can go back to a previous version if your code is picky about something.
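Just to sketch those commands out (the module names are as they appear on Beocat; the version numbers will change over time):

    module load iomkl                 # Intel toolchain: icc/ifort + Open MPI + MKL
    module load foss                  # GNU toolchain: gcc + Open MPI + OpenBLAS/FFTW/ScaLAPACK
    module load GCC                   # just the GNU compilers, alongside iomkl
    module list                       # see everything currently loaded
    module avail 2>&1 | grep -i icc   # module avail prints to stderr, hence the 2>&1

Loading foss on top of iomkl is where you'd see the Open MPI conflict message described above.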
Okay, some of the other things with modules. We talked about MPI, the Message Passing Interface. Open MPI is the one we support — it's a common, free one. There's another one based on MPICH out of Argonne National Lab, or rather a fork of it, MVAPICH, developed at Ohio State University. It works the same way: you have the same mpicc and mpifort compiler wrappers and the same mpirun statement. If you want to use MVAPICH2 instead of the Open MPI stack, all you have to do is a module load of gmvapich2. That sets up all your paths for you, and then you just use mpicc to compile and mpirun to run. That's the nice thing about these modules: they set up all your paths to the binaries and to the libraries as well, so switching to a different version of Open MPI or MVAPICH is easy.

The other thing I'll mention here is CUDA. If you have to compile a code to use GPUs, you need the CUDA library to compile for the NVIDIA graphics processors, so you do a module load of CUDA. I'm not sure what gcccuda is versus CUDA — Adam? [Adam: CUDA is just the NVIDIA pieces; gcccuda also loads up a matching GCC.] In general, compiling with CUDA can get tricky, because CUDA is a little picky about version numbers. We ran into one problem where we're running CUDA 9.0 and the code needed 9.1, which actually needs different driver licenses and such. It can also be picky about which compiler you use. So start by doing a module load of CUDA, which should load the compilers it needs, and if you run into trouble with these toolchains, the easiest thing to do is send us email and start working with Adam. He's done a good job of working with these toolchains to get the right versions — or several options of compiler versions with CUDA — for the packages we already have installed as modules. That doesn't mean a new software package won't require something else. Again, the modules are made to make this easy, but getting the right version numbers to compile a given code may take a little work, and the best thing to do is ask us for help.

I think that's all I want to say about modules, except one more thing: if you're having trouble compiling, the first thing you should do when you send us a message is include a module list. That gives us a complete list of all the modules you have loaded and all the dependencies they loaded, and it helps us decipher exactly what's going on. Any questions about modules? Most of this is material we went over on Monday, and I'll go over installing software some more at the end.
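Before we move on, a quick sketch of the MPI-stack and CUDA loads from above (module names as given in the session; check module avail for the exact spellings):

    module load gmvapich2        # MVAPICH2 stack instead of Open MPI
    mpicc mycode.c -o mycode     # the wrapper names don't change between stacks
    mpirun -np 8 ./mycode

    module load CUDA             # NVIDIA's CUDA libraries for GPU builds
    module load gcccuda          # CUDA plus a matching GCC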
Okay, I'm going to go on to some advanced Slurm. We covered the basics on Monday of how to use Slurm — instead of SGE we now have the Slurm scheduler. A fairly simple way of converting an SGE script to Slurm is the kstat.convert Perl script that I wrote, which automatically converts most everything over, and we've pointed you to some of the documentation on the basics. So I want to start with some of the more advanced topics.

The first one is asking for additional resources, using the --gres parameter. By additional resources I mean two things: first your communication fabric, and second your GPUs.

Let's start with the communication fabric. If you're running multi-core jobs on a single node, you don't need to worry about this — that's within a node. This is about communication between nodes. On the elves, the fabric between nodes is QDR InfiniBand. That's nominally 40 gigabit, but some of that is essentially overhead — I can only measure about 26 gigabits per second. The same is true on the moles: we're getting about 26 gigabits per second, with about a one-and-a-half microsecond latency, which is very good. The heroes and dwarves use what's called RoCE, RDMA over Converged Ethernet. That uses the same InfiniBand software, but over Ethernet, and on those two systems we get about 40 gigabits per second, although we're seeing some problems if you're running multiple streams out of a node.

So those are the two choices. If you're doing multi-node work that's embarrassingly parallel — you're doing so little communication that it's not really taking any time — then you don't need to request a specific fabric; you just run on more cores and get that much speedup. That's great. If you're doing multi-node work that does require significant bandwidth, then you need to request a given fabric, and request it in the correct way. With --gres, right in your sbatch script you can put --gres=fabric:ib:1 — ib for InfiniBand. The 1 means at least one gigabit per second, so all you're really asking for is InfiniBand at all. That limits you to the elves or the moles, because those are the systems with InfiniBand. If you're doing something that's going to stress InfiniBand, I would put 40 instead of 1. That's not so important on the elves, because all of them are up to that standard, but on the moles we have some cabling problems, so some of the moles are limited to about 20 gigabits or less. If you're running on the moles, I would definitely put 40 there: fabric:ib:40. Those are technical issues — the moles are new, and we're still working some of those things out. If you're on the heroes or dwarves, you do something similar: --gres=fabric:roce:1, and 1 is good enough, because all of them can do at least the 40 gigabits; if you want to put 40 there, that's fine too. If you're really stressing the network, you should probably be in touch with me, because depending on how you run the code, RoCE is not working like it should — and that's more a problem with RoCE than with us. I'm working with Mellanox, one of the vendors. Well, I was — they've stopped talking to me now.
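As a sketch, the request lines look like this in an sbatch script (you'd pick the one matching the machines you want):

    #SBATCH --gres=fabric:ib:1      # any InfiniBand: limits you to the elves and moles
    #SBATCH --gres=fabric:ib:40     # insist on the full 40 Gb/s; use this on the moles
    #SBATCH --gres=fabric:roce:1    # RoCE, i.e. the heroes and dwarves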
Anyway, that's how you request a given network fabric. There's one other thing you should do if you're running multi-node: put --switches=1 in there. That tells Slurm to schedule your job onto a single switch. If your job lands on multiple switches, there can be a bottleneck between the switches, so for multi-node jobs that are going to stress the network, put in --switches=1.

Let me show you an example. This is my NetPIPE benchmark: the job name; the output file with %j in it for the job number; requesting 40 minutes and four gigs — and note this is --mem=, so that's total memory per node rather than memory per core. I'm asking for a constraint of the elves, because I just want to test this on the elves. I'm asking for two nodes and 16 tasks, or cores, per node — so an entire elf each. I have priority under the reserve, so I'm asking for that partition. And since I'm testing InfiniBand, I put --gres=fabric:ib:40 and --switches=1. That's how I request things to get everything right, and then way down at the bottom I do my mpirun command, asking for two processes, with the actual application and its options.

Okay, the other thing is GPUs. If you're asking for GPUs, it's pretty simple: --gres=gpu: and then the number of GPUs you're requesting. Some codes, like NAMD, will grab every GPU that's available, and Slurm is very good here: if you ask for one GPU, that's the only one your code will see in its environment, even if there are two GPUs on the node. kstat -g will show you the nodes that have GPUs and the jobs in the queue waiting for GPUs. The font size is wrapping these lines, but Dwarf 22 has 32 cores plus two GPUs, and right now Jeff Comer is running a two-GPU job. He actually owns a lot of the GPU nodes: we have five GPU nodes with dual GPUs that are owned by him, and then two single-GPU dwarves that are open. If you need to use GPUs, you'll have to email us and let me know, and I can put you on the list that's approved for using GPUs.

Okay, parallel jobs. For jobs within a given node, the way I recommend asking for cores — even if you're on one node — is to specify --nodes=1 and then --ntasks-per-node= with the number of cores you want. There are other ways of doing this, but I think that's the broadest and clearest. If you're asking for multi-node jobs, you can control how they're spread: --nodes=6 will give you six nodes, and --ntasks-per-node=4 will give you four tasks on each of those six nodes, for 24 cores total. You can also just ask for --ntasks=40. If you do that, it gives you 40 tasks spread across whatever nodes the scheduler feels like, so you should only use this if your code is embarrassingly parallel. It does mean your job will schedule a little faster, because it gives the scheduler the most flexibility. And for requesting memory on multi-node jobs, again, you can request memory on a per-core basis or a per-node basis.
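Putting that together, here's a sketch reconstructing the kind of job header I just described (application name and options are placeholders):

    #!/bin/bash
    #SBATCH --job-name=netpipe
    #SBATCH --output=netpipe.o%j     # %j is the job number
    #SBATCH --time=40:00
    #SBATCH --mem=4G                 # total memory per node, not per core
    #SBATCH --constraint=elves
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=16     # a whole elf on each node
    #SBATCH --gres=fabric:ib:40
    #SBATCH --switches=1             # keep the job on one switch
    ##SBATCH --gres=gpu:1            # the GPU form, if you needed one

    mpirun -np 2 ./NPmpi             # application and options go here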
Getting email sent to you: the most common thing is to set --mail-type=ALL. That covers BEGIN and END and FAIL, as well as requeue events. If you want to control it more finely — say you just want to be emailed when your job begins — you can put --mail-type=BEGIN, or you can separate values with commas: BEGIN,END for example.

Job naming: the short version is -J and the job name; you can also do --job-name=. I think that's what I use — I can look back at my scripts.

You can control where your standard out and standard error go, separately if you want, though I think it's always better to put them in the same place — and that's done automatically: under SGE they were divided, but Slurm combines them automatically. Your normal output, if you don't specify anything, will be slurm- followed by the job ID number, then .out. I always like to customize that, so I use --output=, and I like the SGE style, so I do --output= with the application name: netpipe.o%j, where %j puts in the job number. That gets me what SGE gave me before — it still tacks on the job number — but you can put whatever you want in there. %j is the job number and %x is the job name, so if you did --output=%x.o%j, that would get you something very similar to what SGE did.

Running in the current working directory: you used to have to specify that with SGE; you don't with Slurm — it does it automatically.

If you want to run on a specific type of machine, you use the constraint option. It's best to just specify how much memory, how many cores, and that kind of thing, and let the scheduler put you in the right place. But if for some reason you do want a certain class of machines, --constraint is how you do it. If you want to aim at multiple classes, like the heroes and dwarves — I haven't tested this extensively, but I think you put a pipe in as an OR, so --constraint="dwarves|heroes". The other thing constraints are useful for is the class of processor, like AVX. Some codes, or codes compiled a certain way, will require AVX or AVX2. AVX would preclude them from running on our mages, for example — they're just too old; they don't support AVX. AVX2 would preclude you from running on the mages and the elves. If your code isn't compiled that way, don't worry too much about it.

Slurm environment variables: we're going to need some of these when we talk about array jobs. You use them inside your sbatch script, and I'll show you a few examples. For most anything in your sbatch script there's a corresponding environment variable — SLURM_JOB_NODELIST, for example; there's one for the job name, one for the number of nodes, the number of processes, the job partition. You can look through the list when you're writing your script and get a full sense of which ones are available — it's quite extensive.
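For reference, a sketch of those directives together, plus a couple of the environment variables echoed (the variable names are straight from the Slurm documentation):

    #SBATCH --mail-type=BEGIN,END,FAIL
    #SBATCH --job-name=netpipe            # short form: -J netpipe
    #SBATCH --output=%x.o%j               # SGE-style: jobname.o<jobid>
    #SBATCH --constraint="dwarves|heroes" # the pipe-as-OR form; untested, as I said

    echo "nodes:  $SLURM_JOB_NODELIST"
    echo "ntasks: $SLURM_NTASKS"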
Let me go through a few more things and then show some examples. This is again basically running from an sbatch script like we went over on Monday — there's an example of requesting the fabric in it, although that line is commented out with an extra pound sign. So let me show the example again with a few environment variables. This is my NetPIPE submission script, and I do some processing in it. With NetPIPE I'm asking for complete nodes, because I'm doing performance tests and I don't want anyone else using the network. But at times I only want one process on each node, so I have to process the node list and cut it down to one core per node rather than all of them. I'm using SLURM_JOB_NODELIST as one of my environment variables there, and I'm also using SLURM_NTASKS — down here I shorten it to NPROCS, the number of tasks I'm using. Since I'm asking for two nodes and 16 cores per node, NTASKS would be 32 in this case. So there are a couple of examples of what I mean by an environment variable, and we'll see a little more of this when I talk about array jobs.

Okay, file access. In general, a lot of the time you'll be running out of your home directory. Your home directory is limited to one terabyte of disk space. For a lot of users that's an enormous amount; for other users it's not nearly enough. The bioinformatics people — I see one at the end here — will say it's a very little amount; they sometimes deal with files that are a hundred gigabytes in size. If you're using files a hundred gigs in size, or more than your terabyte, we want those files put in your bulk directory: /bulk and then your username. Both home and bulk are on the Ceph file system, and both are equally fast to access. The difference is that your home directory gets backed up, and we don't have enough room to back up all those big files. Your data is still safe on both, because it's striped over multiple hard disks — if we lose a hard disk, it's not going to be an issue. [Audience: how many hard disks would we need to lose on bulk to lose data?] We would have to lose three whole machines before we lost data on bulk. So don't think it's unsafe there; it's just that when we do get full backups running, we'll do those on homes and not on bulk. Anyone using large files should get in the habit of putting them in your bulk directory — it's just as easy to access there. You can make symbolic links from your home directory to bulk so that it looks like it's in your home directory, and if you have trouble with that, let us know. That's one habit we want everyone to get into: if you get over a terabyte, we're going to start yelling at you to move your files around — and actually most people have done a good job of moving their stuff over to bulk.

There is a scratch file system, but it's not very useful at this time because it's also under Ceph. The other issue with Ceph is that you're limited to 100,000 files in a given directory. That sounds like a lot, and it is for most people, but we do have some codes — one common one in quantum chemistry, Pixade, will put a million files in one directory, and Ceph just can't handle that right now. I am rewriting Pixade, or at least parts of it, to help resolve these issues. We will be putting scratch on Lustre soon — like in a week. Lustre is very much faster than Ceph. It's still a parallel file system; it's not striped; it's meant for temporary storage.
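One aside on the bulk point from above — the symbolic-link trick looks something like this (the link name here is just an example):

    ln -s /bulk/$USER ~/bulk      # make /bulk/<you> show up under your home directory
    mv ~/big_dataset ~/bulk/      # big files now live on /bulk but look local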
Right now, for codes that exceed some of these limits, we're having people run on the /tmp local disk; when we get Lustre working, we'll have people running under scratch, and it'll be faster than the local disk. It'll also let you look at your files while your jobs are running on the nodes: you can log in, and if your log file is being dumped to scratch, you'll still be able to monitor it easily just by cd'ing to it from the head node, rather than having to try to attach to your job on that node. So that's something coming that will be an improvement. But right now, if you need more space, or need to get around that hundred-thousand-file limit, we need to get you on local disk or on a RAM disk.

Local disk you access by simply writing to /tmp. If you're on the head node, for example, you can look at what's on /tmp — it's a very common place. How this is actually implemented: whenever you start up a job, Slurm creates a directory for your job, /tmp/job followed by the job ID number, but to you, the user, it's accessed just as if you were using /tmp. So if you want to use temp space, just write to /tmp; you don't have to cd anywhere else.

Here's an example. First of all, when you're using /tmp you want to request the amount of space you're going to use, so in your sbatch script you do --tmp=100G for 100 gigabytes of requested space. If you ask for over 130 gigs, that precludes you from using the elves — they only have about 130 gigs of space in /tmp; most of the systems have more like 500 or above. Requesting it reserves that amount of space for you and gets you onto the right nodes. An example of how to use /tmp would be: in your script, copy your input files to /tmp; make a directory /tmp/out; and then when you're running your application, direct the input and output there. Now, your application may or may not have that capability — sometimes applications require you to do this through environment variables — but if your application lets you steer where the input comes from and where the output goes, you can do it in this manner. Then at the end, always remember to copy your results back to your home directory. Here I copy — cp with -rp, recursive, and -p preserves the dates — everything that went into /tmp/out back to '.', my current working directory. If you do not copy things out at the end, then when your job finishes it cleans up /tmp and your files get deleted. So you have to remember to copy your stuff back out. Again, this is copy-in, copy-out, and your application has to be able to access the paths involved.
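A sketch of that copy-in/copy-out pattern (the application name and flags are hypothetical — yours may want environment variables instead):

    #SBATCH --tmp=100G            # reserve local disk; over 130G rules out the elves
    cp -rp my_input /tmp/         # stage the input onto the node-local disk
    mkdir /tmp/out
    ./myapp -in /tmp/my_input -out /tmp/out
    cp -rp /tmp/out .             # copy results home BEFORE the job ends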
Here's another way of doing that — this is Pixade again. (And yes, it's --job-name= that I use.) Here I'm asking for 100 gigs of temp space — a few of these other lines are commented out — and then I make some directories: I make /tmp/out, and here's where I run the code. The code itself is a Python code, so I pass /tmp/out into the Python code as the output directory. You have to work with your application on how to tell it to use that directory. The other thing you can do is move everything over to /tmp, cd to /tmp, and actually run your code with that as the working directory — that's another option. Then at the end I copy things back out, and since this code deals with more than a hundred thousand files in the out directory, I actually use tar to archive and compress it before I move it back.

The same thing can be done with a RAM disk. A RAM disk is using RAM memory as a hard disk, so it's very fast. In the case of a RAM disk, you have to request the extra amount of memory you're going to use for storage. In the previous Pixade example I was asking for six gigs of RAM and a hundred gigs of temp space on the local disk; now, since I'm using a RAM disk, I ask for 106 gigs of RAM — a hundred for the RAM disk and six for the application. I've commented out the --tmp part because I'm not using temp space, and I go through the same setup, but in this case, instead of setting up the directory in /tmp/out, I use /dev/shm, which is the path to the RAM disk. Again, in my Python script I have to put that directory in, and when I'm copying things out I use /dev/shm/out — when I'm tarring it up, I tar up the RAM disk contents. Slurm is also nice in that it cleans up the RAM disk afterwards. With SGE we used to have to worry that if you didn't manually delete things, it would leave stuff behind in memory; Adam's got it set up great, so it automatically cleans up and doesn't leave anything on the nodes.
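The RAM-disk variant, as a sketch (the script name and its argument are hypothetical):

    #SBATCH --mem=106G                   # 100G for the RAM disk + 6G for the application
    mkdir /dev/shm/out
    python myscript.py /dev/shm/out      # point the code's output at the RAM disk
    tar czf out.tar.gz -C /dev/shm out   # archive the huge pile of small files
    cp out.tar.gz .                      # Slurm wipes /dev/shm when the job ends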
Okay, the last thing I'm going to cover is a RAM disk — sorry, array jobs. If you're going to run more than, let's say, 20 similar jobs, we'd like you to put them in as an array job. An array job allows you to submit lots of similar tasks as one job. It makes things easier when you do a kstat — it shows up as one job line; it makes it easier for you to control your jobs; and it's easier on the scheduler, because the scheduler knows they're the same or similar jobs. And it doesn't prevent your jobs from running any differently. The only exception is that an array job of more than 300 tasks is limited to 300 running at once. You can always put in multiple array jobs to get around that — if you're doing that and pounding on our file server, we may object, but otherwise it works fine.

With an array job, you use --array= and then the task range. So --array=1-10 gives you essentially ten jobs. You can also step through it: 1-10:2 would step through by twos. You then have to do a little bit of shell programming, and there are a couple of different ways you can go. Since you're submitting one sbatch script, you're going to tell it to do different things based entirely on SLURM_ARRAY_TASK_ID. That's what determines which data set it runs on, or which parameters you use — whatever is different in each of your runs. It depends on what you're trying to do: if you're running the same application with different input parameters, you use SLURM_ARRAY_TASK_ID to determine which set to choose; if you're using different data sets, you can use it in your file names.

I'll give you one example that we developed here — it's probably easier than me describing it. This is one I worked on with one of our users. Most of it looks the same: nodes, tasks — here we're asking for one node and one task per node; we're actually mostly doing eight tasks per node with these, and this was just our testing. He may be running as many as 25,000 array tasks here, so this would be --array=1-25000 — actually I think that exceeds our limit, which is around 20,000 now. This lets him submit 25,000 jobs at once with one script. What you see here is a loop we generated with a counter: the counter counts up until we hit SLURM_ARRAY_TASK_ID. This particular one reads from a given file of file names — FILE_NAMES is the name of the file we're getting the file names from — and we read into a variable FILE_NAME and increment the count. When we get to the right one, we have a different file name each time, it bails out of the loop, and down here is where it actually runs. If you trace through it: the first array task that gets launched has SLURM_ARRAY_TASK_ID set to one. It comes here; count is zero; it reads a file name; it increments count to one; it hits the loop test again and says it's no longer less than, so it bails — but we have the variable FILE_NAME set. Then we go down here, and that's the first file name read out of the file, and we run the Perl script with it.

Any questions on this? [Audience question.] Yes — he's using file names that are not incremental, but instead of reading from a file, you can certainly use your base name plus the Slurm array task ID. That's very commonly used, so that's a good question. For the people on Zoom: for a file name, you can use file names that are increments, and have the array task ID be your increment, rather than reading file names. So let me show you — this is actually our list of file names that it's reading from, and in this case you can see they're not incremental: we start at 17 and go to 33 and 50. If it started at 50 and went 100, 150, 200, then we could step through with the array task ID and set our step to 50. But our user had to be difficult.
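Here's a sketch of that counting loop (the file and script names are placeholders; the real script reads its names from a list exactly like this):

    #SBATCH --array=1-25000
    count=0
    while read FILE_NAME; do
        count=$((count+1))
        if [ $count -ge $SLURM_ARRAY_TASK_ID ]; then break; fi
    done < file_names                # one input file name per line
    perl process.pl $FILE_NAME       # task N runs on the Nth name in the list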
So we had to be a little trickier — the numbers skip around, which is why we did it this way. Here's a more advanced one. Again, this is our test version, but our normal run would be this way, where we'd run only 1,775 of these — but in each job we do 10 individual runs. We're still counting up, and when we get to our SLURM_ARRAY_TASK_ID we read a file name, we copy things to /tmp — because we want to get around some of the limitations on files per directory — we run it, then we read another file name, and so on; we're doing 10 things at a time. Each of these jobs might only last three to ten minutes, so by stacking up 10 we get around some of the overhead of having too little time per job. This is working pretty well for us now, and it's also an example of using /tmp — at the end we copy everything back out of /tmp. So we got around several problems with that.

Okay, let me go back to my web browser... let me finish up by saying: I think I'm finished. So everyone's going to have to move over to this side, and then Brandon's going to — I ran him out of time, so he's got 20 seconds instead of 20 minutes. In other words, he gets 15 minutes and the rest comes out of my time at the end. We're going to close off the room, so take just a couple-minute break.

[Break; room change. Brandon Dunn takes over.]

Can everybody hear me? I'm going to cover a little bit about Git. For those of you who don't know, Git is basically version control for software development. It lets you track your history and keep changes and revisions, things like that. As you're developing code, you have a long history of everything you've done to that code and all the changes you made: files added, lines removed and added, and so on. We host our own Git here at K-State — we actually have two, one in Computer Science and one for Beocat. Since you're Beocat users, you'll probably use that one. If you have a Beocat account, you can log in through LDAP, like I'm about to do. [Fights with the keyboard — it's been a long day, leave me alone.]

There we go. This is the interface for GitLab, hosted on Beocat. As you can see, I have a lot of projects, but this is roughly what your interface is going to look like when you pull it up with your projects. The first thing to do is create an actual project for whatever you're going to be doing. Really simple: you click the button that says New Project. A lot of this is self-explanatory — you enter your project name, and usually you want to give it a little bit of a description; I'm not going to bother with that for time's sake.
I will hit some of the high points here. The visibility level may or may not be important depending on what you're doing development-wise. Private is exactly what it sounds like: you and only you have access to it, unless you manually grant others access. That's a good choice if you're doing development that should not be publicly visible. Internal means any logged-in user — so if I'm logged in as a Beocat user and I've authenticated through LDAP, your internal project is visible to me even though it's yours; that's only for logged-in users. And public is what it says: everybody in the world can see it, clone it, and download it. They can't push to it, but they can clone it, download it, and use it. So those distinctions are important.

Create your project, and this is the page you get afterwards. Right now the project is empty — no files, nothing in there — and Git is going to get mad at you if you try to clone an empty repository. You can do it; it's just going to complain at you. So the easiest thing to do is click one of these blue links: the README, the LICENSE, or the .gitignore. I almost always do the README, because you're going to want a README. The README is just a longer version of the description — it tells what your project is, what it does, and all the prerequisites you need for it. I'm not going to write a real one here — don't do what I just did. Then we commit it. At this point we have our base project started, and we're ready to start cloning and using it for version control.

A couple of things I'll highlight — they've changed the interface again, so let me find this. One of the nice things about the web interface (and you can do this on the command line and in the Git GUIs too) is that you can come in and look at the commit history. You can see when you committed last, or if you're working on a team, who made all the recent commits. Really useful information. Which brings me to another point: when you commit, there is a place for a message. Please put a very meaningful message. It doesn't need to be a paragraph, but meaningful — by meaningful I mean: here's what I completed, here's what needs to be done. Concise, good to go. It'll save you in the long run, when you go back and wonder, why did I do this? A couple of other nice things you can look at: the Graphs page. There's nothing interesting in mine because I just created it, but there you'd be able to see a bunch of different things — I can get to one of my other projects to show you later. The big point is: we have the project created and we're ready to use it.

Now, what do I do with it? It's up here on the server; you need it on your local machine to work with, right? So you're going to do what's called a clone: you basically make an exact replica of what's online and bring it down to your machine. You can do this one of two ways — as you can see, we have SSH and we have HTTPS. You have to go through a number of steps to set up your SSH keys, which I don't have time to show in 15 minutes; it's pretty much self-explanatory, they have a good tutorial, and if you need help, come let us know. So for right now you can just do HTTPS: you copy that URL. Now, if you're on Windows, you'll need to install Git — I personally like Git Bash.
There's also a GUI, but Git Bash is awesome. So: you're here, you've made your project, and you're ready to get it down to your computer. (On Beocat you're pretty much only going to have the command line anyway, so get used to it.) You do a git clone — that's the command that says I want to clone that remote repo — and then paste in that URL, and it's going to prompt you for credentials because you're using HTTPS. Done; you should see that it's cloned the repo. I probably should have put it in a better spot — let me cd to Documents and clone it there instead. Now I cd into it, and you see I have my README in there, so everything's good.

Let me go through the other really basic commands quickly. I'm going to create a text file in that folder — this machine has vi on it; always just use vi, it's fine. (Also: don't do this when you haven't had enough caffeine.) So now you see the README and a test file in there, right? There are a couple of steps you have to go through now to get what you have locally back up to your remote repo. One: you have to add all the changes you've just made, and that's done with a quick git add. Now, I can specify just the test file I made, because that's all I want to add to the repo; or, if I've made a bunch of changes or added a lot of files and want to add them all at once, I can just put a dot at the end — basically a wildcard that says add everything. (That warning it printed is fine — it's just telling you it's replacing Windows line endings with Unix ones.)

So now I've added all of that, and next we have to actually commit, so it logs those changes. We do a git commit, and — remember what I told you about meaningful messages — right here you add a -m parameter, which says here comes a message. (Technically I could have skipped the separate add step by putting -a on the commit, but I like doing the add.) Then I put a message in — and yes, the one I just typed is not a good one.

Your first time running this, you're almost always going to see this complaint: Git doesn't know who you are, or who's committing to the repo. It wants you to configure that, and you just follow the command it prints — it's simple. All I'm doing is telling Git to configure my username — and it's git config user.name, no dashes in the key; it's been a while since I've had to do this config, and I've not had enough caffeine today. Then generally I do my email as well, git config user.email, and you put your email in. It's really easy. Now that it's configured, it will be happy when I do my commit; it's not going to yell at me anymore.
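The whole sequence so far, as a sketch (the URL is whatever you copied from your project page — this one is made up):

    git clone https://gitlab.example.ksu.edu/you/myproject.git
    cd myproject
    vi test.txt                              # create or edit something
    git add .                                # stage everything ('.' is the wildcard)
    git config user.name  "Your Name"        # first time only; add --global to set it
    git config user.email "you@ksu.edu"      #   once for every repo on this machine
    git commit -m "Added test file; still need to wire it in"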
Yay. So we've committed, and now we need to get what we have locally up to the remote. We do a git push, and — boom, look, I moved it up; it's now online. If we go back over to our online project, where I showed you earlier we were looking at the list of commits: look, there's the new one, "added a test." And if we go back to the Graphs page, you'll see it moved — this is where the repo moved from one position to the next. So those are your basic operations.

One of the other things I'll suggest: if you work on multiple machines, like I tend to do a lot, and you make commits from more than one machine, you should get in the habit of doing a fetch and a pull every time, right off the bat, to make sure that the local copy on the machine you're working on is the same as the remote. The reason I say git fetch is that if you can't remember which machine you committed from last, fetch will tell you whether you're behind; then you do a git pull and you're up to date. If nothing changed, it doesn't matter; if something changed, it pulls it down. Like I said, I tend to work on three or four different computers, and I always forget which one I committed on last, so I use this every time, and it's something I encourage you to do: fetch and pull fairly often. In fact, it's the first thing I do when I log in and plan to work on code — a fetch and a pull — and that way I know I'm current with what's in my repo.

The other thing I'll tell you: you don't have to push right away. There are a couple of different philosophies on this. You can keep your commits local — if you're working all day, commit often, so those changes are saved — and I know people who do one push at the end of the day, which pushes all of those commits. It is successive: you can't push until you commit, and you can't commit until you add. But you don't have to do them one right after the other; you can just add, or just commit, as you go.

The big philosophy here is really just version control. Say your code was working yesterday; you made some changes, and now it's broken. You can go back and say: okay, move the HEAD from here back to where it was working, and I'll start over — the broken part basically goes away; I don't care about it anymore. And if you can't remember what you changed that broke it, you can pull the commits up: every commit will show you what was added, what was deleted, which lines and which files. All of that is trackable. There are some more advanced things, especially if you're working in a team, that I'd encourage you to look up and read about — branching and merging, what to do when you have merge conflicts — but that's really too advanced for the 15 minutes we have here. There's a lot to Git, and you don't have to use it just for code development. Honestly, if you're writing a book, you could do that there too, because it tracks as you update paragraphs and things. Anything you want to keep changes or versions of, you can do in here — documentation, anything like that.
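In command form, the routine I just described (the reset line is the "move HEAD back" idea — it throws work away, so be sure before you use it):

    git push               # publish your local commits to the remote
    git fetch              # am I behind the remote on this machine?
    git pull               # if so, bring this copy up to date
    git log --oneline      # find the last commit where things worked
    git reset --hard <good-commit>   # roll back to it (destructive!)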
So it's really just an overall really powerful tool for keeping track of things as you change them. Any questions? I know this was fast and really basic; there are lots of really good tutorials. I do recommend setting up the SSH keys — I'm lazy, I don't like typing my username and password, so I want as little typing as possible. There are plenty of tutorials online, the GitLab we host has its own little guide when you click "set up SSH keys," and if you get stuck on that, I or Adam or Kyle should be able to help you get through the SSH key setup. [Audience question — inaudible.] Yes, if you write it.

[Dave Turner resumes.] Okay, so the next thing I want to go over is to give you a flavor of what high-performance computing is. Not all of you are going to do multi-node work, but it's good to get a flavor of it so that you're aware of the different types of programming involved — and you'll know it when you see it. I'm going to go over a general discussion of what high-performance computing is. I don't expect you to absorb all of it; there are a lot of good tutorials if you actually want to program in any of these methods. It's just good to be exposed to this material.

I'll start with a generic picture of what Beocat is. When you get on Beocat, we have two head nodes, Eos and Climbing. Those are the nodes where you do your compilations, set up your data, all that kind of stuff. When you get to running code, you submit it to Slurm, the scheduler, and the scheduler puts it on one of the compute nodes, depending on how you set up your script. Our current compute nodes, from oldest to newest: we have six mages with a terabyte of memory each. We have 83 elves — a lot of those are down at the moment, and may be down permanently or repurposed. Both of those systems, the elves and the mages, are about five years old or older, so they're toward the tail end of their lifetime. Among the newer nodes, our hero nodes, as we call them, have the Haswell processors: 24 cores, most with 128 gigs, and again fast communications. The dwarves also have Haswells, but with 32 cores and again 128 gigs; there are 12 GPUs among these, plus some higher-level 100-gigabit networking that we're still playing with. These are good workhorses — good new machines. And then we just got 120 moles: 20-core Broadwell chips with 32 gigs of RAM, so they are smaller-memory machines. They still have good network connections at the 32 gigabits per second — we're still doing a little tuning on that. The other thing to mention is that they have only one gigabit per second Ethernet, and where that comes in is access to the file server: the moles have slow access to the file server, and they're smaller in memory. Other than that, the Broadwells are newer chips than the Haswells — 20 cores instead of 24 or 32, but still very good machines. They're just made for running smaller-memory jobs that don't pound on the file server as much.

Altogether we have 7,000 Intel cores on just over 300 compute nodes, with good InfiniBand networking in between all of them, and a one-petabyte effective file server. We're setting up nightly incremental backups. We have 12 GPUs in the nodes, and we actually have an order in for a little over 20 new machines coming with four GPUs per node.
So we're really expanding our GPU capabilities. We're also adding two large-memory machines. Right now our largest-memory machines are the six mages with a terabyte of memory; the two new machines are going to have one and a half terabytes of memory, and they're going to have Tesla-class P100 graphics processors, which do good computation at double precision. The other GPUs we have are mostly single-precision capable — a lot of classical MD codes can use those, but not all codes can be accelerated by them. So that's an overview of the system.

Now I want to go over what a high-performance computing system is, and I'm going to start by going back maybe 30 years or so and showing you what a basic system was back then. It was a very simple time, in that you had a single processor and memory; the program and data were in that memory; and when you did things, you did them one at a time. So it was conceptually very easy. Here is an example of a vector addition. To do it, the machine would start by loading x[0], then y[0], do the x[0] + y[0] in the processor, and the result would get stored back down into memory. (There is some cache memory in between that I'm not showing — the result would get pushed into cache and eventually back to main memory — but things are very simple when you think about this conceptually.) Optimizing code was very easy too: you just had to minimize the number of computations.

Okay, and this is an example code in C of how to do this. If you don't read C, I'll just explain the basics. I'm using an array size of a million. Here is where I allocate the memory for it — one million times the size of a double-precision number — for each of the vectors x, y, and z. Here's where I initialize x and y: x[i] is just i as a double, and y[i] is i times i. This is the loop where we actually do the vector addition: again I set up my loop from zero to a million, just doing z[i] = x[i] + y[i], and then I print out the first hundred. So, a very simple code.

I have this code, some sample sbatch scripts, and these slides all in one directory. If you do a cp -rp of /home/daveturner/BeocatWorkshop followed by a space and a tilde — the tilde is your home directory — that puts it all in your home directory, and if you cd into the workshop directory, you can actually run these yourself. I'm going to go through them pretty fast, so I don't know if you'll be able to keep up live, but if you want, you can go back afterwards, look at the slides, and try some of these. That will let you practice doing the module loads and the actual compilations.
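The copy step, spelled out (the path is approximately as given in the session — check the exact spelling on Beocat):

    cp -rp /home/daveturner/BeocatWorkshop ~   # tilde = your home directory
    cd ~/BeocatWorkshop
    ls                                         # slides, vec_add.c, sample sbatch scripts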
You can see I do a module load of iomkl; then I use the icc compiler — vec_add.c is this file — and with -o I name the executable vec_add_icc. Running it, you can do that with just ./vec_add_icc. I also have an sbatch script set up so that you can submit it to the queue; it's just going to run on one core, the way it's set up. The example down below is how to do the exact same thing with the GNU compiler: module load foss, gcc on the same file, and I name the executable with _gcc at the end. You can run it directly, and you can also alter the sbatch file so that it runs that version. This is the core example file — we're slowly going to add more complexity to it as I walk you through time.
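So the two builds, as a sketch:

    module load iomkl
    icc vec_add.c -o vec_add_icc
    ./vec_add_icc                    # or submit the provided sbatch script

    module load foss
    gcc vec_add.c -o vec_add_gcc
    ./vec_add_gcc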
After using scalar computers like that, which were very simple, the next innovation that came up was vector computers — the Cray vector computers. Instead of silicon technology, these used gallium arsenide technology, which is faster, but it required custom development of all the gallium arsenide parts, so it was very expensive. What we got out of it, in addition to these neat-looking computers, was the ability to do vector computations. By vector computations I mean that instead of doing one thing at a time, you could do 64 things at a time. In the diagram of the vector processor, I'm showing a much wider memory bus up to the processor, because when you do the same vector add, instead of loading x[0], you pull x[0] through x[63] up into the vector registers at once; then you pull y[0] through y[63] up; then you do the 64 computations essentially at the same time, and then you store them. The whole idea is that every time you hit a loop, you're doing 64 things at a time instead of just one — so hopefully you're going 64 times as fast.

That's true in vector computing, and it's also true in parallel computing: the trick with either is that you have to speed up every loop you're dealing with. If you don't, you get bad performance. Here's an example. Say you have three loops, the first taking 30 seconds, the second 20, the third 50, and say you can speed up the first two by 64 times but not the third one, for some reason. The first one now takes 30/64 seconds — less than a second; the second also takes less than a second; but the third one we can't speed up, so it takes the full 50 seconds. We've sped up our code from 100 seconds down to about 51 — only twice as fast — all because of that very last loop we couldn't speed up. Again, you only get the big jump in performance if you speed up every loop. Now, what could keep you from speeding up that third loop? A print statement in there isn't vectorizable; but more commonly, each iteration depends on the previous one, and then you can't do 64 things at the same time — that dependency breaks it, and you'd have to go in and reprogram it so the dependency isn't there. Or maybe your code just isn't vectorizable.

Okay, so gallium arsenide technology was very expensive, and you had to customize everything. People figured out that we're better off just using the silicon technology we have — economies of scale, and the processors are very good — and throwing lots of those processors at the same task. That was the birth of cluster computing.

Cluster computing is parallel computing: now we're running the same program on different processors on different computers, but each copy of the code is the same — it's just working on a different part of the problem. In the case of our vector addition it's pretty simple, in that each of the computations is independent of the others. Here I decided to divide up the problem by putting all the even-numbered elements on computer 0 and all the odd ones on computer 1; then to calculate z[0], computer 0 already has x[0] and y[0], so I don't have to communicate. This is an example of an embarrassingly parallel code: once I distribute the data at the beginning, I don't have to communicate to do the work.

Things get complicated very quickly, though. Here is a very simple code that does a matrix multiply. Over here is the formula for each element: z[i][j] is the sum over k = 0 to 3 of x[i][k] * y[k][j]. Conceptually very simple — this would be a few-line program on a scalar system. But if you do this on a parallel computer, you have to have the right x and y values in the same place at the same time to do each computation. I'll let you read through it if you want, but you have to do several stages of broadcasting blocks down the rows, shifting them in the vertical direction, and so forth, to choreograph all the communications so things are in the right place at the right time. This is the type of work I would actually sit down and do — although I usually don't have to, because there are libraries where people have done this before me, but I've done work like this. And there are tricks: while you're doing the computations, you can be setting up the communications, trying to hide the communication costs behind your computations, and things like that. This slide is just to show you that parallel computing can get very complicated very quickly if you're the one who has to choreograph it.

In cluster computing, we're passing messages between nodes, and that's done with what's called the Message Passing Interface, or MPI. MPI is a library, and all the commands in it start with MPI_. These are the first three calls you'll see in any MPI code — and this is the kind of thing I want you to get out of this: not all the intricate details, but enough of a flavor that if you see MPI_ functions in a code, you know it's a multi-node-capable code. MPI_Init: when you do your mpirun, that starts the same code as multiple processes, whether on the same node or different nodes, and MPI_Init does the handshaking so they all say, "I see you over there, you're part of my group" — now we know we're a group of processes working on the same job. MPI_Comm_size returns the value nprocs, the number of processes in your group, and MPI_Comm_rank tells you which process you are: if you have 16 processes in your group, myproc for your particular process will be a number between 0 and 15. How you divide up who does what work is based specifically on myproc and nprocs. That's why you see those three functions at the beginning of every MPI code.
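The build-and-run side of an MPI code is just this, as a sketch (any toolchain with an MPI in it works; the source file name here is hypothetical):

    module load iomkl
    mpicc token_pass.c -o token_pass   # mpicc wraps the compiler with the MPI paths
    mpirun -np 2 ./token_pass          # start two copies; MPI_Init joins them up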
So that's why you see those three functions at the beginning of every MPI code.

In this case I'm just passing one token between two different nodes. One token; a token here is just a variable. This conditional here is how I'm dividing up the workload. I have two processes in this case: process 0 is going to start by sending a variable to process 1, and then it's going to wait for something coming back. The else branch is the second process: it's going to receive that variable and then echo it back with a send.

So the order of things is: process 1 comes here and waits for something to come in; process 0 comes here and sends; after it's sent, a message comes in here; then process 1 sends it back out; process 0 is sitting there waiting, and when it gets it, it can move on. Then they both print something out, and MPI_Finalize is the last part. So again, this is an example of using MPI on a very simple case.
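Here's a hedged reconstruction of that token exchange in C. The structure matches what I just described, but the names and the token value are mine, not necessarily the slide's:

    /* token.c - pass one token from process 0 to process 1 and back;
       run with: mpirun -np 2 ./token */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int nprocs, myproc, token = 42;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myproc);

        if (myproc == 0) {
            /* process 0 sends first, then waits for the echo */
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        } else if (myproc == 1) {
            /* process 1 waits for the token, then echoes it back */
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        printf("process %d done, token = %d\n", myproc, token);
        MPI_Finalize();
        return 0;
    }

Notice the arguments to MPI_Send and MPI_Recv: a buffer, a length, a datatype, and the rank of the process you're talking to. The choreography is entirely up to you.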
I put another code in that directory, a ping-pong code, that again just bounces data back and forth, which you can look at. There are a lot of tutorials that will get you up to speed a lot faster; I just want to give you a flavor and say: if you see code like this, those are message-passing commands. You're passing data. You give it a buffer, you give it a length, you tell it what node to pass it to, and the user has to choreograph all of that.

Okay, multi-core systems are the next step up. So we started with clusters of simple scalar machines, one compute core per machine, connected by a network. Now we're up to multi-core. I'm showing 16 cores in a box, which would be similar to our elves. You still have shared memory, with a wide bus going up to those processors, and there are multiple ways to program this. You could run that same MPI program on this, where one of the processes is on p0 and one is on p1 and they're just communicating between those two; in that case, instead of passing a message across the network, it would just do a memory copy between two regions of memory.

The other way to program this is with OpenMP or some other multi-threaded package. OpenMP is the most common of those, and it's easier to program. The data is all in a shared area, so you don't have to move it around; you just have multiple threads, one on each processor, working on different data in the same shared memory area.

This is a simplistic diagram of how it works. You have a master thread running the scalar part of your code. Then whenever you hit a loop where there's a lot of work to do, you spawn off your full 16 threads, if you're using the whole machine, and each of those 16 threads works on a different part of the data. When you're done with your loop you collapse back down to the master thread, and the next time you hit a loop you again expand up to your 16 threads. So each time you hit a loop, you're having multiple processors work on it, in the same way that the vector processor was doing 64 things at the same time; here you can have 16 threads working on the same loop at the same time, just on different parts of the data. So it's actually an easier way of doing things, but with OpenMP you are only operating on cores within a given node; you can never do multi-node work with it.

Here again is our example of a vector add, but with a couple of changes. We have to include the omp, or OpenMP, header information here. Down here I'm setting the number of threads to four with omp_set_num_threads, so within the code I'm telling it to run on four threads, or four cores, only. Everything else is pretty much the same until you get down to this compiler directive that says #pragma omp parallel for, which tells it to parallelize this for loop automatically. What this is going to do is have each thread work on part of the loop. I'm not telling it how to do that; the OpenMP runtime is going to determine that for itself. I can give it some directions: I could tell it to have the first thread work on the first quarter and the second thread on the second quarter; I could tell it that each thread takes every fourth iteration of the loop; or I could tell it to work dynamically, where whenever a thread needs a new chunk, it goes out and gets one. So I can give it those directions, but I didn't feel the need to here, so the runtime decides when it runs. Each thread is just going to do some of these iterations.

In order to do this, you compile it the same way, but you have to specify that you're compiling with OpenMP. With the Intel compiler you do that with the -qopenmp flag; with the GNU compiler you do -fopenmp. And then when you run it: right now I'm setting the number of threads by hard-coding it in here. Another way of doing it is to comment that line out and set it through an environment variable, OMP_NUM_THREADS. I could actually put that in my script, so that if I run on different numbers of cores, I can set it to the number of Slurm tasks I have automatically. The code will adjust; I only have to change the number of threads when I request them, and it automatically gets fed through.
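Putting those pieces together, here's a minimal sketch of the OpenMP version. It's my reconstruction of the slide example; the array length and the printed check are assumptions:

    /* vec_add_omp.c - vector add with OpenMP threads */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static double x[N], y[N], z[N];

    int main(void)
    {
        /* hard-coded thread count; comment this out and the runtime
           will use the OMP_NUM_THREADS environment variable instead */
        omp_set_num_threads(4);

        /* each thread takes part of the loop; with no schedule given,
           the runtime decides how to split the iterations */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            z[i] = x[i] + y[i];

        printf("z[0] = %f\n", z[0]);
        return 0;
    }

Compile with icc -qopenmp vec_add_omp.c -o vec_add_omp, or gcc -fopenmp vec_add_omp.c -o vec_add_omp.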
Okay, any questions on that? [In response to a question:] So you're either getting everything or nothing.

So we're going to continue to add complexity. We added multicore; now we're doing multicore clusters. Instead of scalar clusters, each of our systems now has multiple cores, but is connected by a network switch. So again, since we're doing multi-node, we have to do MPI between the nodes. We can still do MPI within a node too, which is what this shows: these would be eight MPI tasks, each one with its own area of memory and its own processor. So you can do MPI within a node and MPI between nodes, and that's probably the most common way, because you only have to do the MPI programming. You can also do hybrid programming, where you do MPI between nodes and OpenMP within a node. In some cases this is more efficient, but you have to program two levels of parallelism, so I just don't see it done very often.

Okay, adding another level of complexity, now we're up to vector multicore computers. In addition to being multicore, each of these processors is now a vector processor, like the old Cray systems, although we're not doing 64 things at the same time. The elves can do two things at a time if you're talking about double-precision numbers. The Haswells, which are our heroes and dwarves, can do four at a time in double precision, and eight at a time in single precision. There are also the Intel Phi processors, which we don't have; they can actually do eight doubles at a time, or 16 single-precision floats. So vector processing does come in. It's difficult to use and somewhat buggy; Intel is doing better every year, but I still don't see very many examples of codes running twice as fast on our Haswell processors as on our elf processors, which is what you should see if codes are really vectorizing well. I have one physics code that I hand-tuned, a triple loop with two lines of code in it, so that it does do vector processing, and yes, it does run twice as fast on the Haswells as on the older systems. But I had to jump through a lot of hoops to do something very simple. So there's a lot of potential speedup there; I just don't see a lot of codes actually doing it well, which is unfortunate. Let's skip over that.
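Before moving on, here's a toy illustration of the kind of dependency-free loop a compiler can vectorize. This is my own example, not that physics code:

    /* triad.c - no iteration depends on any other, so the compiler can
       do several elements per instruction (2 doubles at a time on the
       elves, 4 on the Haswell heroes and dwarves) */
    void triad(double *restrict z, const double *restrict x,
               const double *restrict y, double a, int n)
    {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            z[i] = a * x[i] + y[i];
    }

The restrict keywords promise the compiler that the arrays don't overlap, and the pragma (honored with icc -qopenmp-simd or gcc -fopenmp-simd) just makes the request explicit; at -O3 the compilers will often vectorize a loop like this on their own. A loop where z[i] depended on z[i-1] would not get this treatment.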
Okay, so again, parallel computing. This is kind of the same thing I mentioned with vector computing: the goal of parallel computing is to get an n-times speedup if you're using n processes, whether those are on the same node or on different nodes. Each compute node is typically running an identical program, and it's up to the programmer to divide the work up and choreograph all the exchange of data. If you want more information on the Message Passing Interface, this is a pretty good tutorial, and if you just google "MPI tutorial" you'll get lots of these. Lawrence Livermore also has a good one on OpenMP. I think these two are pretty good, but again, if you google "OpenMP tutorial" you'll find other good ones too.

Okay, so communication. I want to talk a little more now about things from the user's point of view, rather than giving you an overview. Communication between processes is one of the things that's key to how well your parallel program is going to run. I've said a couple of times that we have pretty decent performance on our networking: somewhere between 30 and 40 gigabits per second of bandwidth, and a low latency of 1.5 microseconds. What latency is, is the minimum amount of time it takes to send a packet from one node to another, and 1.5 microseconds is very good; if you just used Ethernet, that would be 10 microseconds or more.

These graphs over here on the right, which I know are very small and you probably can't read the magnitudes on, show our typical communication curves. The bandwidth is on the vertical scale and the message size is on the horizontal scale. On the left side of the graph, everything is limited by the latency: the minimum amount of time it takes to do the handshaking and communicate that first packet, which may be as small as eight bytes. So if you're sending eight bytes across, it's going to take 1.5 microseconds; if you send a thousand bytes, it's still going to take about 1.5 microseconds, maybe a little more. If you send a megabyte, then you're limited by the max bandwidth rather than by that small latency. So when you're talking communications, you're really talking about two numbers: if you're doing lots of small packets you're limited by the latency, and if you're doing large transfers you're limited by the max bandwidth.

Our newer machines are going to have 100-gigabit-per-second bandwidth between them, with the same low latency, so if you're running multi-node jobs, our new nodes are going to be much better at handling them. For communications within a node: again, OpenMP shares data, so you're not moving it, and if MPI processes are on the same node, then when you communicate between two of them you're doing a memcpy. The memcpy can run on the order of 60 gigabits per second with a pretty low latency, so that's decent; it's still quite a bit faster than communicating off-node, and faster memory can speed it up further. It depends on where your data is sitting, and so forth. I'm not going to touch on network topology.

And file access: I kind of touched on that already. We have pretty good file access; we're really seeing it top out at about 20 gigabits per second. [Question: what's the max it can actually sustain?] With iperf for TCP I get about 20. Okay, so we can see roughly 20 gigabits per second getting to the file server, and we can handle that on most of our nodes. The elf and mage nodes are limited to 10 gigabits, while the hero and dwarf nodes can see the full 20 gigabits to the file server. The current moles are limited to one gigabit, so that's not where you want jobs that depend on file server access.
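As a rule of thumb, the time to move a message of size s is roughly

\[ T(s) \approx \text{latency} + \frac{s}{\text{bandwidth}} \]

So with our numbers, an 8-byte message costs about the 1.5 microseconds of latency and essentially nothing more, while a 1 MB message at 40 gigabits per second costs about \( 1.5\,\mu\text{s} + \frac{8 \times 10^{6}~\text{bits}}{40 \times 10^{9}~\text{bits/s}} \approx 200\,\mu\text{s} \), almost all of it bandwidth.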
One common question when you're running a parallel application is: how many processors should I run on? When you start running a job, you know, I can't answer that for you. The first thing you should do is what's called a scaling test. Take a typical-size run of the kind you're going to do, and run it on one core, two cores, four cores, eight cores, and 16 cores, which is what I did in this example. I use the time command in front of my executable, so if I were running an MPI job, I would do "time mpirun" with the number of cores and then the application, and that reports the real time your application used.

So submit a one-core job, and let's say I get back that it took 10 hours. Two cores took five hours; well, that's great, that's twice the speed on two cores, which is ideal speedup. Now we go to our four-core run. I was hoping for 2.5 hours but instead got 2.8, so I lost some efficiency; still pretty good, a 3.6-times speedup on four cores. And in this example I'm getting a 6.7x speedup on eight cores, which is still pretty good. But here I slow down a lot: I go up to 16 cores and I'm not gaining much. So in this case my advice would be: don't run on 16 cores, run your jobs on eight. If you have lots of jobs to run, you can then run twice as many as you could on 16 cores anyway, and this is efficient use of your code. If your efficiency is less than 50 percent, it's not really worth doing; you're wasting resources.

If you later go to a larger data set, you should retest, because more data means your communications are going to be different. If you go smaller, you might be more latency-bound, so your efficiency might drop; you might be limited to four cores instead of eight, for example.

What if you have a multi-node job? The same is true: run it on one node, on two nodes, on four nodes. It's harder to scale multi-node jobs. They used to be very common 15 to 20 years ago, but the computational power of each node has increased very fast, faster than the communications have. So even though we can do 100-gigabit-per-second communications now, the processors on each node have gotten faster and there are more cores per node; they've just gotten so powerful that multi-node jobs are difficult now. That doesn't mean we shouldn't try. If you have a code you want to try, do the same kind of scaling test, but do it multi-node. And if you get into this kind of thing, touch base with me; I enjoy doing this and I can help walk you through it.
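For reference, the quantities behind these scaling tests, in formula form (my summary of what I just walked through):

\[ \text{speedup}(n) = \frac{T(1)}{T(n)}, \qquad \text{efficiency}(n) = \frac{\text{speedup}(n)}{n} \]

So in the example above: 10 h / 2.8 h gives about 3.6 on four cores, which is roughly 90% efficiency; 6.7 on eight cores is about 84%; and when the 16-core run barely improves on that, the efficiency falls toward the 50% mark where the extra cores are just wasted.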
You can also see some job interactions. If you're running on eight cores and someone else is running on eight cores in the same machine, that can affect your job. We really don't see too much of this, but let's say you're on an elf, where there's 10 gigabits to the file server, and you're running there: you might be conflicting with another job that's also pounding heavily on that same 10 gigabits, so you might only get half of it. It does occur. You can also get conflicts over main-memory bandwidth; we really haven't seen much of that. The file server is probably the bigger one I would worry about. And if you're on our heroes or dwarves, there are 40 gigabits of network and you can only max out the file server at about 20 anyway, so you're safe there; but it is something to be aware of. Ignore this "-l exclusive"; that's from the SGE days. And on application profiling: if you get into code development, I have a profiling library I can set you up with.

Okay, so let's talk a little bit more about software installation; some of what I went over kind of fast was compiling individual files. If your code is a single C file or Fortran file, or a different language, you can compile it in one step with a compiler, just type it out, and not worry about it. Most software packages are much larger: they might have hundreds of files, a bunch of header files, and they might be spread across different directories. If you're installing something like that, chances are there's already a makefile available. What a makefile is, is a list of dependencies. For example, a C file will depend on a header file, or a series of header files; all those dependencies are in the makefile, so if you change the header file, it knows to go back and recompile that C file the next time you type make. Those dependencies are already built in for you.

So typically, if a package is set up right, compiling it is these three steps: configure, make, make install. By default, configure will mostly try to set it up as a system install, which doesn't work here, because none of you have root access. So if you do this, it's likely to give you an error where it couldn't sudo, and messages pop up for Adam and Kyle to see. Yeah, and they will laugh at you behind your back. You're trying to do something as root that you cannot do, and if you try it multiple times they will email you and tell you that you can't do this. It just means you forgot, or didn't know, that you have to pass this --prefix option. The prefix tells configure that you're going to install into the directory you name; I have a certain directory, apps, and I put the application's name under that, and that tells it where to install. Then I type make, then make install.

It would be nice if every software package were this easy, but it's not. In some cases there's no configure script and no make install, though there is a makefile in a lot of them. So sometimes you just have to get in and edit the makefile; that's not bad. Hopefully there's a README file. README files are great, because they should give you the simple directions for installing that software package. Others are more difficult. Often things just don't work like they should. I've had software packages where I've had to try compiling them as many as 60 times before I got them to work. Some say they compile fine with the Intel compilers, but I could never get them to work, because the authors compiled under one particular set of circumstances, one particular level of the Intel compiler with one type of math library, and there may be a half dozen libraries that all have to be at exactly the same level; they just don't do broad enough testing. So for codes that are not needed by a lot of groups, you're responsible for compiling them. But at least start: read the README file and try to get through it, and if you run into problems, email us and we will help you.

This is an example of one that's actually pretty good, a bioinformatics code called ABySS. It's C++ code with a professional development team up in Canada. There's a README file; you run configure and put the prefix in there. I had to add one more flag to choose the maximum k-mer size. It also depends on a sparse-hash library, so I had to compile and install that myself first and pass it in as a flag, but then I just do make and make install. So this one is fairly straightforward.
I did have to read the README file. If there's no README file, "configure --help" will tell you some of the configure options, things like that.

This is another one, called mothur, which is a bioinformatics code. One thing that's nice about it is that when you search for it, it already has some pre-compiled binaries. So I found Mothur.cen_64.zip. Maybe that doesn't tell you anything, but it tells me a lot: the "cen" is CentOS. Back when we were on Gentoo, this may or may not have worked, because it's compiled under CentOS; the 64 is 64-bit. Before, we'd have had to actually compile it from source code, which means a .tar.gz file. They also had a Mac and a Windows version. So they do the right things in trying to distribute pre-compiled versions of the software, and if you can get that, it's awesome, because the compilation has been done for you; all you have to do is uncompress it and use it. Most of the time, though, you have to compile from source. And while they do a lot of good things with the pre-compiled stuff, they didn't have a README file or any documentation that I could see, so I had to get on the internet, and it just said to edit the makefile. Once I saw that, it wasn't too difficult, but a simple README file would have been nice.

So that's kind of the basics of installing software. Again, we hope that you will try it first, but you also need to know when to stop beating your head against the wall. We want you to do a little of that, but if your head starts hurting, that's when to send us an email saying, you know: is this something that would be good for other people, so we can make a module out of it? Or can you help me, or do I need enough help that I should come in and see Dave? Things like that. But definitely, if you try first, then we respect you, and then you can come in. We don't want you just saying "here's my code, compile it for me."

Yeah, so try yourself first, and if you run into problems, then depending on the problem you can either reach out to the developer or talk to us and see if we can give you advice. In some cases, if it's something like building a database, we may have to refer you to the developer. But if you run into problems, always touch base with us at least.

Okay, so that's everything I had prepared. Are there any more questions? [Question about getting modules added.] Yeah, if it's used by others, then email us. There are modules out there; in fact, I think mothur is one of the modules that's not built yet. No one's requested it, and because it was so hard for me to build, we haven't attempted it. There are other modules that are commonly used that we just haven't done yet, and we probably won't until someone asks. I know, especially for the bioinformatics people, there's just a ton of them out there. So definitely, if it's a fairly commonly used one, touch base with us. We do have this EasyBuild stuff, where some of the configuration is already set up, and we don't want everyone redoing the same thing. Anaconda is in a module; I'm not an Anaconda expert, though, so why don't we answer that one afterwards.

[Question:] A CPU usually supports more than one thread per core. So I just wanted to know: does each of our cores support multiple threads, or does each core have a single thread? Because some software allows you to run it on multiple threads.
Do you mean multiple GPUs at the same time? Oh, you're talking about hyperthreading. Okay, yeah. So just to repeat the question for everyone on Zoom: the question was about hyperthreading. We, and almost every other supercomputing center, turn hyperthreading off. That means for each core, you're running one process on that core. Hyperthreading can be faster in certain circumstances, because while one thread is waiting on a memory access, the other thread can be doing work. But it messes some other things up in supercomputing environments. It is really nice for running lots of processes, but then you'd be pushing them all through that same core. Okay, any other questions?

[Question: if you want to load a very large amount of data into memory, like an index or something that you want to search through, how would you propose loading that into memory?] So probably using a RAM disk would be good. Then you just do a copy to the RAM disk, and you can control things by, again, requesting the amount of memory that you need. You use /dev/shm as your RAM disk, so it's just a copy up, and then it's sitting in memory and you'll get the fastest access that way. That's at least off the top of my head.

Okay, well, I'd like to thank everyone for coming. I think we had a pretty good turnout between the people who came here, and on Monday I think we had about 50 people on Zoom, so we had a good turnout there too. Thank you.