If you're like most researchers in the biological sciences, to this point in your career you've lived pretty contentedly doing everything on the computer that's sitting on your desk. You do email, you're on the internet, heck, you're watching this video in your browser, right? Perhaps you're using Microsoft Word for your word processing and Microsoft Excel or Google Sheets for some number crunching. Something a little more sophisticated might be GraphPad Prism, or hopefully I've convinced you to start using R. But at some point you're going to need to mature your skills a bit more, because you're going to have more data. Right now I guarantee that most people in biology already have too much data for their own good. So if you project ahead another five or ten years, do you think we're going to have more data or less data? I'll give you a second to think about that. Yeah, of course you're going to have more data, and so you're going to need more sophisticated tools to work with it.

Now, all the tools I mentioned are sophisticated, and they've adapted to changes in computer technology and to our needs. That being said, the software and the hardware we use for all of these things are designed for the general population, not specifically for biologists, and maybe until a couple of years ago that served us well. But as you've perhaps noticed, we are now in an era of big data in biology. We have massively high-throughput sequencing: people are doing genomics, amplicon sequencing, and metatranscriptomics; immunologists are doing flow cytometry and Luminex assays; metabolomics has the potential to generate very large data sets; and colleagues who do image analysis are generating massive files of pictures that they're trying to process on their computers. We're not going to be able to analyze all of that on our local computers. It calls for heavier hardware, and that's the topic I'm going to introduce today, if you haven't heard of it already: the high performance computer, also called an HPC. As you may know, my lab spends a lot of time and energy developing a software package called mothur that's widely used in the microbiome field.
As part of that, I do a lot of instructional presentations to teach people how to use mothur, and as people are getting ready to take those workshops, or are just getting into microbiome analysis in general, they email me and say, "Hey Pat, what kind of computer should I get? This is what I currently have; this is the type of data I have." I always feel uncomfortable answering that question because it feels like I'm trying to spend their money, and I do not want to spend anyone else's money; I really don't even want to spend my own. It's a hard problem: your data sets are going to evolve with time, you probably want a computer you can use for a long period, and you probably want to use it for other things. A lot of the analyses we do might take hours or even days to run, and that ties up your computer so you can't do things like, you know, watch Netflix. The questions that come up are things like: how much RAM, how many CPUs, how big a hard drive? These are all features of computers that are constantly evolving; if you buy a computer today, by the time it shows up it's going to be obsolete. So if you ask me what you should get, I'm going to say, "Well, get the biggest, baddest computer you can," and that's not a very good answer.

The alternative, as I've mentioned already, is a high performance computer, an HPC, and there are two general approaches to using one. Many of us have an HPC at our home institution. The University of Michigan has one called Great Lakes (it previously had one called Flux), a massive computer made up of hundreds of smaller computers. At Michigan we have standard nodes, where we get maybe six or seven gigabytes of RAM per processor, but we also have high-memory nodes, where we can get perhaps a hundred gigabytes of RAM per processor for jobs that require a lot of memory, things like genome assembly, which we have found sometimes needs it. The value of an HPC is that I can run a job on a high-memory processor for a short period of time without having to buy that computer myself, and then move everything back onto a lower-memory set of processors, keeping life cheap, and it's all done in a seamless and easy way. The other approach is to use Amazon Web Services (AWS), where you buy computing resources on Amazon's servers and do your processing there. Both options are very cheap, in fact far cheaper than buying a dedicated machine.

About 12 years ago, when I joined the University of Michigan, I spent a lot of my startup funds to buy computers that were mine and that no one else could use. I just loved knowing that I had my own resources, that I could go point to them and say, "Those are my computers; no one else can use them." As the high performance computing resources at Michigan ramped up, the administrators kept saying, "Hey Pat, why don't you join Great Lakes, why don't you join Flux," whatever it was at the time, and I said, "No, no, I want my own thing; I don't want other people touching my computers," which was a really stupid mindset. Finally I asked them, "I use these computers a lot, I think. Can you tell me what percent of the time these computers are on that I'm actually using them?" What they told me totally surprised me: we were only using about 10 percent of the computer's time.
So if the computer was on for 24 hours in an average day, we were using it for 2.4 hours. There were days we weren't using it and days we were using it heavily, but on average, over months and years, we used it about 10 percent of the time. That means that if the computer cost $10,000, I was effectively paying the equivalent of $100,000 for the computing I actually used, because we just weren't using it. A high performance computer, and Amazon likewise, means that I pay to use a CPU per unit of time. That's the motivation for an HPC like Great Lakes at Michigan or for Amazon Web Services: you take a whole bunch of people who each need perhaps 10 percent of a computer, and ten of them can share one machine. Across Michigan there might be hundreds of researchers, and on Amazon thousands of people, all using different parts of the computing resources, making it cheaper for everybody.

Besides the savings on the compute itself, I can also save money on my laptop or whatever computer I'm using. I don't need the latest, greatest laptop to run mothur, to run R, or to run all these other processes; I need a computer that serves my recreational needs, my word processing, my email, and so on, things that aren't compute-heavy, because I'm using that computer to log in to the high performance computer. I could get a Chromebook and log in to my HPC; heck, I've even run mothur from my iPhone, which is kind of trippy and really not very convenient, but you get the idea: the computer I'm using now is just a terminal for logging in to the HPC, so I can save costs on my local machine. If you're a gamer and you want a totally tricked-out computer, that's cool, but you don't need one to do bioinformatics; you can buy into a tricked-out computer in a way that's cheap, affordable, and convenient to use. The challenge, though, is that while it's convenient and the resources are available, it's not necessarily easy or intuitive to use. So today I'm going to tackle one of the challenges of using the kind of high performance computer you might find at your local institution. For now I'm going to ignore Amazon Web Services, but know that it's a totally viable and very affordable option; if you're interested in learning more about it, definitely leave me a comment down below and I'll see what I can do about putting together a parallel episode using AWS.

One of the challenges of teaching how to use a high performance computer is that every institution's HPC varies wildly: how you get an account, how you log in, how you install software. These mundane things differ across institutions, and while there are some commonalities, I'll leave those tasks to your local systems administrators and educators. One thing that is very common across HPCs, though, is that they generally all use some type of resource allocation management software. The University of Michigan uses one called Slurm; previously we used one called Torque. Let me know down below what your institution uses: are you on Torque, or on Slurm like the University of Michigan? So what are these resource allocation tools?
They're also called workload managers. What they do is give you access to different parts of this massive high performance computer, and you're probably asking, "What do you mean? Can't I just fire off a command at will?" No. First, one thing to know is that these are typically command line interfaces; there are sometimes graphical interfaces, but I'd say 95 percent of the time it's going to be the command line, so you need to learn the command line, and that's one more thing to learn, which is a challenge. The other thing is that if a hundred researchers want to use this resource, we don't want them all going at it at once; if there are a hundred processors, we don't want everyone fighting over those hundred processors. We need workload management tools like Slurm to say, "You need this much of a resource and you need that much," and then allocate the resources so that the most jobs get done in the shortest amount of time, while also prioritizing the people who came first. The people who manage the HPC can adjust who gets priority.

The idea of a workload manager like Slurm is that I tell it, for the program I'm trying to run, how much RAM I need, how many processors I need, and how much time I need. I don't know exactly how much RAM or time a job will take, so I set a ceiling on the resources I require. Slurm then looks at the jobs I'm trying to run (those scripts are called jobs), looks at my friend Evan's jobs and my other friend Jenna's jobs, considers all of their requirements, and figures out the best way to map those jobs onto the resources that are available. Every institution manages its HPC differently: they have different rules for establishing priority and different limits on the maximum time, the number of processors, or the amount of RAM you can request. At the University of Michigan, for example, the longest I can run a job is two weeks. You might think, "Oh my gosh, two weeks, that's forever," but I've had jobs that needed to run for a month, and in those cases you need to work with your HPC's systems administrators to figure out a solution. Again, the priority rules and the ceilings on the resources you can request will vary by institution and by whoever runs the high performance computer you're using, but the basic idea is the same: we have a tool, Slurm or Torque, that gets our jobs onto the processors in a fairly equitable way, keeping in mind the resources that are needed and the resources that are available.

In the remainder of this episode I want to share the three approaches my lab uses to run jobs on our HPC. The first is an interactive mode, where I directly enter commands at the console and they run on the compute resources; you might be thinking, "Isn't that the way we always do it?" Hold on. The second is a batch mode, where I tell it the command to run, go away, and come back when it's done. The third is also a batch mode, but one where I create an array: I submit the job once and it then does a hundred or a thousand things at the same time, each with a different set of parameters. We'll call that an array job.
So let's head over to my terminal, where I'm already logged into Great Lakes, and I'll show you these three different ways to run things with Slurm using the code I've already written for my mikropml demo.

The first approach is the interactive mode. Again, you might be thinking, "Interactive? Don't we always run things from the command line?" Yes, but in this case we're going to use interactive mode to launch a job on what we'll call the compute nodes. Right now I am on the head node, the computer I actually logged into, and Slurm takes jobs from the head node and distributes them to the compute nodes. Interactive mode lets me work directly on one of those compute nodes without submitting a job that I walk away from; I can enter commands in real time. I have a script in here called interactive.slurm, so let me use cat to show you what it looks like: cat interactive.slurm. It contains a command called srun, and I give it my account name (--account=pschloss1), the partition I'm on, the amount of time I need (I believe this is two hours), the number of tasks (one), the CPUs per task (one), and the number of nodes (one; there are multiple CPUs per node), and I'm requesting six gigabytes of RAM, which is probably far more than I actually need. Then, wrapped around the screen here, there's --pty /bin/bash, which basically tells Slurm to fire up bash, the command line environment, on that compute node. That's a lot to remember to type every time, so I made this interactive.slurm script, put a shebang line at the top so I can execute it, typed it all in once, and now I move the script from project to project so it's ready to use in all those different cases.
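Pieced together from that description, here's roughly what my interactive.slurm script contains. Treat it as a sketch: the account name, time, and memory values are the ones I just described for my setup on Great Lakes, and I'm assuming the standard partition here, so swap in whatever values your own institution tells you to use.

#!/bin/bash

# Ask Slurm for an interactive session on a compute node.
# Account, partition, time, and memory are specific to my setup; adjust for yours.
srun --account=pschloss1 \
  --partition=standard \
  --time=02:00:00 \
  --ntasks=1 \
  --cpus-per-task=1 \
  --nodes=1 \
  --mem=6g \
  --pty /bin/bash

With the shebang line at the top and execute permission on the file, running ./interactive.slurm is all it takes to request the session.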
If I do an ls -lth on my directory, you will see that interactive.slurm is executable, so I can run ./interactive.slurm and it puts my work in the queue and starts finding resources; it took maybe 20 seconds to find them and drop me onto a node. It looks like I'm still at the head node, but I'm really logged into one of the compute nodes: you can see that I went from gl-login2, which is a login node, to gl3062, which is a compute node. From here I can type R to go into R, I can do 2 + 2 and get 4, and I can also source code I've developed before, say code/genus_by_genus_analysis.R, run it, and watch the output come to the screen as it works through that R script. This is the interactive mode: in real time I can enter commands into a program like R, but without sucking up the valuable resources on the head node, which other people need in order to fire off their own jobs to the compute nodes. At this point R is asking whether I want to install something, so I'll go ahead and say yes, it installs, and I move forward. When I'm done, I quit out of R with q(), which leaves me on the compute node, and to get out of the interactive mode entirely I type exit, which quits the compute node and brings me back to my login node. So that's the interactive mode. I will be sure to put interactive.slurm in the repository; if you click the link in the description down below, it will take you to a blog post associated with today's episode showing how to get interactive.slurm and all the other Slurm scripts I'm showing today.

The second approach I want to show you is how to submit a batch job. In this version of a batch job, I'm going to submit a single command, or a set of commands, that run in series. In this script, single.slurm, the shebang line is again at the top, and then there's a series of #SBATCH directives. They start with a pound sign, which normally means a comment, but in this case it's telling Slurm what to do, and there are maybe 10 or 12 lines of instructions on how to run the job. The first is my email address (before I post this I'm going to remove it, because I don't want to get emails about your jobs); this is how you get notified when your job starts and when it finishes. Here it says BEGIN and END; if you only wanted an email at the end you would remove BEGIN, and if you only wanted one at the beginning you would remove END. Then you state your resources: how many CPUs per task (here it's one, but I'm going to change it to 16 so I can put 16 seeds on 16 different processors, much like the way we ran it on my laptop, but now up on Great Lakes), one node, one task per node, and probably four gigabytes per CPU is fine (in the interactive mode I think we used six). It's asking for 24 hours; I'll go ahead and make that two hours. pschloss1 is my lab's account, which we all have access to, we're on the standard partition, and then there's some text to format the name of the output file. Things like the account and the partition are values your systems administrators will have to tell you. For the command, I'm going to use make processed_data/l2_genus_pooled_performance.tsv, and to get it to use all 16 processors I need to add -j 16; make will then work through the hundred or so dependencies required to build that l2_genus_pooled_performance.tsv file, sixteen at a time. I'll go ahead and save that and quit out of nano.
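For reference, here's approximately what single.slurm looks like after those edits. This is my reconstruction of the script I just walked through, so the exact flag spellings and the output file pattern are assumptions; in particular, swap the email placeholder, account, and partition for your own values.

#!/bin/bash

#SBATCH --mail-user=YOUR_EMAIL_HERE        # stripped from the posted version; put yours here
#SBATCH --mail-type=BEGIN,END              # email when the job starts and when it finishes
#SBATCH --cpus-per-task=16                 # 16 CPUs so 16 seeds can run at once
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=4g
#SBATCH --time=02:00:00
#SBATCH --account=pschloss1                # my lab's account; ask your sysadmins for yours
#SBATCH --partition=standard
#SBATCH --output=%x.o%j                    # assumed pattern: job name plus job id, e.g. single.slurm.o12345

# Build the pooled performance file, letting make run up to 16 targets in parallel
make -j 16 processed_data/l2_genus_pooled_performance.tsv

Note that the 16 in make -j is meant to match --cpus-per-task, so make never tries to use more processors than Slurm has actually given the job.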
To run this, I do sbatch single.slurm (normally I might give the script a more descriptive name so I know what's going on). It's submitted, and I can now look at what's going on in the queue with squeue -u pschloss (oops, I fumbled the command the first time and had to retype it). It shows my job ID, that it's on the standard partition, the name of the script, the user, the account, and how long it's been running: 14 seconds, so it fired off pretty quickly and it's up and running. I can do something like ls -lth processed_data, and it looks like it's running but hasn't output anything yet, because each job takes 90 seconds or so to complete. If I do an ls, I see that I've got a file named single.slurm.o followed by a bunch of numbers; those numbers are the job ID. I can open it with nano (tab completion to the rescue) and see everything that would otherwise be output to the screen, and that it's fired off those first 16 jobs. This gets pretty messy, pretty gnarly, so I'll chill out and let it run; it should just take a couple of minutes.

So I got my "beginning" email about 12 minutes ago and now my "completed" email from Slurm, and as you'll see it says the run time was 11 minutes 36 seconds and it completed with exit code zero, which means everything was good. If I do an ls -lth on my processed_data directory, I see all the intermediate files, the dependencies for building the pooled performance and pooled hyperparameter (hp) files, and we have all 100 of our l2_genus_<seed>.rds files, one for each random number generator seed. That worked well, and again, that was how we create a batch job for a single command. We could put multiple lines in here: beyond this one make rule we could add other make rules or other commands to run afterwards, and those would all run in series.

Something you might think about is: instead of using 16 processors, why not use 100, putting each of those random seeds on its own processor? Then the whole thing would be done in about 90 seconds, since each seed should only take about 90 seconds to run. That's what we're going to do with an array job. If I open nano array.slurm, we see that array.slurm looks very similar to the single.slurm script I used previously: we have all the same #SBATCH directives setting up the resources we need. What you'll see down here is that the last line in that set of directives is --array=1-10, which says to create an array of jobs numbered 1 to 10; I actually want 100, so I'm going to change it to 1-100. We'll submit this like we did single.slurm, and it will instantly spawn 100 different jobs, which is pretty cool. Slurm keeps track of which array value each job has, which job it is in that array of 100, and stores it in the variable SLURM_ARRAY_TASK_ID. We reference it with a dollar sign and a pair of curly braces, and we can echo the seed into each job's individual output file. So the rule I want each of those spawned jobs to run is make processed_data/l2_genus_${seed}.rds, and where it says ${seed} it's going to plug in a value from 1 to 100 and fire them all off. Once that's done, we can come back and run the single.slurm script we ran before to pull all the values together.
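Here's the gist of array.slurm with the range changed to 1-100. As with the other scripts, this is a sketch assembled from my description: the #SBATCH header lines I've collapsed into a comment are the same kinds of resource requests as in single.slurm, and the seed variable is simply how I pass SLURM_ARRAY_TASK_ID along to make.

#!/bin/bash

# Same sort of #SBATCH header as single.slurm (email, account, partition, memory, time),
# plus the directive that turns this into an array of 100 jobs:
#SBATCH --array=1-100

# Each job in the array gets its own SLURM_ARRAY_TASK_ID (1 through 100),
# which we use as the random number generator seed for that job.
seed=${SLURM_ARRAY_TASK_ID}
echo "Running seed ${seed}"

# Build only the .rds file for this seed; Slurm runs the 100 jobs in parallel.
make processed_data/l2_genus_${seed}.rds

Each job writes its own output file, so the echo makes it easy to tell which seed a given output file belongs to.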
I'm going to go ahead and save this. I also need to remove all of the older .rds files I just created, because if I ran make now it wouldn't do anything: those files are already up to date. So I'll do rm processed_data/l2*, look at the contents, and the directory is empty; we're good to go. Again I'll do sbatch array.slurm, cross our fingers that everything works, and away we go. I can then do squeue -u pschloss, and I see that the job is loaded up and getting ready to go; you'll see that it has a job ID and then, in square brackets, 1-100. Now if I look at that output again, I see that 20 of the sub-jobs are running at the same time; it's as if Slurm is saying, "There are 20 openings, we'll go ahead and plug those jobs in for you, Pat." I can kind of neurotically refresh this periodically and watch how long it takes for all the jobs to get loaded; I think all hundred of my jobs are running now, and it should just be a minute or two before they're all done.

That all ran and went pretty well; it took about two minutes for all of those jobs to find resources and then run those seeds on a hundred different processors, so it was very quick and very efficient. Not that it took so long earlier, but you can imagine that with a much larger job, say a random forest where things might take a few hours to run, being able to run a hundred of those jobs in parallel at the same time will really speed things up versus running them in series. All right, so we've got those hundred .rds files created; let's double check by looking in processed_data, and we see the hundred .rds files for l2_genus for those hundred seeds. I can then rerun my single batch script with sbatch single.slurm, check the queue to make sure everything is good, and it should just take a moment or two. The job completes, compiling all those hundred seeds together, and an ls -lth on processed_data shows the hundred per-seed files as well as the genus pooled performance and hp .tsv files, which we could then use for our visualizations and downstream steps.

What I wanted to emphasize in this episode is how we can use a resource allocation management tool, Slurm, to run an interactive job, a single batch command, and an array batch job. Torque has all the same features with slightly different syntax; I would encourage you to talk with the people running your system, the systems administrators for your high performance computer, to see how to translate between Slurm and Torque. I would do an episode on Torque, but we no longer have access to it, and I think Slurm is a bit more popular anyway; regardless, definitely be in touch with your systems administrators. A final thing I want to do is clean up. One of the downsides of Slurm and Torque is that you end up with all these output files that I really don't care about. I could open one of them in nano and scan through it to see that it's all the same kinds of messages we saw when we were running things in RStudio directly. So I'm going to delete those files: I'll do rm array.slurm.o with a star on the end to remove all of them, and likewise rm single.slurm.o*, and now I have a clean project root directory and we're good to go for the next step in the project. Again, if you want my Slurm scripts, be sure to check the link down below in the description; it will take you to a blog post showing you where to get those files. Does this convince you of the value of working on a high performance computer? Frankly, at some point you're not going to have a choice: you're going to have so much data and your analysis is going to be so complicated that you won't really have a choice but to move to a high performance computer.
Maybe your institution doesn't have a high performance computer; in that case I'd strongly encourage you to try Amazon Web Services. Please let me know down below in the comments whether you would like to see a similar type of episode to this one done for AWS, and I'll see what I can work out. Keep practicing with this, try to use it with your own project, and we'll see you next time for another episode of Code Club.