Welcome back to the Riffomonas reproducible research tutorial series. I hope you were able to make it through the last tutorial, in which we used Git to document the evolution of our data analysis. As I mentioned in that tutorial, we'll continue to use Git as we proceed with the development of our analysis. It's worth emphasizing that to get the most benefit from Git, you really need to commit your changes regularly. There's little point in resolving to use Git but only committing at the end; that misses the point. In today's tutorial, we're going to follow up on a question that I asked at the end of the last tutorial: if I were to accidentally nuke my project directory, how difficult would it be to get back to where I was before the disaster? I could use version control to get back all of our code, but it might be difficult to pull together all the pieces unless I'd already planned for such a calamity. Today we're going to discuss how we can use bash scripts to automate our analyses. This will minimize our involvement in running individual steps. At the end, I'll intentionally delete my directory and show you how we can get back up to speed. Remember that our primary motivation is to be a better collaborator to our future selves. If we can pull off the feat of nuking our directory and then regenerating the analysis, then we can be pretty confident that our analysis is close to being reproducible by ourselves as well as others. In today's tutorial, we'll make use of some of the mothur batch scripts that are built into the new_project template. You don't need to know how to use mothur for this tutorial, but it's a great program that you should definitely be familiar with if you're doing any type of 16S rRNA gene sequence analysis. Now join me in opening the slides for today's tutorial, which you can find within the reproducible research tutorial series at the riffomonas.org website.
Before we get going on today's tutorial and talking about automation, I want to remind you of what we talked about in the last tutorial on using version control. Here's a little exercise that I'd like you to think about with me for a couple of moments. Let's assume we've made a few changes to our README.md file to update the versions of the software packages that we're using in our analysis. What are the commands that you would enter to commit and push those changes? Go ahead and hit pause, think about what those commands would be, and perhaps even write them out. So how difficult was this for you to remember? Again, with repeated practice of writing these commands in actual projects that you're working on, it'll become something like second nature. Here's what I would recommend doing. We start with git status, the goal being to see what files have changed. Then git add README.md to stage those changes. Another git status to make sure that we're getting ready to commit exactly what we want to commit. Then git commit -m with the message "Update versions of software packages"; again, we want a pithy, declarative, imperative statement about what has happened. Finally, we want to do git status to make sure that our commit went through like we'd hoped, and then we can do a git push. You should also note that if you want to be extra careful, you might run git pull before the first git status. This will make sure that you don't have any conflicts if, say, you changed something on the GitHub website before you made the changes to your README file. I always forget that I make these types of changes, so it's good practice to put that git pull in first. It doesn't really cost you anything and it saves you problems down the road. So for today's tutorial, we have several learning goals. The first is to appreciate the value of automating your analyses.
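By the way, in case it helps to see that commit exercise end to end in one place, here's the command sequence run in a throwaway repository; the file contents and the commit author are made up for the demonstration:

```shell
# Set up a scratch repository so the workflow can be run end to end
tmp=$(mktemp -d) && cd "$tmp"
git init -q
echo "mothur v1.39.5" > README.md            # pretend we updated software versions

git status --short                           # 1. see what has changed
git add README.md                            # 2. stage the change
git status --short                           # 3. confirm what will be committed
git -c user.name="demo" -c user.email="demo@example.com" \
    commit -q -m "Update versions of software packages"   # 4. pithy, imperative message
git status --short                           # 5. confirm the commit went through
# 6. in a real project: git push (and a git pull up front, to be safe)
```

The git pull and git push steps are left as comments here because the scratch repository has no remote to talk to.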
I've already talked about that a bit in the introduction. We're going to build executable scripts that contain analysis pipelines. We're going to differentiate between absolute and relative paths and discuss which one you should be using. And finally, we're going to identify the limitations of having a driver script like this, which points forward to a few tutorials from now, when we'll talk about other approaches that we can use for automating this type of work. Hopefully you'll remember the Noble manuscript that we read a few tutorials back. In there, he talked about a driver script, and we put that to the side for the moment while we were talking about project organization. Some of the points that he made, as you might remember, were to record every operation that you perform, comment generously, don't edit intermediate files by hand, use a driver script to centralize all processing and analysis, use relative paths, and make the script restartable. All of these elements relate to what we're going to talk about today: having a driver script. When we think about recording every operation that we perform, it's important to keep in mind that any analysis you're likely to perform once, you'll probably actually perform twice, or three, four, or five times. By having these steps recorded, and perhaps automated, it'll be that much easier in the future. Your mental model as we do this should be that someone is going to download your repository and try to rerun your analysis. So something to think about: suppose someone, perhaps you, nukes the contents of your data directories. How would they regenerate figure one? What steps would they have to take to go from nothing back to being able to generate figure one? Keep in mind that's more likely to be you than anyone else.
Alternatively, if you were to go home for the night, can you keep the analysis running without having to be at the computer? It's really nice if the analysis is automated so that you can fire up your AWS instance, use tmux, start your driver script, walk away, and come back in the morning to find that your analysis has completed without you having to do anything more. If this isn't possible, say because you're using a GUI like the ARB software I talked about earlier, you really have to take explicit notes that are available to others, again including yourself. Comment generously. In bash and many other scripting languages, you write comments by putting a pound sign before the comment; that pound sign tells bash, or R, or whatever language, to ignore everything that follows. We want our code to be as readable and self-documenting as possible, and at a minimum, each script or section of a script should have a header that indicates the inputs, the outputs, and what's going on in the script. Do not edit intermediate files by hand. Again, we want to keep our raw files raw. If we start from raw files, or files that are downloaded, the script provides the provenance of each derivative data file: by scripting the analysis we automate it, and we never write over earlier files. Going back to that paper airplane example, imagine how much more uniform, reproducible, and fast the construction would be if it could be programmed. I just love these videos of automated paper airplane folding. This is what we're going for as we think about our data analysis pipeline. We're not so interested in folding paper airplanes, perhaps, but we are interested in folding our data analysis, right?
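As a sketch of what "comment generously" can look like in practice, here's a hypothetical script header; the script name and its inputs and outputs are made up for illustration:

```shell
#!/usr/bin/env bash

# download_raw_data.bash  (hypothetical example)
#
# Inputs:  none
# Outputs: data/raw/ populated with the raw fastq files
#
# Everything after a pound sign is ignored by bash, so use comments
# to explain *why* a step exists, not just what it does.

mkdir -p data/raw            # keep raw files in their own directory
echo "downloading raw data"  # placeholder for the real wget commands
```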
We're interested in starting with a blank directory with some code in it, launching that code, and coming back to find that the code has run and we've got data directories full of raw data and processed data, and eventually, perhaps, a written manuscript. We'll get there eventually. Use relative paths. The first example on the slide is what's called an absolute path. You know it's an absolute path because it starts with a forward slash; that forward slash at the beginning indicates the root directory. The entire path tells the computer how to get to the README.md file in the data/raw directory of my project, starting from the root of the file system. If it's relative to the root of the file system, that's what we call an absolute path. The second example is what we call a relative path. Assuming that we're in the root of my project directory, say Kozich_reanalysis_AEM_2013, it tells the computer where the README.md file I'm interested in is located, relative to the root of my project directory. This README.md, then, is a file that we'll assume is within data/raw. So why are we interested in relative paths, the second example? Well, if I were to clone this repository to my local laptop, I will not have a home directory called ubuntu. I have /home/pschloss, but not /home/ubuntu. So my computer will throw an error when it tries to find this README file using the absolute path. However, if I clone the repository, cd into the Kozich_reanalysis_AEM_2013 directory, and then say data/raw/README.md, my computer will find it, because the path is relative to the root of my project directory. However, if I say README.md from the root of my project directory, it's going to give me the README for the overall project, not the one in data/raw.
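To make the distinction concrete, here's a small self-contained demonstration; the directory names are illustrative, not the actual project:

```shell
# Build a toy project and show both kinds of path
mkdir -p project/data/raw
echo "raw data notes" > project/data/raw/README.md

# Absolute path: anchored at the file system root, so it breaks on any
# machine whose directory layout differs (e.g. /home/ubuntu vs /home/pschloss)
cat "$(pwd)/project/data/raw/README.md"

# Relative path: anchored at the current directory; works for anyone
# who clones the repository and cds into the project root
cd project
cat data/raw/README.md
```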
You could assume that somebody's going to cd into data/raw, or you might provide code that cds into data/raw, but that's dangerous: if there's an error at some point and that cd command doesn't get run, then you're again looking at the README in your project's root directory. So it's always best to run your commands assuming your files are in a location relative to the root of your project, not relative to the root of your computer, and not to assume that the user, or you, is moving around with cd commands within your project directory structure. Use relative paths. Make the script restartable. We'll save this for later, when we learn about make, but the idea is that if we only change the script that plots the ordination data, we shouldn't also have to re-download all the raw data. The automation should be smart: it knows where we are and can keep track of the dependencies. The reality is that most bioinformatics tools, things like mothur, QIIME, and velvet, are run from the command line and are not meant to be interactive. By interactive, I mean things like Microsoft Excel or ARB, where you go in, manipulate data, and select different sequences. Command-line tools aren't meant to be interactive on that scale; they take in commands, the program runs them, and it spits out output. The other reality is that you probably don't want to run big analyses on your laptop. You need to run them on a computer cluster, which is why we're using the Amazon EC2 service. Also, few computer clusters will let you run your programs interactively. You'll end up submitting jobs, as they're called, which requires special scripts. Again, think of that motivation I mentioned earlier, where you want to fire up your analysis at, say, five o'clock on a Friday afternoon and come back on Monday morning to find that your analysis has completed.
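We'll do this properly with make in a later tutorial, but as a taste of what "restartable" means, one crude way to make a bash driver partially restartable is to guard expensive steps so they're skipped when their output already exists; the file name here is a placeholder:

```shell
# Only "download" the raw data if it isn't already on disk
mkdir -p data/raw
if [ ! -f data/raw/raw_data.tar ]; then
    echo "downloading raw data"          # placeholder for the real wget command
    touch data/raw/raw_data.tar
else
    echo "raw data already present, skipping download"
fi
```

Running this twice prints the download message the first time and the skip message the second; make generalizes this idea by tracking the dependencies between files for you.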
To do that, we're going to need special scripts to run our analysis: the driver scripts, as Noble called them. Bash scripts can be very straightforward or very complicated; we want to keep ours fairly straightforward. A script file is a text file that contains commands that you would otherwise run at the command prompt. There's nothing magical about it: it's a text file into which you can copy all of your commands, as instructions telling the computer what it needs to run. We're going to return now to the Kozich reanalysis and add the download commands to a new driver script, to start building it up. As we do this, we'll be sure to add comments. Of course, we're also going to be using the Amazon EC2 instance that we've been working with in previous steps of this project. I'm going to log into my instance here; this should hopefully be becoming second nature to you. I'll open up my terminal, copy the IP address, and go into my Kozich reanalysis directory. If I ls, I'm reminded that the instructions for downloading my raw data are in my data/raw/README.md file, so I can do nano data/raw/README.md. The lines I'm interested in are the wget, tar, and rm commands. I'm going to highlight and copy those lines, then quit out. Now I'll do nano analysis_driver.bash. This opens up a new directory, I'm sorry, it opens up a new file, and I'll paste. The pasted text is markdown, and I'd like to convert it into comments for my new analysis_driver.bash file, so I'll put a pound sign ahead of the prose lines, and I can get rid of that and that. Just to make it look a little nicer, I'll put some blank lines in there. Oops, I've called the file analysis_drive.bash, so I'll rename analysis_drive.bash to analysis_driver.bash. Let's also go ahead and get the reference files for the SILVA and RDP training sets. Those instructions are stored in my data/references/README.md file. Again, I can highlight those lines, copy them, and nano analysis_driver.bash.
Then I'll come down here and fix my commenting: a comment there. To get rid of an entire line in nano, you can do Ctrl+L, I'm sorry, Ctrl+K. Ctrl+K again. It looks like some weird stuff happened to the text here; this should read taxon=Bacteria in the mothur code. Down here, I can put a pound sign, and again Ctrl+K to get rid of that bash line and those three backticks. Again, what I've done is copy the text from my README files into this analysis_driver.bash file. All of the instructions are here, and now there's code for each of these steps. If you're using the exact code I am, it should look like this; you might instead have the code that you used when you created your README files in the previous tutorial. I'm going to go ahead and save this and exit out. One other subtle thing we need to add to our bash script, at the top, is a line called the shebang, the whole shebang. That is pound, exclamation point, slash usr, slash bin, slash env, space, bash: #!/usr/bin/env bash. When we run this file, this tells the Linux system what interpreter to run it with, namely bash. We can save that; see, we get some special syntax coloring because we've told nano it's bash, and we exit out. Now I can run this as bash analysis_driver.bash. If I run that, you'll see that it downloads and runs everything. I'm going to speed this up a bit because it looks like it's going to take a little while. It finally finished; it took a few minutes to run through, download, and move things to where I wanted them. I can type ls to make sure everything looks clear. If I do ls data/raw, I see that I've got my raw files here; everything looks on the up and up. If I do ls data/references, again, I see my files here as well. It looks like everything is where it's supposed to be. Now what do we do? If you said we need to commit, you're right.
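Before we get to the commit, here's that shebang pattern in miniature, with a harmless placeholder standing in for the real wget, tar, and rm lines from the README files:

```shell
# Write a tiny driver script whose first line is the shebang, then run it
cat > analysis_driver.bash <<'EOF'
#!/usr/bin/env bash

# Download the raw and reference data (placeholder for the real commands)
mkdir -p data/raw data/references
echo "raw and reference data in place"
EOF

bash analysis_driver.bash
head -n 1 analysis_driver.bash    # the shebang line
```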
Let's go ahead and do git status, git add analysis_driver.bash, git status, git commit -m, and I'll say "Add code to obtain raw and reference data". Git status: we're good to go. It looks like previously I forgot to push; we'll save a push for later, but you might want to push at the end of every day that you're working on your project, or every couple of hours, just as a backup for what you're working on. The next thing we want to do is start analyzing this data using mothur. You may have noticed this already, but if you look at ls code, you'll see a number of batch files in there for running mothur commands. If we do nano code/get_good_seqs.batch, you'll see some mothur commands in here; let me make my window a little bigger. You'll see that some of the commands take up multiple lines, but these are all mothur commands that we can run by doing code/mothur/mothur code/get_good_seqs.batch. I'm not going to run this right now, but this is how we could run it with mothur. I'm going to hit Ctrl+C to get off that line, because I'm going to put this into my analysis_driver.bash file. Back to the get_good_seqs.batch file: just to point out a couple of things, if you're using this for your own data, one thing you'd want to change is that in make.file we use the prefix stability; the data relate to a stability analysis of the murine gut microbiota. Say you're doing something related to soil, you might change this to soil; antibiotic treatment, you might change it to antibiotics; whatever prefix you want to give to the files that mothur will generate. Another thing to note is that we're using relative paths: the input directory is data/raw, the output directory is data/mothur, relative paths throughout. One other thing we want to double-check is that we have all the input files. We've seen that our data/raw files are there.
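As an aside before we check those reference files: a mothur batch file is nothing more than a plain-text list of mothur commands, one per line. Here's a fragment in the style of the MiSeq SOP; these are not the actual contents of get_good_seqs.batch, and the parameters are illustrative:

```text
make.file(inputdir=data/raw, type=gz, prefix=stability)
make.contigs(file=stability.files, processors=8)
screen.seqs(fasta=current, group=current, maxambig=0, maxlength=275)
```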
One file we want to check for is the silva.v4.align file, and we also want to make sure that our data/references trainset file is there, the pds fasta and taxonomy files. Let's Ctrl+X out of that and do ls data/references. We see the trainset files here, the fasta and the tax, but we don't see a silva.v4 file. To generate that file, we need to run some other code. Let's go ahead and do nano analysis_driver.bash. I'll come down to the bottom of the file and create a comment that says "Generate a customized version of the SILVA v4 reference data set". To do that, we'll do code/mothur/mothur "#pcr.seqs(...)". I need a pound sign: this is the command-line way of running mothur, where you don't have to go into mothur but can run commands directly from the command line. Inside the parentheses we'll do fasta=data/references/silva.seed.align, start=11894, end=25319, keepdots=F, and processors=8, and then we'll close the parentheses and the quote. We'll also then do mv data/references, and, I'm getting a bit scatterbrained here with these data/references paths: mv data/references/silva.seed.pcr.align, and we're going to move that to data/references/silva.v4.align. Again, this is some inside baseball, if you will, about running mothur, but we can generate a version of silva.seed.align, which is in our data/references directory, that targets the V4 region. If we save that and exit out, I just want to briefly show you, if you open a new tab and go to the mothur.org wiki main page, that there's a link here for the MiSeq SOP, and what we're following in this code is all embedded in that wiki page. Somewhere further down, right here, we have the pcr.seqs command that we just put into our driver, as well as
renaming the file; we could have used mothur's rename.file command, but we did it with mv, and that should work pretty well. So if you're wondering where all the commands in those batch files I've given you come from, they're all listed here. Now, what we could do is again run bash analysis_driver.bash, but that's going to download everything again, and this is one of the problems with our analysis_driver.bash file as we currently have it. So I'm going to do Ctrl+C, and one of the things we can do instead is nano analysis_driver.bash, come down to the bottom of the file, highlight these new lines, copy them, quit out, and then paste the lines into the command prompt, and it will run my mothur code. It's running the pcr.seqs command, it runs through, and, aha, it says "cannot stat", which means it can't find data/references/silva.seed.pcr.align, no such file or directory. Do you see what I did wrong? I misspelled "references". So let's go back into nano analysis_driver.bash, fix the spelling of references, save it, again highlight and copy these lines, paste them back in, and hopefully we run this without any error messages. That works great: if we do ls data/references, we see that we have the silva.v4.align file. Okay, great, so what do we want to do? Let's commit it: git add analysis_driver.bash, git status, git commit -m, and we'll just say "Generate silva.v4.align file". Great. Now we want to add in some more of those mothur commands, so nano analysis_driver.bash, come to the end of the file, and add a comment: run mothur through the data curation steps. To run that, we'll do code/mothur/mothur, making sure we spell it right, and then code/get_good_seqs.batch. We'll go ahead and save this. Just to prove it to you, you'll notice in this nano file that we're running code/mothur/mothur, that is, the path to the mothur executable is in the mothur directory within code, and so
if we do ls code, we see there's a directory for mothur; ls code/mothur, and we see the various executables. If we do code/mothur/mothur, we open up mothur, and again this proves to ourselves that mothur is where we think it is. We can type quit, ls, and we're back where we were. So again, if we go back to that analysis_driver.bash file, we can highlight this line of code, copy it, and run it in our terminal window. What I did was paste that line in, and now mothur is going to run this, and it's going to take a bit of time, so I'll try to speed up the output here. Great, the analysis finally finished; as I said, I sped it up. What you might notice is that I lived on the edge here and did not use tmux to run the analysis. Running it in tmux would have been a bit safer, because if for some reason my internet connection had gone down, the analysis would have kept running even though I would have been logged out of EC2. All right, so the next thing we'd like to do is run the next batch file in our analysis workflow. If we do ls code, we can see that the next one in the pipeline is get_error.batch, so I'll do nano code/get_error.batch. Looking at what happens here, this uses the mock community data to run the command seq.error, but one reference file it needs is HMP_MOCK.v4.fasta. Let's double-check that we have that in our references directory: if we do ls data/references, I do in fact have an HMP_MOCK.v4.fasta file, so we should be good. So let's go to our analysis_driver.bash file, again scroll down to the bottom, and add a comment: run mock community data through seq.error to get the sequencing error rate. Okay, great, so we'll do code/mothur/mothur code/get_error.batch. Again, we don't want to run the whole script right now, because it would take forever; we'll come back and do that at the very end. So we'll copy this and paste it in. So the next step
in our mothur pipeline is to run get_shared_otus.batch. Let's open that up and see what it looks like. We see that we're going to remove our mock community samples, run the data through clustering, make a shared file, and run classify.otu. The other thing to keep in mind is that way back when we ran make.contigs, we used the prefix stability; if you change stability there, you'll have to change it in all of these batch files, but the only place you would change it is in the first command, from stability to something else. So we'll get out of that and do nano analysis_driver.bash, and we'll add a comment that says something like: run processed data through clustering and making a shared file, then code/mothur/mothur code/get_shared_otus.batch. Again we'll copy that and paste it in. So now we've run everything through mothur and we've gotten our shared file, which, if you're familiar with mothur, you'll know is the file we use for all the downstream analysis, looking at things like alpha and beta diversity. We also have our error rate data. A couple of things that we still need to do include generating an ordination for the figure associated with the paragraph that we pulled out of the manuscript, and we also need to rarefy the data to get the number of OTUs that we're seeing in our samples. Let's go ahead and commit what we've done so far: git status shows we've modified our analysis_driver.bash file, so we'll do git add analysis_driver.bash, git status, git commit -m "Load pre-baked mothur batch scripts to driver script", git status. All right, we're all up to date. So the next thing we'd like to do is build an ordination diagram, and to do that we're going to be using our shared file. The mothur file names, as you may know, can get pretty long, so let's go find that file: if we do ls data/mothur, with the pattern ending in shared, so
we'll do ls data/mothur/*.shared, and we see this long file name. If we copy it, it'll make it a lot easier to enter the command. Let's make a new file: nano code/get_nmds_data.batch. In this script we're going to run the mothur commands that generate the file with the NMDS data, which we'll eventually plot using R in the next tutorial. We'll do set.dir(input=data/mothur, output=data/mothur), then set.seed(seed=19760620), and then dist.shared, where we add our shared file name: shared= with that long name pasted in, and then calc=thetayc, subsample=1000, I'm sorry, 3000, iters=100, and processors=4. If you don't know what all this means, don't worry about it; this isn't a tutorial to teach you mothur, but rather to show you how you can use mothur commands within a batch script. If you're interested in learning more about this, go back to that MiSeq SOP page and a lot of it will make sense. Finally, we can do nmds(phylip=current, maxdim=2). We'll save this and quit out, and let's go ahead and test it by doing code/mothur/mothur code/get_nmds_data.batch. When we run this, it runs through, generating the distance matrix for the thetayc beta diversity calculator and then using that to generate the NMDS data; we'll play with that NMDS data in a subsequent tutorial when we plot it using R. So we need to add this to our analysis_driver.bash file. I'm going to up-arrow to get the previous command, highlight it, hit Ctrl+C to get off that line, do nano analysis_driver.bash, and come down to the bottom to add a comment: generate data to plot NMDS ordination. Then we need to commit this change: git status, and we want to do git
add code/get_nmds_data.batch; we also want to add the analysis_driver.bash file. Git status shows both of those have been staged, and we can now do git commit -m "Generate NMDS ordination data". Excellent. So again I need to get the long shared file name: if I do ls data/mothur/*.shared, I get this long beast of a name, which I'll highlight and copy, and then I'll do nano code/get_sobs_data.batch. Sobs is our shorthand for the number of observed species, or OTUs, or taxa. Again we'll do set.dir(input=data/mothur, output=data/mothur), set.seed(seed=19760620), and then summary.single with shared= set to that beast of a name. We'll do calc=nseqs-sobs, which will tell us the number of sequences each sample was rarefied to and the number of observed OTUs, and then we'll say subsample=3000 again, which will rarefy the data to 3000 sequences per sample. We can save that and then do mothur, I'm sorry, code/mothur/mothur code/get_sobs_data.batch. Excellent, that ran as well, so we're going to copy the code that we ran into our driver file, and we can go ahead and commit that too. Great. Let's look at the history of our commits with git log, and we'll see that we're keeping pretty good track of where we are in our data analysis pipeline. If we wanted to go back to any point in our code's history, we could roll the repository back to a previous commit, but we're pretty happy with where we are, and this is looking really good. To get out of this view of the log, you can type q, which brings us back to the command prompt. So we've been doing a lot of copying and pasting into and out of our analysis_driver.bash file. What we'd like to do is make sure it all works, but I'm going to up the game a little: I'm going to delete my directory and see if it all works from there. There's one thing to do before that: I'll do git status, and it says that
my branch is ahead of origin/master, which is the GitHub version, by nine commits, so use git push to publish your local commits. So I'm going to go ahead and push: I'll do git push and enter my credentials. Okay, there we go. If we come over to our GitHub page, I'll find my Kozich reanalysis repository and make sure everything's been updated; sure enough, my analysis_driver.bash file is here, and if I click on it, I see all my code, which looks great. I'm going to go back to my directory and give it one last look. I'm nervous about doing this. I'll do cd ~, ls, and then rm -rf on the project directory: 3, 2, 1, delete. ls. Oh no, it's gone! What do I do? Hopefully you remember from last time that there's a command in git called clone. We can do git clone, and I can come back to my repository here, click on the green button, copy the link, and paste it in. Then I can cd into the project directory, ls, and see that everything's there; but if I do ls data/raw, nothing's there, right? So we need to regenerate all the data, all the analysis. I'm going to go to lunch, so I'll run this in tmux. We do tmux, and then bash analysis_driver.bash, hit enter, and away it goes. Remember, we can detach from the session with Ctrl+b d, and we're back here; I can exit out, come back later, reattach, and see where things are. Note that for now we do not want to stop this from running. While the script was running, I realized that we never added instructions to the analysis driver script for how to get a copy of mothur. So what I did was go into the code README, copy the instructions for installing mothur, open up analysis_driver.bash, and paste them at the top of the file. Because we were getting error messages that it couldn't find a copy of mothur, we now have to rerun everything again. This is exactly why we write these types of scripts: we want the analysis to be reproducible. If you were to download this repository and it didn't have
the mothur code installed, you'd be in a world of hurt. It's great that we reran it; it's great that we saw we were assuming a dependency, mothur, was present when it wasn't. So we need to go ahead and run it again. I've already committed the change, so let's run bash analysis_driver.bash and come back in another hour. All right, let's try that again: we'll do tmux a, and we see that it's getting to the end; I think it's in the rarefaction step, so we'll just sit with it for a minute. It looks like everything's working well, and voila, it is done. Great! Isn't it awesome that we can automate our analysis? This is looking pretty darn close to my April Fools' joke of a mothur command called write.paper. You might think it's funny, but in two more tutorials we'll actually write something that looks a lot like write.paper. Look at the project that you're currently working on in your lab: how automated is the analysis? Do you have a single file that describes the workflow? Even if it isn't full of bash commands, it would be great to have a document that lists the steps and the commands one would need to follow to go from raw to processed data. One thing you might notice about this approach is that it doesn't leave a lot of room for tools with a graphical user interface, like Microsoft Excel. Those tools are fine, and I don't mean to throw shade at them, but hopefully you can see by now their limitation for doing a reproducible analysis: it's really hard to document all of the mouse motions and the toggles that need to be set to repeat an analysis in that environment. In the next tutorial, we'll see how we can use scripting languages like R to further automate our analysis.