I believe you can answer your own data analysis questions. Do you? You should. Stick around for another episode of Code Club. I'm your host, Pat Schloss, and it's my goal to help you grow in confidence to ask and answer questions about the world around us using data.
Normally, when we think about computer programs, we think of something like a web browser or an app on your phone. In the past few episodes, we have used a variety of programs, including Git and Atom. But a computer program is really only a set of instructions telling a computer how to do something. It can be simple or quite complex. I think of a computational research project as writing a computer program. At first, it doesn't seem very complicated: a few commands to get my raw data, a few more to process the raw data, and finally several commands to generate figures and run statistical tests. But by slowly building up the project, it becomes a pretty complicated computer program. This is not a metaphor. We have the ability to piece together different programs, as well as programs we write, to instruct the computer how to convert raw data into a finished product. You might think you can't possibly write your own computer program. You can.
In today's episode, we'll take the first steps of converting our work from the past few episodes into small programs that we can execute from our command line interface. We may not always appreciate it, but when we're working at the command line, we're using a terminal program and also a program that provides the command line interface, called Bash. Today, we will learn how to write simple programs, also called scripts, to automatically download our data and put it in the correct location. As we go forward through our project, we will use this approach to create an automated and reproducible workflow. Writing these types of programs is critical to achieving our overarching goal of understanding the sensitivity and specificity of amplicon sequence variants.
Even if you don't have a clue what amplicon sequence variants are, I'm sure you'll get a lot out of this episode. Also, at the end of the episode, I have several exercises for you to work on. If you haven't been following along but would like to, please check out the blog post that accompanies this video, where you can find instructions on catching up. Now, please take the time to watch today's episode, follow along on your own computer, and attempt the exercises. Don't worry if you're not sure how to solve the exercises. At the end, I'll provide solutions. In the notes below this video is a link to the blog post for today's episode. This will include installation instructions, reference notes, and links. They're meant to be a supplement to the material in the video. Finally, don't forget to like and subscribe to the Riffomonas channel. Please click on the bell icon when you subscribe so that you're notified when the next episode is released.
In the last several episodes, we've worked on downloading our raw data and our reference files, as well as documenting the steps that we took in our README files. Now, that's great because we have documentation of how we got them, but even better would be an automated script that does all those steps for us when we run it, so we don't have to worry about copying and pasting or retyping things and causing errors. The goal for today is to create an issue, work through it, and then close it, converting the information in our README files into automated scripts. Along the way, we'll learn quite a bit about bash scripting and different tools that we can use to make our analysis more robust.
So I'm over here in GitHub, in my project repository. I'm gonna go ahead and create a new issue, which will be to convert instructions from README files into bash scripts. Okay. And I think this is pretty self-explanatory, but let me just list out what we need to be sure that we've included in our bash scripts.
So we need to download the SILVA SEED reference and download, or I guess I should perhaps say download, extract, and place, the rrnDB files. And I'll go ahead and use our cool markdown check box from last time. I think there needs to be a space between those brackets, and if I preview it, that looks good. Okay. So submit that issue. This is issue eight.
So I will come over into my terminal, and I need to navigate to my project home. I have that in Documents, under Schloss_rrnAnalysis_XXXX_2020. git status: I'm on the master branch. Everything is good, we're up to date. I'm gonna go ahead and do git branch issue_8 and then git checkout issue_8. So it says I'm on branch issue_8. Again, just be careful. I always do a git status and an ls to see where I'm at, and we're good to go. Okay.
So again, what we wanna do is create bash scripts that download the various files that we have. I'm gonna use Atom to do all this work, because working in nano as a text editor is a real pain in the butt. All right. So the first thing I wanna do, as a test, is create a new bash script in my project root directory that I'll call hello.sh. And this is gonna be the first bash shell script that we run. The first thing that our shell scripts need is what's called a shebang, S-H-E-B-A-N-G, shebang, like the whole shebang. I don't know if those expressions are related, but that's pound sign, exclamation point, then /usr/bin/env bash. So when we run this script, the shebang line tells our computer what to use to understand the code that follows. And whenever you learn a new programming language, the first program they have you write is a program that prints out hello world. So we'll do the same. We can use echo, which is a bash program or function that outputs whatever follows the word echo. So we'll say Hello world, exclamation point, and I'll go ahead and save this, okay?
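As a minimal sketch, the hello.sh script described above might look like the following. The greeting variable is not part of the episode's version, which passes the text straight to echo; it is added here only for illustration.

```shell
#!/usr/bin/env bash
# hello.sh: a first bash script.
# The shebang line above tells the computer to interpret this file with bash.

# "greeting" is a hypothetical variable added for illustration.
greeting="Hello world!"

# echo prints whatever follows it to the terminal.
echo "$greeting"
```

Saved in the project root as hello.sh, this can be run with bash hello.sh even before the file has been made executable.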
So again, our shell script is very simple. It has two pieces of information. The first is the shebang line, this first line here, which again tells our computer what program to use to understand the code that follows. And the second is our instructions. In this case, we're using the echo function to tell our computer to say hello world. So I'll save this and I'll come back to my terminal. And if I ls, I now see that I have hello.sh.
Something I wanna introduce to you is another argument for the ls command. I think in the very first episode of this series on amplicon sequence variants, I talked about ls. So if I do ls -l, that gives me a long listing of the output. Now, normally I do ls -lth. The t means sort the output by the time that it was last modified, so the most recent stuff is up top. And the h means to put the file sizes, in this column here, in more human readable output, okay? So hopefully you can see slightly different output. Initially it was alphabetical, with capital letters first, followed by lowercase. And here, again, we're sorting it by date modified.
One of the other things that you'll see on the left side here is a series of information about the files. If you see an r, that means that it's readable. If you see a w, that means it's writable. We don't wanna go too far into it, but it also gets into different permissions of who can read, who can write, who can do whatever. And if you see a d, that means that it is a directory. Okay, so being readable and writable isn't enough. We want our script to be executable. So what we can do is chmod +x hello.sh. What this means is modify hello.sh to be executable. The x means executable. Again, I do ls -lth, and now I see that I've got these x's in this first column for my file. What that allows me to do now is, if I do ./hello.sh, it says hello world, okay?
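The permission change just described can be tried safely with a sketch like the one below. It assumes a throwaway directory created with mktemp rather than the project root, and the workdir and output variable names are made up for this example.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so nothing in the project is touched.
workdir=$(mktemp -d)

# Write a two-line hello.sh, as in the episode.
printf '%s\n' '#!/usr/bin/env bash' 'echo "Hello world!"' > "$workdir/hello.sh"

# Long listing: r and w appear in the first column, but usually no x yet.
ls -lth "$workdir"

# Add the executable bit, then list again: x now appears in the first column.
chmod +x "$workdir/hello.sh"
ls -lth "$workdir"

# Run the script by its path and capture what it prints.
output=$("$workdir/hello.sh")
echo "$output"
```

Comparing the two listings side by side is a quick way to see exactly which characters chmod +x changes.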
If I don't want it to be executable, I could do chmod -x and it removes the executable permission. And then if I run it, it says permission denied. So again, we want it to be executable. So chmod +x hello.sh, then ./hello.sh. Very good. We've written our first program in bash. Well done.
That's pretty silly, though. That's not gonna get us very far in our data analysis. If we come back and look at our data/references directory, at that README file, we see that the list of instructions we created included these three lines: to download our SILVA reference file, to create a directory to store it, and then to extract into that directory. What I wanna do is, in my code directory, create a new script that I will call get_silva_seed.sh. Again, we don't want spaces in our file name, so I use underscores, and this is going into the code directory. I can then copy those lines of code, and again, I can put my shebang line at the very top: #!/usr/bin/env bash. We save that, that looks good, and I can then do chmod +x code/get_silva_seed.sh. And now I can do ./code/get_silva_seed.sh and it runs. It says the file exists, the tgz file is already there, so it's not gonna retrieve it. The directory already exists, so it's not going to create it. And then it extracts those files into my directory. Very good, that's great. So again, we've automated updating our SILVA SEED file.
Now, that's great, but this doesn't tell us a whole lot. If I were to come back to this file in a few weeks or a few months, I might not really remember what's going on here. So what I can do is write a comment to indicate who wrote it, the name of the program, what the inputs are, what the outputs are, and a little bit about what it does. At the top I always like to put this.
So I might put author, Pat Schloss; inputs, none; outputs, places the SILVA SEED reference alignment into data/references/silva_seed. Also, we like these to be maybe only 80 or 100 columns wide. And I'll also put in here some information about what's going on: we download this version of the SILVA reference to help with aligning our sequence data. This is version 138. I think this came out in 2020, maybe 2019, but again, for illustrating comments, it doesn't really matter. Because the tgz file contains a README file, we extract it to a directory within data/references, okay? That's enough, right? And again, it's enough information that I can come back to it and understand what was going on here. Very good. So I can save that, and I've already modified the permissions, so if I do ls -lth code, get_silva_seed.sh is still executable, and if I do ./code/get_silva_seed.sh, everything's good, all right? So great, we've written our first real script for downloading and processing data.
Let's turn now to those rrnDB files. So I can close this and close this, and if I go to the README in data/raw, I have this series of four sets of code to download different files. I'm gonna start with the FASTA file, which is right here, and create a new file that I will save in code, and I will call this get_rrndb_fasta.sh. I'll copy those lines in there, and again, I need my shebang line: #!/usr/bin/env bash. And you can see that Atom is very friendly in providing the shebang line for us, so we perhaps don't have to remember it, and this works great. And I can again say author, Pat Schloss; inputs, none; outputs, and this is going to output rrnDB-5.6_16S_rRNA.fasta into data/raw.
chmod +x code/get_rrndb_fasta.sh. If I look in my code directory, I see that those are both executable now, and so I can do ./code/get_rrndb_fasta.sh. The archive was already there, so it didn't retrieve it; it extracted the file, and we're good to go. All right, so we'll keep trackin' along. Because this is gonna be very similar for the other programs, I'm gonna copy this, open up a new script, and then save it into code as get_rrndb_tsv.sh. I'm gonna get the TSV file, so this will download the TSV into data/raw, and I need to change this to tsv and this to tsv, okay. All right, and so I will then chmod that and do my ls -lth to make sure everything looks good. Oh, in code, and I see that they're all executable, okay. So I'll do ./code/get_rrndb_tsv.sh, and I think maybe I screwed something up here. Let me just double check what I copied over. Yeah, it still had 16S_rRNA.fasta in there, so we don't want that. So sometimes copying and pasting isn't a good thing. Okay, so let me rerun that. That works great.
So we could keep doing this. But perhaps you notice that the only difference between these two files is the file name. I notice I've got the -P, or the -nc, on the opposite side over here for one of these. That doesn't really matter. Really the only thing that's different between them is the file that I'm downloading, and it seems kind of silly to have four or five scripts that all do the same thing where the only thing changing is the file name. So next we're going to learn how we can bring in arguments from the command line to make our code DRY, D-R-Y, which is short for don't repeat yourself. If you compare these different scripts, I'm repeating myself an awful lot. And the problem with not being DRY is, say I wanted to update one of the arguments, or the location perhaps. Perhaps instead of saving it to data/raw, I decide I want to save it to data/rrndb.
Well, then I have to update all of my scripts. That's a real pain. If I have one script to download all these files, but where I change what I'm passing to it, then it's much easier to maintain that code. The values that come in from the command line are assigned to variables, and the first one is named $1. And so we could then do, I'll say file, or rather I'll say archive=$1. Now, to demonstrate what's going on here, I will comment out the download lines and say echo "$archive", with the variable in double quotes. So when I run this, it's not going to download the files. All it's going to output is the file name I give it. So I've got get_rrndb_tsv.sh. I can do ./code/get_rrndb_tsv.sh, and let me give it a file name. I'll say data/raw/rrnDB-5.6.tsv. And all it should do is output this file name. And sure enough, that's what it's done. So again, we can now use that variable, archive, wherever we want to insert the archive name. So like right here, we could replace that with archive and it will output it.
There are two things I want you to notice about how I called it. The first is that there's a dollar sign before archive. That tells bash that it's a variable. The other thing is that you don't necessarily need those quotes for it to work, but it's safest, right? If you happen to have a space in your file name, then those spaces are going to wreak havoc with what's happening. Alternatively, you might think, well, maybe I'll just use single quotes. And what you see happens with single quotes is that instead of putting the value of the variable in for $archive, it outputs the literal text $archive. So in these cases, we generally want to use double quotes. Okay, so I'm going to uncomment those lines, and wherever I have the file I'm downloading, I'm going to put in "$archive" in double quotes. And I'll put that there as well. Very good. I can get rid of that echo. I will save that. And so now if I run this, it does everything.
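The behavior of $1 and of the two kinds of quotes can be checked with a short sketch. The default file name and the count_words helper below are hypothetical, added so the example runs on its own and makes the effect of a space visible.

```shell
#!/usr/bin/env bash
# Inside a script, $1 holds the first argument given on the command line.
# The default after :- is a hypothetical stand-in so the sketch also runs
# when no argument is supplied; it deliberately contains a space.
archive="${1:-my file.tsv}"

# Double quotes: the variable is replaced by its value, spaces preserved.
double="$archive"

# Single quotes: no substitution, the text is kept literally.
single='$archive'

echo "double quotes give: $double"
echo "single quotes give: $single"

# Hypothetical helper: reports how many arguments it received.
count_words() { echo "$#"; }

# Quoted, the value always arrives as one argument.
quoted=$(count_words "$archive")
# Unquoted, bash splits the value at spaces into separate arguments.
unquoted=$(count_words $archive)

echo "quoted: $quoted argument(s), unquoted: $unquoted argument(s)"
```

With a file name that contains a space, the unquoted form hands a command more than one argument, which is the kind of havoc the double quotes prevent.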
Ah, it didn't work, because I gave it my local path rather than the name of the file or the name of the archive. And maybe I'll put in .zip and that should work. And so that works. But you know what? I don't really want to put in .zip. I want to put in the name of my local file, because that's going to be easier for me to remember. So what I can do is add .zip to the end of these, and save it. And now if I put in the name of the TSV file, that should work. And similarly, if I put in the name of the FASTA file, so 5.6.fasta. Okay, so I'm getting an error message there, and I think I've got the wrong file name. So maybe that's the archive, or the name of the file inside the archive. Let's see, if I do ls data/raw, I see, nope, it's 16S_rRNA.fasta. So I'm giving it the wrong file. So again, if I do rrnDB-5.6_16S_rRNA.fasta, that works. And similarly, if I did rrnDB-5.6_pantaxa_stats_NCBI.tsv, that works as well. Very good.
So I've been using this code/get_rrndb_tsv.sh. I also have that other one that I created. So I want to give these perhaps better names. What I'll do is first get rid of my get_rrndb_fasta.sh file, move that to the trash, and I will rename this one to get_rrndb_files.sh. The input is the name of the file extracted from the archive, without the path. And that then outputs the appropriate file into data/raw. And I think we're good. So I'll go ahead and save that. And again, I ls my code, ls -lth code. I see that I have get_rrndb_files.sh and get_silva_seed.sh, and both of those are executable, and I can test this again. But you know what, I've already got the stuff in there, so that's really not that hard to do. If I do ls data/raw, why don't I go ahead and delete my TSV files and then see if I can regenerate them. So I'll rm the TSV files in data/raw and their zip archives. Now let's do the test: ./code/get_rrndb_files.sh, and I'll get rrnDB-5.6.tsv. It downloads it, extracts it. It's good.
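Putting these pieces together, a script along the lines of get_rrndb_files.sh might be sketched as below. The base_url value is a placeholder and not the real download location, the default target exists only so the sketch runs without an argument, and the commands are printed rather than executed so it can be tried without a network connection.

```shell
#!/usr/bin/env bash
# Author: (your name here)
# Inputs: name of the file to extract from the archive, without the path
# Outputs: the requested file, placed in data/raw/

# Hypothetical placeholder; substitute the real download location.
base_url="https://example.org/downloads"

# The default is only there so the sketch runs without an argument.
target="${1:-rrnDB-5.6.tsv}"

# The archive name is the target file name with .zip added to the end.
archive="${target}.zip"

# Dry run: build the two commands as text instead of executing them.
# wget: -nc skips the download if the file exists, -P sets the directory.
download_cmd="wget -nc -P data/raw/ ${base_url}/${archive}"
# unzip: -n never overwrites existing files, -d sets the output directory.
extract_cmd="unzip -n -d data/raw/ data/raw/${archive}"

echo "$download_cmd"
echo "$extract_cmd"
```

Because the file name appears only through the target variable, a change such as moving the downloads out of data/raw would be made in one script instead of four.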
If I do ls data/raw, I see my TSV files are there. So again, the advantage of what we've done is two things. First, instead of having to go into those README files to copy the code into the terminal to run it, we've got it in a script. And if I want to get the files, I run the script. It's much more convenient than having to go into those README files. Second, we had those repeated lines, right? If we again look at the data/raw README, we have four files we're downloading, and the commands are the same except for this chunk, right? And so, following the principle of DRY, don't repeat yourself, we created one function, one file, one bash script that does this, but where we feed into it the name of the output file, which is pretty snazzy and, again, makes it easier to maintain the code.
So before I forget, I need to commit and push my changes. So again, git status. I need to get rid of that hello.sh. I don't want to keep it around. So I'll do rm hello.sh, then git status. I'll go ahead and git add both of these. All right, and I'll do git commit. I'll write "Automate downloading rrnDB and SILVA files". Very good. I'm on branch issue_8, as it says. I'm gonna go back to master and merge in issue_8. So git checkout master, git merge issue_8. That looks good. I forgot to close the issue in the commit message, but that's okay. So we'll go ahead and do git push. And now if I come back to my issue tracker, again, I can check these off and come back and grab my commit, where'd it go, right there. And I'll write "Closed with" and that. You'll see that it knows that that was the commit and automatically formats it as a URL. So I'll go ahead and comment and close. And we've checked off another issue.
So as always, I have a couple of exercises for you to work through to practice the skills that we just covered in the tutorial. The first asks you to create and close an issue to write a script that installs mothur.
You'll need to borrow code that we used in a previous episode, which was in the code README. The installation script should be stored in code and retained under version control, but we don't want the code in mothur to be under version control, so that should still be protected in our .gitignore file. We also want to double check that running this from the command line works before closing the issue. And so I've given you some mothur code here that will work without causing any problems. You don't need to know mothur to be able to run that line of code. In the second exercise, I ask you to work on and close issue four. You're gonna create an align_sequences.sh script that runs that line you just used for practice to make sure mothur was installed in the right place. Run it from the command line to make sure everything works. Go ahead and add *.logfile to your .gitignore file so that you're ignoring the mothur log files that are created. Again, pause the video, work on it on your own, and then come back and press play and I'll show you how I resolved those problems.
Another issue we want to create is installing mothur. So I'll go ahead and say "Write script to automate installing mothur". And I'm pretty sure those instructions are in my code README. Well, the URL is in there. So I'll go ahead and copy that and say this information was in the code README. So we're gonna work on issue nine, and we need to create that branch. So I'll do git branch issue_9, git checkout issue_9. It says that we're on branch issue_9, nothing to commit, working tree clean. We're in good shape to get moving on this issue. I'm gonna go ahead and copy that link and go there. And you know what? I'm gonna double check that this is the most recent release, because there may have been a new release since we first did that. There is now a new release, version 1.44.2. So I'm gonna use that. And I'm using Mac OS X.
You use the one that works for you. I'll go ahead and copy that link, and I will create a new script that I will save in code. I will save this as install_mothur.sh. And again, I'll paste that down in there. So I'll do my shebang for bash. And I'm gonna double check what I had before with get_rrndb_files.sh. I'm gonna just copy those lines. I'm not gonna use this exact code, but I don't wanna reinvent the wheel here. So I do wget -P code/mothur -nc and that link. And then I will do unzip -n -d code/mothur, and I'll do code/mothur/ followed by this file name. I'll get rid of those lines. I'll save this. I'll make sure that my install_mothur.sh is executable. And then I'll do ./code/install_mothur.sh. So it looks like it's doing stuff. All right. So it's complaining that some of the stuff exists. So let me do ls code/mothur. And I see some of these other things are in here. So I'm gonna go ahead and delete these to get rid of them. I'm gonna do rm -rf, which is a recursive, forced removal, and I'll do code/mothur. And let me rerun that and see where it goes. Sometimes this takes a couple of iterations to get it to work. ls code/mothur. So it put mothur into mothur. That's not exactly what I was hoping for. So let me try this one more time. I'll get rid of that. And I will put that there. Okay. And I'm gonna have it extract into code. And I think this should work. Again, once we get this figured out, we don't have to worry about doing it again. And I can see it has put it into mothur, in the right place. Everything looks good and we're in good shape. So we wanna test it, and I will copy the code from the blog post that accompanies this to make sure everything works well. And mothur's clearly running, so it's in the right place. Everything is running. This might just take a few moments. All right. So that took about seven or eight minutes to run on my computer.
There's a warning message that comes out at the end that was expected. It tells me that there were sequences that needed to be reversed. We got the reverse complement because these sequences were pulled out of genome sequences. Some are going in the forward direction. Some are going in the reverse direction. And we use that flip=true argument in mothur's align.seqs function to get them all into the right orientation. That's another benefit of aligning our sequences: it makes sure they're all going in the same direction. Very good.
So we have mothur installed, and something that occurs to me is that I need to go ahead and put in an author, and inputs, none, yeah, no inputs. And then outputs, mothur installed in code/mothur, and some notes to say that the zip archive contains a directory called mothur, so we can extract it directly to code. Very good. And we also want to update our README, because this is now version 1.44.2. I will say that code/install_mothur.sh installs mothur. Another thing that I forgot we needed is wget, as another dependency from a few episodes back. So again, save that, save that, save that. We want to modify our .gitignore. We see that code/mothur is already being ignored, which is good. And we also want to put in *.logfile and save that, to ignore those files. And you see that they just turned gray as I saved that. Anything in Atom that's grayed out like that is being ignored.
So I'll go ahead and come back to my terminal and do my git status. Everything looks good there. So I'll do git add on .gitignore, the README, and then code/install_mothur.sh. git status, git commit: "Provide script to install mothur to correct location". Very good. git checkout master, git merge issue_9. And I forgot to put the closes in again. Maybe I'll remember for issue four. And then we can do git push. And I will grab this commit and come back to my issue tracker, issue nine. And I'll say "Closed with" that.
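The .gitignore change from this exercise can be sketched as follows. The exact pattern used for the existing mothur entry is an assumption (the episode only says that code/mothur is already ignored), and the sketch writes to a throwaway directory instead of the real project.

```shell
#!/usr/bin/env bash
# Throwaway directory standing in for the project root.
project=$(mktemp -d)

# Assumed existing entry: the mothur install directory is already ignored.
printf '%s\n' 'code/mothur/' > "$project/.gitignore"

# Append a pattern so git ignores any file whose name ends in .logfile.
printf '%s\n' '*.logfile' >> "$project/.gitignore"

# Show the resulting ignore list.
cat "$project/.gitignore"
```

In the real repository, running git status after a mothur run is a simple check that the log files no longer show up as untracked.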
And everything pushed up well. So I'll go ahead and comment and close the issue. So again, we're in good shape.
The next thing we want to work on is issue four. So again, I'm on master, right? I will then do git branch issue_4, git checkout issue_4. All right. And so again, for this issue, what we want to do is use align.seqs to align the sequences in the rrnDB FASTA file to the SILVA reference alignment. The code that we wanna run is what we previously ran in mothur. So if I up arrow, I should get back to that. I'll go ahead and copy that. And I'm gonna paste this into my issue for now, to note that this was the code we wanted. I'll go ahead and copy this. I don't need to copy everything. And I will create a new script that I'll call align sequences, so this will be code/align_sequences.sh. And that's the code we're gonna run. And again, we need our shebang, so #!/usr/bin/env bash, then author, Pat Schloss, and inputs. There are gonna be two inputs, so that one and this one. I'll put these on separate lines so they're easier to read. And then outputs: it's gonna be this .align file, I believe. We'll have to double check that. Okay. I'll also point out what we saw earlier, that we need to include flip=true to make sure all sequences are pointed in the same direction. So I think that will be good. I will go ahead and make it executable, so chmod +x code/align_sequences.sh. And now I should be able to do ./code/align_sequences.sh. It'll run, and again, this will take seven or eight minutes. And then we'll be ready to close out the issue.
Everything ran. We got that one warning message, but that was telling us that it had to flip the sequences. So we're in great shape. We'll go ahead and do git status. And I'll add code/align_sequences.sh. Commit: "Align rrnDB sequences to SILVA SEED reference alignment". All right. So git checkout master. I stopped myself. We can edit our commit message if we have not pushed.
So I can do git commit --amend. And I will then say closes #4. And then I can save that and exit out. Phew, I caught myself. Again, to amend or change your commit message, you can use git commit --amend. That will then open up nano or whatever editor you're using. So git checkout master. git merge issue_4. Very good. git push. And we see that we've closed the issue. So we're back down to one issue now. Very good.
Thanks again for joining me for this week's episode of Code Club. Be sure that you spend time going through the exercises on your own to help reinforce your new skills. Even better would be for you to take the ideas we've worked through today and think about how they relate to your current projects. Perhaps you could experiment with converting some of your common data analysis processing steps into a script. I'd love to see how you are adapting what I have covered in this and other Code Club episodes. Also, please let me know what types of data analysis questions you have and I'll do my best to answer them in a future Code Club. Finally, please be sure to tell your friends about Code Club and to like this video, subscribe to the channel, and click on the bell so you know when the next Code Club video drops. Keep practicing and we'll see you next time for another episode of Code Club.