I believe you can answer your own data analysis questions. Do you? You should stick around for another episode of Code Club. I'm your host, Pat Schloss, and it's my goal to help you grow in confidence to ask and answer questions about the world around us using data. In the last episode, we converted the notes we had been keeping in our README files into bash scripts. This was a great step forward towards having an automated and reproducible analysis. You'll recall that we created scripts for downloading our reference files and files from the rrnDB. In the exercises, we wrote scripts to download and install mothur and to use mothur to align the sequences from the rrnDB against the reference alignment files. This was great because we could call the scripts from the command line and everything would be done for us. From my experience, the challenge then becomes remembering the files and scripts that downstream scripts and files depend on. For example, I anticipate that the rrnDB might update its database this fall. What else will I need to run if I want to update the rrnDB files in my project? Well, our sequence alignment depends on the sequences from the rrnDB and the script that downloads the rrnDB files. So I would have to run the download script before running the script that aligns the sequences. Ask yourself, what else would I need to run if SILVA or mothur puts out a new release, which they likely will before the project is done? This is a lot to keep track of, and our project has only just begun. Therefore, we need a tool that will help keep track of all of these dependencies. A few years ago, I was at a conference on reproducible research methods, and I heard Karl Broman, a statistician at the University of Wisconsin, give a talk. He said that make is central to ensuring the reproducibility of his work. That really got my attention. This was especially true since I had only ever used make to install software from source code.
But the more I thought about it, the more using make made sense to me. Make, or more officially, GNU Make, is a program that keeps track of dependencies. The developer, that's you, writes rules that tell make what files need to be created. That file is called a target. The rule also indicates the dependencies, or as they're called, prerequisites, for building the target, as well as the recipe for generating that target. Thinking back to the example from the last episode, the target is the alignment file generated from mothur. The prerequisites include the rrnDB FASTA file, the SILVA seed reference alignment, and the script that ran the alignment, and the recipe involved running that script. If the target file doesn't exist, or if any of those dependencies are newer than the target file, then make will rerun the rule to generate the target. But what if the script to download the rrnDB file changes? If I tell make to generate my alignment, it will see that the download script is newer than the rrnDB FASTA file, and it will then re-download the rrnDB FASTA file. This will then trigger rerunning the script to generate the sequence alignment. GNU Make typically comes along with bash, so it's available for people using Mac OS X, Linux, and the Windows 10 bash system. I know GNU Make, so I'll be using that in our analysis. An alternative is Snakemake, which also seems pretty powerful and popular. But when I was trying to learn it, I got frustrated just installing it and gave up. So perhaps down the road, I'll revisit Snakemake, and we can do a head-to-head comparison. For now, I'm pretty happy with make and think you will be too. As I go forward with this project, using make will become baked into our workflow, pretty much like git flow has become. As I mentioned, when we install software from source code, we typically enter make install. My goal for this project isn't to install software; rather, it's to generate a manuscript that I can submit to a journal.
At the end of this project, I hope to be able to write make manuscript.pdf and have it do everything from downloading my data all the way through generating a PDF version of my manuscript and its figures. If we can do that, then we'll have great confidence that our analysis is reproducible. But it would also be great to rerun this analysis with a new version of the rrnDB, SILVA, or mothur. This would allow me to see how robust my analysis is to updates in the database and these other files. Not only is reproducibility great for transparency and getting things right, but it's also great for improving the rigor of our work. Given the growing popularity of amplicon sequence variants in microbiology, if I'm going to critique their use, then I really wanna make sure that my analysis is rigorous and that anyone can reproduce what I'm doing. But first, I need to show you how to use make. In today's episode, we'll ease into the use of make, and in future episodes, we'll see how we can improve the sophistication of our use of this great tool. Be sure to like this video and subscribe to the Riffomonas channel so you know when future episodes are released. Even if you don't have a clue what amplicon sequence variants are, I'm sure you'll get a lot out of this episode. Please take time to watch today's episode, follow along on your own computer, and attempt the exercises. Don't worry if you're not sure how to solve the exercises. At the end, I'll provide some solutions. If you haven't been following along but would like to, welcome. Please check out the blog post that accompanies this video, where you will find instructions on catching up, reference notes, and links to supplemental material. The link to the blog post for today's video is below in the show notes. I'm over in my GitHub repository, under my account, pschloss, and then Schloss_rrnAnalysis_XXXX_2020. Again, that XXXX is a placeholder for if we ever end up submitting this to a journal. That's where the journal title would go.
That's my naming convention. Anyway, let's go over to Issues and we're gonna file a new issue for creating a Makefile. So I'll click New issue and I'll say create Makefile. And we wanna put the following targets into our Makefile. So we will do the SILVA seed reference alignment, the rrnDB FASTA file, the rrnDB TSV file, the rrnDB NCBI taxonomy file, and the rrnDB RDP taxonomy file. And we also wanna put in our mothur align.seqs script to generate aligned rrnDB sequences. So that's our task for today. We won't do them all during the main part of today's episode. I'll save some of them for you to do in the exercises. So this creates issue 10. I'm gonna go over to my terminal and navigate over to my project. So we'll do cd Documents, then the Schloss rrnAnalysis directory, and ls in there. And I'm gonna do git status to see we're on the master branch. And I see that I've got a few leftover mothur log files from a previous analysis. These just kind of take up space. I have .gitignore ignoring these. I did that in the exercise from the last episode. For now, I'm gonna go ahead and remove these. I can do rm mothur, and I don't have to type out the full names for everything or copy and paste them all. If I have a pattern, I can do mothur.*.logfile and that star will match those three different numbers. So if I run this and ls, I now see they're gone. If I do git status, it's not complaining because I'm ignoring those, right? Normally, if you delete a file that's being tracked by Git, it'll tell you that you deleted a file. So we're in good shape. I'm gonna go ahead and do git branch issue_10, git checkout issue_10. And we're switched to branch issue_10. And again, if I do git status, I see I'm on the right branch. I'm gonna go ahead and fire up Atom so that we can create our Makefile within Atom. So I will do atom and a period. I'm gonna go ahead and create a new file and I will save this as Makefile.
And this is gonna be in my project root directory. So again, we need a file name that starts with a capital M and then lowercase akefile. Having this file called Makefile in our project root directory makes it easier to use make. There are ways to call other makefiles or files that contain your make rules. But this, like I said, is the easiest. So I'll save that. And we see now in my file tree on the left side here that the Makefile is stored there. Great. So again, as I said in my introductory remarks, we're creating rules. And a rule consists of a target and then a colon followed by prerequisites. So prerequisite one, prerequisite two, maybe prerequisite three. And then on the next line, indented with a tab, so I'll go ahead and write tab, is our recipe. So this structure is what we're going to use for all of the rules that we're going to create today. Now I've put these in as comments, and you know they're comments because they start with the pound sign. Same kind of thing we saw before when we were making bash scripts in the previous episode. The first rule that I wanna create is a rule and recipe for downloading my SILVA seed alignment. And because I can't remember the path and the full name, I always like to come back to my terminal and use ls to get that name. So I can do ls data/references/silva_seed and I see the file there. So that's great, it lists out what I want here, but I don't wanna have to copy the path and then also copy silva.seed_v138, I'm sorry, this one, the .align file. What I can do is add a star to the end of my ls line and that will include the full relative path from where I am, which will give me this file. So I can come over here and I can put that as my target, and I can do the same thing with the code directory.
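As a minimal sketch of the rule layout just described (target, colon, prerequisites, then a tab-indented recipe), the bash script below writes a one-rule Makefile into a throwaway directory. The names `output.txt` and `code/build.sh` are made up for illustration and are not the project's files; the `printf` with `\t` is only one way to guarantee that the recipe line starts with a real tab.

```shell
#!/usr/bin/env bash
# Sketch only: output.txt and code/build.sh are hypothetical names.
set -euo pipefail

project=$(mktemp -d)   # throwaway directory so nothing real is overwritten
cd "$project"
mkdir code

# Hypothetical prerequisite: a script that writes the file named by its argument.
printf '%s\n' '#!/usr/bin/env bash' 'echo "built by build.sh" > "$1"' > code/build.sh
chmod +x code/build.sh

# Rule layout:   target : prerequisite
#                <tab>recipe
# printf's \t writes a real tab character in front of the recipe.
{
  printf '%s\n'   'output.txt : code/build.sh'
  printf '\t%s\n' './code/build.sh output.txt'
} > Makefile

# Run the rule only if make is installed on this machine.
if command -v make >/dev/null; then
  make output.txt
  cat output.txt
fi
```

Typing the rule in an editor works just as well, as long as the editor inserts a tab rather than spaces in front of the recipe.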
So I can do ls code/* and I see what I want here is the get SILVA seed .sh script in code, and I can then go in here and set that as a prerequisite. I then indent over a bit and I can do ./code, copy that in, and that will then be the recipe for using this dependency, which is the script itself, to generate that target. And I save it, because it's important to save your Makefile before you try to use it. I can then do make, space, and then put in my target, which was right here, run that, and I get an error. So this is a very common error. I know exactly what happened. The error is line 6, missing separator, stop. So if I come to line six, where there's an error, the problem is that I put in spaces rather than a tab. So I'll go ahead and replace that with a tab, save it, and then rerun it and everything works. Now, it kind of surprises me that it ran, because we had already run it last week and everything seemed fine. So what we can do is use that ls -lth that we talked about last time to understand why it's retriggering the rule. I can do ls -lth code and I see that get SILVA seed was last modified on July 20th at 10:17, and if I do ls -lth data/references, I see that my SILVA seed align file was from March 4th. So it's using the timestamp from the file as it was downloaded. And so what it's saying is that the get SILVA seed script is newer than the target, right? The prerequisite is newer than the target, so it says rerun the rule. That's not really what we want. So what we're going to need to do is modify get SILVA seed .sh so that the extracted file gets an updated timestamp. A way that we can address this issue is to use an argument with tar, which we use to extract the files, and that's -m. So I'm going to go into get SILVA seed .sh. I'm going to put it before the f because the f is referring to this archive file.
And so I don't want to put the m after the f because I'll probably get an error. So I'll go ahead and save this. And then I'll come back and rerun my rule. It extracts, everything is good. Let me redo ls -lth data/references/silva_seed. And you'll see that sure enough, my align file now was created on July 22nd, which is the day that I'm filming this. And so that now is newer than the script. ls -lth code. And you'll see that my align file was created at 12:45 and the script was saved at 12:44. So it should be in good shape. So if I rerun this make rule, it says everything is up to date. This is beautiful, right? The only time this isn't beautiful is when you change something, you rerun it, and then problems happen. Generally the problem is that I forgot to save the file that I was modifying. Very good. We have our first rule in our Makefile. Let's go ahead and get our rrnDB FASTA file. So again, we'll do ls data/raw/*. And that will give us this file, the rrnDB 5.6 16S rRNA FASTA. So that's my target. And the dependency is gonna be in code. And this will be get rrnDB files. And again, tab, ./, run that. And I need to give the file as an argument, because you'll remember that last time we also talked about using variables that we can pass in to the script. And that is going to be the name of the file. So this now, if I make this, should download our rrnDB FASTA file and put it into the correct place. So I'll do make and my target. And it says nothing to be done. We're in good shape. Let me just double check again, ls -lth code. My get rrnDB files script was saved on July 20th. And if I do ls -lth data/raw, I see that my FASTA file, where are you? Right here, was created on November 8th, 2019. So it seems a little bit weird that that rule didn't run. And sure enough, I see the blue dot, and I think I forgot to save my Makefile. So I'll go ahead and save that and rerun it. And it reran through everything.
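Going back to the tar fix for the SILVA seed file, here is a rough sketch of what the -m argument changes. It builds a small archive in a throwaway directory and extracts it twice, once without and once with -m. The file names and dates are stand-ins chosen to mirror the ones mentioned above, not the real project files.

```shell
#!/usr/bin/env bash
# Sketch only: reference.align and get_reference.sh are hypothetical stand-ins.
set -euo pipefail

work=$(mktemp -d)
cd "$work"

# Build a small archive holding a file with an old modification time,
# standing in for the downloaded reference archive.
mkdir src
echo "ACGT" > src/reference.align
touch -t 202003040000 src/reference.align      # March 4th, 2020
tar -cf reference.tar -C src reference.align

# A script that was saved more recently than the archived file.
touch -t 202007201017 get_reference.sh         # July 20th, 2020 at 10:17

mkdir plain with_m
tar -xf reference.tar -C plain     # keeps the archived (old) timestamp
tar -xmf reference.tar -C with_m   # -m: stamp the file with the extraction time

for dir in plain with_m; do
  if [ get_reference.sh -nt "$dir/reference.align" ]; then
    echo "$dir: script is newer than the target, so make would rerun the rule"
  else
    echo "$dir: target is newer than the script, so make would leave it alone"
  fi
done
```

The m sits before the f here for the same reason given above: f has to be followed directly by the archive name.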
So now I can look and do ls data/raw. I'll do *fasta. It makes it easier for me to find the file. So this was again created November 8th, 2019. Again, we have the same problem as we had with the SILVA seed file, where the timestamp of the FASTA file is older than the code file, right? And so we see November 8th, 2019. And for our code, for get rrnDB files, this is July 20th. And again, we can see that after we did the extraction, ls -lth data/raw, we're still looking at, where'd you go? We're still looking at files from November 8th, 2019. So what we can do is use a command called touch, and we talked about touch probably in the first episode of this series as a way to create a file. Well, if you touch an existing file, you're also updating the modification time on that file. If I do touch on the rrnDB 5.6 16S rRNA FASTA file in data/raw, and then I rerun my ls -lth, I now see that that file has a timestamp of July 22nd. So I need to add that touch into my script. So I'll go back to get rrnDB files and I will add to this at the end, touch data/raw/, quote, dollar sign archive, right? And now if I rerun my make, it'll run it, and the zip file is already there so it doesn't retrieve it again, but it then extracts it again. If I run make again, it says it's up to date, which is great. And so again, if I do ls -lth data/raw, I see that my first file, from 12:57, is my FASTA file. And if I do ls -lth code, I now see get rrnDB files. It has the same timestamp, but we know that this was saved before we touched our FASTA file. So we're in good shape. Okay, so we see we have to do a little bit of tinkering, or what's called refactoring, of these shell scripts to get them to work well together. Now let's go ahead and do the next step of generating our alignment of this FASTA file. And again, we've already created that and we see it, but if I do ls -lth data/raw/*align, the file that ends in align, we get this path. So this is gonna be our new target, then a colon.
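The touch fix described above can be sketched in plain bash without make at all. The helper `is_stale` below is a hypothetical function that imitates make's basic check (the target is missing, or a prerequisite is newer than the target); it is not how make is implemented, and the file names and dates are stand-ins.

```shell
#!/usr/bin/env bash
# Sketch only: is_stale is a made-up helper that mimics make's timestamp check.
set -euo pipefail

work=$(mktemp -d)
cd "$work"

# Stand-ins: an extracted data file that kept its old date, and a newer script.
touch -t 201911080000 rrndb.fasta          # November 8th, 2019
touch -t 202007201017 get_rrndb_files.sh   # July 20th, 2020

# Rebuild when the target is missing or the prerequisite is newer than it.
is_stale() {  # usage: is_stale TARGET PREREQUISITE
  [ ! -e "$1" ] || [ "$2" -nt "$1" ]
}

if is_stale rrndb.fasta get_rrndb_files.sh; then before=stale; else before=fresh; fi

touch rrndb.fasta   # what the download script does as its last step

if is_stale rrndb.fasta get_rrndb_files.sh; then after=stale; else after=fresh; fi

echo "before touch: $before, after touch: $after"
# prints: before touch: stale, after touch: fresh
```

The data itself is unchanged by touch; only the modification time moves forward, which is all make looks at.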
And now we have two dependencies, right? So the SILVA seed alignment is gonna be a dependency. Actually, we're gonna have three dependencies. And what you'll see, if I make my font a little bit smaller, is that these kind of fit on the same line, but if I have too many dependencies or my dependency names are too long, I can put a backslash at the end of the line to continue it. I can then tab over to make it look kind of nice. Maybe I won't go quite so far. And I can add my other dependencies. Again, including a backslash to continue the list of dependencies. And I also need my third dependency. I misspoke, there's three, which will be the align sequences .sh script, but that's in code, right? Very good. So I'll save that. And then for the recipe I will call ./code/ and the align sequences .sh script, save that, make sure I saved it. I'll go ahead and copy this because we wanna generate that target. And now if I do make that, it complains, right? So it says no rule to make target, the align sequences script in code, needed by the target in data/raw, stop. That looks right to me. So if I get an error like this, it's typically because of one of two reasons: either there's a typo, or I've listed a dependency that doesn't exist and there's no rule to create it. In this case, there's a typo, and really it's the same thing, right? So this file doesn't exist anywhere in code, but also there's no rule to create it. And the problem is that there's a typo and that's why it doesn't exist. So if I remove that s, save it, and now come back and run this, it'll go. Now I would encourage you to not run this on your own and let me do it, because this is gonna take about eight minutes to execute. Well, wonderful, we've gotten through our alignment. And again, if I run make on that alignment target, then everything is good and up to date. So something like this took about eight minutes to run, and sometimes some of our other analysis steps might take even longer.
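To make the three-prerequisite rule and the backslash continuation concrete, here is a small sketch that chains three rules together in a throwaway directory. Every file and script name in it is a stand-in invented for this example, and the stub scripts finish instantly instead of downloading anything or running mothur.

```shell
#!/usr/bin/env bash
# Sketch only: all paths and script names below are hypothetical stand-ins.
set -euo pipefail

project=$(mktemp -d)
cd "$project"
mkdir -p code data/references data/raw

# Stub scripts: each one just creates the file named by its first argument.
for script in get_reference get_sequences align_sequences; do
  printf '%s\n' '#!/usr/bin/env bash' 'date > "$1"' > "code/$script.sh"
  chmod +x "code/$script.sh"
done

{
  printf '%s\n'   'data/references/ref.align : code/get_reference.sh'
  printf '\t%s\n' './code/get_reference.sh data/references/ref.align'
  printf '%s\n'   ''
  printf '%s\n'   'data/raw/seqs.fasta : code/get_sequences.sh'
  printf '\t%s\n' './code/get_sequences.sh data/raw/seqs.fasta'
  printf '%s\n'   ''
  # A trailing backslash continues the prerequisite list on the next line.
  printf '%s\n'   'data/raw/seqs.align : data/references/ref.align \'
  printf '%s\n'   '                      data/raw/seqs.fasta \'
  printf '%s\n'   '                      code/align_sequences.sh'
  printf '\t%s\n' './code/align_sequences.sh data/raw/seqs.align'
} > Makefile

if command -v make >/dev/null; then
  make data/raw/seqs.align   # builds both prerequisite files, then the alignment
  make data/raw/seqs.align   # nothing has changed, so make reports it is up to date
fi
```

Misspelling `code/align_sequences.sh` in the prerequisite list of this sketch reproduces the same kind of no rule to make target error, because the misspelled file neither exists nor has a rule.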
If we're not sure how long things are gonna take and we wanna double check what's gonna get executed when we run the rule, we can use a special flag with make, which is -n. So I do make -n on my target. It then says everything's up to date, we're good to go. Now what I'd like to do is demonstrate again what all happens when we perhaps lose a dependency or update a dependency. You don't necessarily wanna run all this on your computer because it's gonna take too long and it's kind of the same mothur output that you've already seen. So I'm gonna remove my SILVA seed file. So I'll rm the SILVA seed align file in data/references, and I'll also remove my rrnDB 16S FASTA file in data/raw. All right, so those are gone now. And if I do make -n on my alignment file, we'll see what all needs to get run, or will run, right? And so we see that it's gonna run the get SILVA seed .sh script. It's then going to get my FASTA file. It will then run align sequences. And so if I run this now, without the -n, I see that it gets the SILVA seed, it gets the rrnDB files, and it's again going into align.seqs. Again, this will run for eight minutes. I'll skip ahead by editing. So that's executed. Let me go ahead and do another make -n on this to double check that everything is good, and it tells me that everything is up to date. This is excellent. Again, to kind of illustrate what was going on: if I do ls -lth on my target, that was made at 1:20. And if I do ls -lth data/raw/*fasta, that was made at 1:12. So that precedes the target. The other dependency, that was also at 1:12. And then ls -lth code, align sequences was a couple days ago. So we see that our target is newer than the prerequisites. Otherwise, make would rerun that recipe and regenerate any dependencies that were also out of sync, if you will. A common question is how finely to cut your targets. How many targets should you have? When should you have make generate a target?
So a couple of factors come to mind. First of all is the question of how long does it take to get the target, and where am I getting the target from? For example, this SILVA seed align file, I'm probably only gonna use to generate the aligned version of the rrnDB file. But I don't wanna keep pinging the server at SILVA over and over again whenever I generate that alignment file. Through testing and validating things, I might run this a dozen times. I'd rather not have to ping their website a dozen times. Same goes for the rrnDB. And so if I can download the file once and have it stored, I'm good to go. A related concern is if the target takes a long time to generate. You might have a series of commands or programs that you wanna run and perhaps really all you want is the final output. One of those steps early on might be finicky or might be really big. And if I perhaps introduced a bug further down the pipeline, then every time I run that pipeline to test it, I have to redo that finicky or computationally intensive step. So if I can pull that step's output out as its own target, I can generate it once, have it stored, and then proceed with the rest of the pipeline. Again, that is related to this download issue, but instead of downloading, we're building something that takes a long time to generate. Also, if you have a file that's gonna be the dependency of more than one rule, then that really should be its own target. Revisiting the issue tracker, we've made targets for the SILVA seed reference alignment, for the rrnDB FASTA file, as well as running align.seqs to align the rrnDB sequences against the SILVA reference alignment. I'm gonna save the rest of these for your exercises. As always, I have a set of exercises for you to work on.
After I describe these to you, go ahead and pause the video, work through these on your own, and then once you've worked through them, go ahead and press play to see my solutions to the exercises. So the first question asks you to run this make statement, which looks like a pretty good make statement that would work with our Makefile. When you try to make this target, you'll get an error message, okay? So go ahead, get the error message, and then figure out what the error message means, what the problem is, and how you can fix it. The second question: our alignment script only has one line. What are the advantages of putting that line into a bash script versus in the body of the recipe? Finally, I ask you to finish and complete issue 10. So I hope you found those exercises engaging and that they helped you practice your skills a little bit more. This first question asks you to identify an error message and then see what you need to do to resolve it. I'll go ahead and paste this in. So make data/references/silva.seed_v138.align. Looks good to me. Run that and it says no rule to make target data/references/silva.seed_v138.align, right? So you might see the error immediately. I have errors that are frequently very obvious, but I can stare at them for 15 or 20 minutes or a couple of hours and not see them as being so obvious. So let me go back and look at my Makefile and see if I can figure out what the error is. And I see, sure enough, I forgot a directory in the path. I forgot the silva_seed directory in this path. So if I come back and, after references, I add silva_seed, forward slash, silva.seed_v138.align, everything is good and we're in good shape. The second question asked, why do we have this one-liner in its own script for align.seqs? And as I mentioned before the break, there's a couple of reasons why this is a good idea. So one, the align.seqs step might be computationally intensive.
It took about eight to ten minutes to run on my computer with 16 processors. Maybe you don't wanna sit around for that every time, as you're perhaps testing a larger workflow. And so it makes sense to pull out that computationally intensive step and build the target, so you don't have to worry about recreating it every time that you're testing your workflow. Another reason to have it set aside in a script is that that script then becomes a dependency. And so if I, say, add an argument or remove an argument, or I add a line of code to that script file, then because that dependency has been updated, it will trigger make to go ahead and rebuild that target. Versus if I only put that one-liner in here in place of my bash script, that's not being tracked by make. If I modify the arguments or the commands that are being run in the recipe here, then it's not going to trigger make to update the target. Unless, again, it becomes one of the prerequisites. The third exercise asks us to finish building out the rules for issue 10. So we need to get the rrnDB TSV and the two taxonomy files. So these are gonna be very similar to what we did up here for getting the FASTA file. And we will again copy that. If I do ls data/raw/*, I see my TSV file here. So I'll copy that as my target. And my dependency is also going to be that shell script. I'll copy this down and replace the FASTA file with the name of the TSV file. I'll go ahead and make this. Maybe I'll do make -n to see what it's going to do. It says nothing to be done. And sure enough, there's nothing to be done because I forgot to save my Makefile. Remember that if the target has already been made and is up to date, then you will get a message that says it's up to date. Nothing to be done is not the same as up to date. So nothing to be done for this typically means that there isn't a rule to build that. So I need to go ahead and save that.
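Coming back to the second exercise, the difference between a recipe that calls a script and a recipe written inline can be sketched as follows. The targets `scripted.txt` and `inline.txt` and the script `code/step.sh` are invented for this illustration. After both the script and the inline recipe are edited, a dry run with make -n shows that only the scripted target is considered out of date.

```shell
#!/usr/bin/env bash
# Sketch only: scripted.txt, inline.txt, and code/step.sh are hypothetical names.
set -euo pipefail

project=$(mktemp -d)
cd "$project"
mkdir code

printf '%s\n' '#!/usr/bin/env bash' 'echo "version 1" > "$1"' > code/step.sh
chmod +x code/step.sh

{
  # Recipe lives in a script, and the script is listed as a prerequisite.
  printf '%s\n'   'scripted.txt : code/step.sh'
  printf '\t%s\n' './code/step.sh scripted.txt'
  printf '%s\n'   ''
  # Same command written directly in the recipe, with no prerequisites.
  printf '%s\n'   'inline.txt :'
  printf '\t%s\n' 'echo "version 1" > inline.txt'
} > Makefile

dry_scripted=""
dry_inline=""
if command -v make >/dev/null; then
  make scripted.txt inline.txt

  sleep 1   # make sure the edits below get a later timestamp than the targets
  # "Edit" both recipes: change the script, and change the inline command.
  printf '%s\n' '#!/usr/bin/env bash' 'echo "version 2" > "$1"' > code/step.sh
  sed 's/version 1/version 2/' Makefile > Makefile.new && mv Makefile.new Makefile

  dry_scripted=$(make -n scripted.txt)   # lists the recipe: the script is newer
  dry_inline=$(make -n inline.txt)       # reports up to date: the edit is not tracked
  printf '%s\n' "$dry_scripted" "$dry_inline"
fi
```

Listing the Makefile itself as a prerequisite would be one way to make the inline version rebuild too, at the cost of rebuilding every such target whenever any rule in the file changes.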
Then my blue circle went away. I rerun it with make -n. It tells me it's gonna go ahead and run this. I'm cool with that. It runs it, we're good to go. And again, if I do make -n again, I see that it's up to date and we're in good shape. So be sure you save your Makefiles. So that's the TSV file. Now we need to do the NCBI and RDP taxonomy files. And here's the NCBI file with its path. I will make this the target. And I will bring this down as the file name that's getting fed into the shell script. And while I'm here, I'll go ahead and replace the NCBI with RDP for the second rule. And I'll save it so that my blue circle goes away. And I'll go ahead and try to make this again to test with the NCBI file. It's gonna run that script. Great, that's all good to go. And if I do the same thing but with RDP, that goes and that's good to go. And again, if I try to make it again, it tells me everything is up to date. So that closes out those tasks. So I do git status. I see that I've got a Makefile that's not being tracked, as well as the two shell script files that are modified but need to be committed. So I'll go ahead and do git add with the get rrnDB files script and the get SILVA seed script in code, and then my Makefile. git status, very good. They're all staged and ready to be committed. git commit -m, automate pipeline with Makefile, closes #10. So again, that closes #10 will tell GitHub to go ahead and close the issue. So that's good. Just to make sure everything worked, I do git status: nothing to commit, working tree clean. I'm on branch issue_10. git checkout master, git merge issue_10. That's great, and we're ready to push. Now coming over to GitHub, I closed the issue with that commit and we're good to go. Going back to the issues, we're back to our one issue of trying to accumulate different resources about amplicon sequence variants. Thanks again for joining me for this week's episode of Code Club.
Be sure that you spend time going through the exercises on your own to help reinforce your new skills. Even better would be for you to take the ideas we've worked through today and think about how they relate to your current projects. How about easing into using make by writing a rule to do a step in your current project? I'd love to see how you're adapting what I have covered in this and other episodes of Code Club. Please let me know in the comments below what type of data analysis questions you have and I'll do my best to answer them in a future episode. Finally, be sure to tell your friends about Code Club and to like this video, subscribe to the Riffomonas channel, and click on the bell so you know when the next Code Club video drops. Keep practicing and we'll see you next time for another episode of Code Club.