Hi there, and welcome back. In the last tutorial, we discussed using literate programming tools like R Markdown to blend the narrative text for a manuscript or lab report with code, improving the transparency and reproducibility of our analyses. The ongoing idea I've been developing throughout these tutorials is a write.paper command that starts with an empty directory and goes all the way through our analyses to produce a final manuscript that is ready to submit. With the material from the last tutorial, we're basically there: we could add the necessary code to our driver script and recreate our files. One problem with that script, though, is that it's not very smart. Consider what happens if, instead of using the RDP taxonomy, we use the SILVA or greengenes taxonomies. We'd have to either rerun the entire script or work out which files have the reference taxonomy as an upstream dependency. Today we'll learn to use a tool called make, which keeps track of when a file was last updated as well as the other files that depend on it. It's kind of a strange tool, but despite many efforts to recreate something like it, make persists as a dominant tool for building software packages. Instead of software packages, we'll see how we can use it to keep track of a data analysis pipeline. Make will also allow us to deepen our thinking about the reproducibility of our analyses: we can write a rule in make to burn our project to the ground, and then we can use another make command to rebuild it. We've seen this idea before with the analysis driver script, and we'll re-emphasize that principle today. Join me now in opening the slides for today's tutorial, which you can find within the Reproducible Research tutorial series at the riffomonas.org website.
Before we start talking about make, I want to give you a pop quiz that will jog your memory from a few sessions back and help prime what we're going to be talking about today. We've discussed Noble's paper on the organization of computational biology projects a few times in previous sessions. I've shown a bulleted list of about six things that Noble points out as being really important for organizing and working through a project. How many of those can you remember? Maybe pause the video, scratch your head, look back through the previous notes if you have to, and then come back once you've got your list. Great. So this is the bulleted list that I've shown a few times. Record every operation that you perform — we talked about this with the driver script. Comment generously — this goes through all of our work, whether it's our driver script, our other files, or our README files; there's never enough documentation. Don't edit intermediate files by hand — we want to be able to start the script, walk away, and never have to touch anything. Use a driver script to centralize all processing and analysis — again, we want it to be automated. Use relative paths — and, as we've discussed, we want these to be relative to the project root. Finally, the bullet point that we've left to the side until today's tutorial: make the script restartable. That's what we're going to talk about today — using makefiles, and how we can adapt our driver script into a makefile that makes the workflow restartable. We'll differentiate between rules, targets, and dependencies when building a makefile; we'll generate and use variables within a makefile; and then we'll learn to take on a "make clean; make" mindset to help us build a reproducible workflow.
So again, the focus of today's tutorial is taking this driver script, where we centralized all of our processing and analysis, and making it a bit smarter, as I said earlier. We want to make the script restartable, so that it keeps track of the timestamps on when files were last modified or created, and of which other files depend on those files. So here are a few case studies. Imagine that we've finished writing analysis_driver.bash and we're happy that we've automated our analysis. But consider the following scenarios. First, the mothur developers put out a new release of mothur. It's always a possibility — we put one out last week, actually; we're up to version 1.40, I think. There might be a new version of the RDP classification database. Perhaps we decide to generate the greengenes reference taxonomy. Or perhaps we decide to change the coloring scheme for our ordination — instead of black and white, we want to make it, I don't know, orange and gray. Finally, perhaps we've been persuaded by our PI to generate an animated version of our figure. For each of these, we might use our analysis_driver.bash file in different ways, right? We could start everything over again from scratch, but that's going to take a while. Alternatively, there'd be a better way: restart that script from the desired step. If we change the mothur executable, then we shouldn't have to redo our downloads, and so forth. So let's think about that initial example, where the mothur developers put out a new release of mothur. What would we have to change? If we think about the various steps in the pipeline: we'd have to regenerate silva.seed.align — this explicitly depends on mothur. We'd have to rerun get_good_seqs.batch, get_error.batch, get_shared_otus.batch, and get_nmds.batch. And we'd have to rerun our R code and regenerate the figure — this is what we might consider an implicit dependency.
To run the R code, we don't run mothur — but we use files generated by mothur, right? The output of get_shared_otus.batch goes into get_nmds.batch, and the data generated by mothur is then used in R. And then, finally, we'd have to regenerate manuscript.pdf. Again, mothur isn't used to generate manuscript.pdf, but information that goes into manuscript.pdf was at one point generated by mothur. Now let's think about a new version of the RDP classifier. Well, we'd have to download the new version of the database — that's an explicit dependency. We'd then have to rerun get_good_seqs.batch, because there's a step in there where we remove sequences that are, say, chloroplasts or mitochondria, and we'd also update our taxonomy assignments for each OTU. We'd have to rerun get_error.batch, because it's affected by how we've removed sequences based on their taxonomy. We'd then have to rerun get_shared_otus.batch and get_nmds.batch. Each of these last three commands is again an implicit dependency, because data generated upstream of it is affected by having a new reference taxonomy. And again, we'd have to rerun our R code, regenerate our figure, and rebuild manuscript.pdf. Say we wanted to update to the greengenes reference taxonomy: it's all the same steps as changing the version of the RDP reference, except we'd be using greengenes instead. Now say we want to modify the coloring scheme of our ordination. Okay, well, that was a later step, in our R script. We wouldn't have to rerun all those previous mothur commands, because the mothur commands don't depend on what we use for a color scheme. So it would be considerably simpler to generate a new version of the figure with a different color scheme than to, say, change the reference taxonomy that we're using. Finally, say we've been persuaded to generate an animated version of the figure.
Okay, well, to do that we'd have to write and run a new R script, and then we might have to regenerate manuscript.pdf. Or we might not, because maybe that figure wouldn't go into the manuscript — there aren't too many animated figures in manuscripts. So this is what we call dependency hell. All of these dependencies are difficult to keep track of, and yes, you could do it by hand, tracing your finger back through the files, but it gets really complicated, and it's just hellish. The situation only gets worse as the analysis becomes more complicated. You can imagine these dependencies as a tangled web. We could take our analysis driver script and make it smarter using if statements and checking timestamps with various bash commands, but that's all kind of a pain in the neck. So what are we going to use instead? We can use make. Make is a tool that was developed in the 1970s by Stuart Feldman when he was a summer intern at Bell Labs. I think Bell Labs was pretty remarkable, and I suspect Stuart Feldman was a pretty remarkable individual, to have a tool from a summer internship persist for the last 40 years. I don't know what you were doing for your summer internships, but I certainly wasn't making a tool that would last that long. So the goal was to have a tool for compiling software. People generally hate make — it can be really frustrating to use; I'll try to make it as simple as possible. There's a variety of knockoffs that try to do what make does, perhaps slightly more easily, but again, it's been going for 40 years and none of the knockoffs have taken over, so it's clear that it's hard to do much better than the original.
So the primary use case is from software engineering, where we want to recompile software without redoing the entire build. To build mothur, we might have 200 files that get compiled into that one mothur executable. If I'm only changing one file, I don't want to recompile everything in there — I only want to rebuild the things that depend on that file. Our use case with a data analysis is that I don't want to rerun or redownload files that require a lot of time to generate; I only want to rebuild the files that depend on the change. So my motivation is that I'm going to repeat various steps of an analysis many times. I might be tweaking different parameters; I might change my reference taxonomy, right? The end product is going to be a data-heavy paper written in R Markdown, as we discussed in the last tutorial. There are going to be steps in the analysis that are slow. I don't want to have to do a lot of hand-holding, and there are going to be a lot of these steps, so I want it to be as automated as possible — and because the projects get complicated, it's frequently hard to keep track of all the dependencies. Also, I'm going to run this on a high-performance computer like Flux here at the University of Michigan, or, as we've been doing, on Amazon's cloud. So this needs to be scripted as much as possible, so I can use something like tmux or a scheduler to fire off the job and walk away. Finally, I want to make it possible for others, including me, to replicate what I've done. To get us introduced to make, we're going to take on a fun example borrowed from fivethirtyeight.com, where they had a story about predicting the age of individuals based on their name. So if I told you my name was Patrick, can you predict how old I am just based on my name?
Here's an example of them doing that with American girls named Brittany, by year of birth: if somebody's name was Brittany, you might expect that she was born around 1990. And here's the distribution from their story among the 25 most common boy names. Patrick doesn't show up here, but if you had a boy named, say, Anthony, you'd expect the median age to be about 28 years old. My reaction to reading this — I love baby names, I have a bunch of kids, and I'm always wondering how my kids fit into this — was to wonder: what about my kids' names? What would I predict their ages to be, based on their names? What about my own name? What about my wife's name? And if you know about R, there's this package called shiny; I could imagine this analysis being really cool as a shiny app, where you could plug in any name and get something out. This analysis was done in 2013 — what about if we did it today, in 2017 or 2018? It was also for kids in the United States — what about kids from Canada? And then, ultimately, can I replicate their analysis? I want to understand how they did their analysis and whether I can get their work to replicate. So this is a great opportunity for thinking about reproducible research. I really admire FiveThirtyEight, because they've done a great job of making a lot of their data and code available on GitHub. So it's great that they're doing open science — but as you can tell if you look at their data repository, it's kind of a garbage can. It's all there; you just have to dig a while. It's not really well organized. And the script in there, this most_common_name.R file, doesn't actually generate the figures from the story, which is a bit frustrating. So I did it for them, and that's what we're going to work on. The idea is that we're going to be able to go into our home directory and run git clone to download my version of their analysis as an analysis driver script.
We're going to move into that directory, run the script, and generate the files. So this is the idea that we have this driver script and it does the analysis. What we're going to do is convert this baby_name_analysis.bash file into a makefile, as an illustration of what we'll then do with our Kozich data. So let's log into AWS to do this analysis. Once we're logged in, I'm going to do git clone with the HTTPS URL, clone it down, cd into my make_tutorial directory, and ls. If we wanted to, we could look through here — let's look at baby_name_analysis.bash with nano. You can see I've hopefully commented generously through here: I've listed my dependencies, what things are being produced, my R code, and the various commands I'm running to do all this. Then I can quit out of nano. If I look in data, I've got processed and raw directories. There's nothing in processed, and there's one CSV in my raw directory — a file that indicates the number of people, per 100,000, who were still alive in 2016, I guess. The other thing I have in my notes is that my code requires the zoo package. So if I go into R and do library(zoo), I get "there is no package called 'zoo'" — I need to install it. I'll do install.packages("zoo"). Now that's installed, so I can quit R and clear my screen. I should be able to run this with the bash command. I'm going to start tmux first, because I forget exactly how long this takes. Then we can do bash baby_name_analysis.bash, hit enter, and sit back and let it roll. So that was pretty quick. Let's go ahead and open up FileZilla, go into make_tutorial, and grab family_report.html. And voila — I did this for my kids' names: Patrick, Joe, John, Jacob, Peter, Ruth, Mary, Martha. I have another kid that I forgot to add, Simon. Maybe we can do that as part of this tutorial.
So you can see the number of people born in each year named Patrick, as well as the number of them who were still alive as of 2016. I have a son named Patrick, and my name is Patrick — this line here at 1976 is right about when I was born. And we can see that Jacob — this solid blue line is the median — was born in the 2010s, so that's kind of close. I love this; I think it's kind of cool. Martha was born four years ago, so this gets her quite a bit wrong. And you can see my version of their plot here, with the ages of the different children along with the interquartile range for each child and the median year — or age, I guess — in which we would expect them to have been born. Okay, so that's pretty cool. Let's think about how we would add Simon to this analysis, and what that would do to the overall flow of our project. So again, this is the driver script file, where we have good documentation and comments. Then we have a wget, which pulls the data down from the Social Security Administration. We unzip that and move the data to the right place — this is similar to what we've already done with our Kozich data. We then run an R script that concatenates the annual baby name data. Next we fill in the missing data in the annual survivorship data to interpolate the mortality — I think it only came out every five years, or every couple of years, so we need to interpolate the intervening years. Then we generate counts of total and living people with each name using an R script. And then we run R Markdown to render family_report.Rmd and generate that final HTML file. If we look at this as a diagram, we can see that the goal is a family_report.html file. But generating this depends on family_report.Rmd and plot_functions.R — this is code — and it also depends on data/processed/total_and_living_name_counts.csv.
That CSV, in turn, depends on an R script and another data file, which itself depends on the mortality data we already saw in the data/raw directory, as well as another R script. It's also dependent on other data, right? So we can see that if we change or delete, say, all_names.csv, we're going to have to regenerate everything downstream, because it's needed here and it's needed here. Whereas if I just change plot_functions.R, I don't have to worry about all that other stuff — I only need to worry about regenerating family_report.html. If I change the code in interpolate_mortality.R, then everything on that side of the network is going to need to be updated. So what we'd like to do is similar to what we had previously: I want to be able to clone the repository, do cd make_tutorial, type make family_report.html, and have it just work. As we've discussed, a makefile is a series of rules, as they're called. These rules contain commands for processing the dependencies to create what are called the targets. So we have commands, we have dependencies, and we have targets. The target is the stuff to the left of the colon — the target here is data/processed/all_names.csv. The dependencies are everything to the right: code/concatenate_files.R, and this $(YOB_TXT) — a variable, written with a dollar sign and parentheses. The commands, then, are what follow on the next line, set off by a tab. This R -e "source(...)" call is the command. And it's important to note — this cannot be emphasized enough; I make this mistake frequently — that the second line is set off by a tab, not spaces. It must be a tab character; if you put spaces in there, make will complain with an error. So what does this rule allow us to do? Well, I can generate data/processed/all_names.csv with the make command.
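To make the anatomy concrete, here's a toy rule you can try in a scratch directory. This is a minimal sketch, assuming GNU make is installed; the file names are made up for illustration and aren't the tutorial's real ones.

```shell
# A throwaway sandbox; file names are illustrative, not the tutorial's real ones.
mkdir -p /tmp/rule-demo && cd /tmp/rule-demo

# Target (result.txt), a colon, the dependency (input.txt), then the command
# on the next line. printf's \t writes the literal tab that make requires.
printf 'result.txt: input.txt\n\tsort input.txt > result.txt\n' > Makefile
printf 'banana\napple\n' > input.txt

make result.txt    # runs the sort, because result.txt does not exist yet
make result.txt    # second time: make reports the target is up to date

# The classic mistake: indent the command with spaces instead of a tab...
printf 'result.txt: input.txt\n    sort input.txt > result.txt\n' > bad.mk
make -f bad.mk result.txt 2>&1 | grep 'missing separator'   # ...and make refuses
```

The second invocation does nothing because the target is newer than its dependency — exactly the bookkeeping we'd otherwise do by hand.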
So what we're going to do is convert our analysis driver bash file into a makefile. Looking at this, how would we write it as a rule? (Note that if you have really long lines, you can extend them with a backslash.) Well, we could say that the target is family_report.html; the dependencies are family_report.Rmd, code/plot_functions.R, and data/processed/total_and_living_name_counts.csv; and the command is down here. So I'm going to do some copying and pasting. I'll copy this, then create or open my makefile — I do nano Makefile. It's an empty file; I can paste into it, and this now is my makefile. If I save that and do make family_report.html, it fails with "missing separator" — that is the error you get when you use spaces instead of a tab. So I go back in here, and yes, as you can see, I'm indenting with spaces, so I need to put a tab in there. I exit, run make family_report.html, and it says it's up to date. Now, if I were to go in with nano family_report.Rmd and change the title — let's say "Schloss family name report, without poor Simon" — I have now updated one of the dependencies of family_report.html. I can again do make family_report.html, and it reruns that step and rebuilds family_report.html. The next rule in our analysis driver bash file generates a file that contains the counts of the total and living people with each name. It depends on three files — two sets of data and an R script — and it produces data/processed/total_and_living_name_counts.csv. Go ahead and pause the video and see if you can convert these dependencies, target, and code into a rule for make. Again, what we might expect is that on the left side of the colon we have data/processed/total_and_living_name_counts.csv, then a colon, then the three files listed; and on the next line, our command is this R call. And sure enough, that's what we see here.
So I'm going to copy this and paste it into my makefile. This is upstream of family_report.html, so I'm going to put it above that rule — I don't think the order necessarily matters, except for my own readability — and replace those spaces on the command line with a tab. Speaking of readability, I'm going to make my window a little wider so the lines don't have funky breaks; that's one of the problems with these really long file names. So we can save this and quit out, and I can then do make data/processed/total_and_living_name_counts.csv. It says it's up to date, so we're good to go. The dependencies that we have in here are these three files. So if we were to change get_total_and_living_name_counts.R, it would rerun this rule, and it would also rerun the render of the R Markdown file. Let's give that a shot. I'll do nano code/get_total_and_living_name_counts.R and remove a note to myself, just to have something to edit; save and quit. If I now do make family_report.html — clear the screen so it's easier to see — we'll see that the first step, which went by pretty quick, was to regenerate that file, and then it regenerated family_report.html. So again, it's keeping track of these dependencies for us, which is really powerful and really nice. So we're getting the hang of this. Here's a third one. What I'd like you to do is pause the video and write the rule to generate the data/processed/alive_2016_annual.csv file. Go ahead and paste it into your makefile and see if you can get it to work. Again, convert what we had in the analysis driver bash file, put it into your makefile, make sure it's properly formatted, and then run the make command — maybe change interpolate_mortality.R — and make sure that everything works the way you would expect it to.
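Before checking the solution, it's worth watching the timestamp machinery that drives all of these reruns, in isolation. A hypothetical two-file sketch (again assuming GNU make; out.txt and in.txt are stand-ins, not project files):

```shell
mkdir -p /tmp/stamp-demo && cd /tmp/stamp-demo
printf 'out.txt: in.txt\n\tcp in.txt out.txt\n' > Makefile
echo hello > in.txt

make out.txt    # builds out.txt for the first time
make out.txt    # up to date: in.txt is not newer than out.txt

sleep 1         # make sure the next timestamp is strictly newer
touch in.txt    # simulate editing the dependency
make out.txt    # reruns the cp, because in.txt now postdates out.txt
```

This is all make is doing when we edit get_total_and_living_name_counts.R: comparing modification times and rerunning only the rules downstream of the change.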
So hopefully you were able to add this rule to generate the alive_2016_annual.csv file. If we save it, quit out, and do make family_report.html, everything is up to date. You can get a long way with make by writing these relatively simple rules, where you define your target, your dependencies, and the command you want to run. The next step gets a little harder, because we need to generate a list with a large number of dependencies — we're going to get data for every year from 1900 to 2015 — and we'll need some shell commands, plus some of make's built-in functions, to generate the file names. To break down these three lines: the shell's seq command generates the numbers between 1900 and 2015 — a vector of numbers, so if we were to look at the value of YEARS, it would be 1900, 1901, 1902, all the way up to 2015. We then build a YOB (year of birth) variable, where we use addprefix to add data/raw/yob to each of the years. If we look in data/raw, we see that we've extracted from names.zip all of these files — yob1880.txt and so forth. We're starting at 1900 because I think the data before 1900 are kind of spotty. You'll also see that we have a 2016.txt file — I think those data have been released since I initially developed this tutorial. Okay, so what we're trying to do is generate a path like data/raw/yob1902.txt — build the path that goes data/raw, then yob, then the year. Then we can use another make function called addsuffix, which adds .txt to the end of all the values in YOB. So: we create the years 1900 to 2015; we add the prefix data/raw/yob to each of our years, which gives us something like data/raw/yob1923; and then we add the .txt suffix, giving data/raw/yob1923.txt. And that, then, is YOB_TXT.
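The same three steps can be sketched with a shorter range so the output stays readable — here 1900 to 1902 stands in for the tutorial's 1900 to 2015, and the throwaway `show` rule just echoes the result (assuming GNU make):

```shell
mkdir -p /tmp/vars-demo && cd /tmp/vars-demo
{
  echo 'YEARS = $(shell seq 1900 1902)'
  echo 'YOB = $(addprefix data/raw/yob,$(YEARS))'
  echo 'YOB_TXT = $(addsuffix .txt,$(YOB))'
  echo 'show:'
  printf '\t@echo $(YOB_TXT)\n'
} > Makefile

make show    # data/raw/yob1900.txt data/raw/yob1901.txt data/raw/yob1902.txt
```

Note that $(shell seq ...) folds seq's newline-separated output into a single space-separated list, which is exactly the form addprefix and addsuffix expect.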
That will be a dependency for some of our make rules. Here, the rule to generate data/processed/all_names.csv has the data/raw/yob*.txt files as dependencies, so we can replace that list with our variable, $(YOB_TXT). To do this, we have our target, all_names.csv; we have code/concatenate_files.R; and then we reference the variable, kind of like we did before, with a dollar sign, parentheses, and the name of the variable inside. The dependencies here are actually quite large: it's this R script, plus all of the yob .txt files from 1900 to 2015. And then it's going to run the command, sourcing code/concatenate_files.R, which will look for those year-of-birth files. So we'll copy and paste this into our makefile. I'm going to copy the comments too, because I forgot to put those in, and we'll add the rule here. I like to have some white space between my different rules so it's easier to read. I can save this and do make family_report.html — oops, it's complaining about a separator again; see, I do this a lot. Replace those spaces with a tab, quit, make, and it tells me it's up to date. So we're good to go. Another rule, which you can hopefully write yourself, is to convert the step where we downloaded the data into a rule. This is going to be kind of a weird rule, because we're generating targets — those yob files — but there's no dependency. Right, this is one of the downsides of make: there's no way in make to say, hey, go look at this file on the Social Security Administration website and see if it's been updated, and if it has, redownload the files and regenerate the report. You have to do some of those things yourself. So this is the rule that we'll get.
We'll have $(YOB_TXT) as the target, no dependencies, and then these commands — here, instead of one command line, we have two. So I'm going to copy this into my makefile, and there we go. Now if one of my yob .txt files is deleted — say I delete yob1976.txt — make is going to say, well, I need that here; it will come back to this rule and download all this information again to re-obtain that text file. In the process, as you can perhaps imagine, because all those files are contained within names.zip, it's going to get all of the other files as well. But that's not a big deal — this is just the way the data are stored online for us to use. So save and quit — I've got to remove those spaces again — save, quit, and then do make family_report.html; family_report.html is up to date. Now, in my Schloss family report, it says at the bottom: if you want to use other names, you will need to comment out lines 23 and 24 of concatenate_files.R, which will cause the scripts to run a bit slower. That's because normally it would generate these data for everybody; I limited it to just the names of my kids. What I'd like to do is add Simon and see what that does to the processing of our data. So I go into nano code/concatenate_files.R, and in here my kids' names are in a vector. I'm going to add Simon. So now I've changed concatenate_files.R. I quit out, and I think about what make will see: concatenate_files.R has changed, so this rule will run, producing data/processed/all_names.csv. If we scroll down and ask where else data/processed/all_names.csv is used — here it is — so it'll run this rule, and then it will also run this one. If we save back out and do make family_report.html, you can see it's rerunning concatenate_files.R and then rerendering the report.
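That delete-one-file behavior of the dependency-less download rule is easy to demonstrate without any network access. In this sketch a touch stands in for the wget-and-unzip recipe, and two hypothetical yob files stand in for the full set (GNU make assumed):

```shell
mkdir -p /tmp/fetch-demo && cd /tmp/fetch-demo
{
  # Two targets, no dependencies; the recipe recreates both files at once,
  # the way unzipping names.zip recreates every yob file at once.
  echo 'data/raw/yob1900.txt data/raw/yob1901.txt:'
  printf '\tmkdir -p data/raw\n'
  printf '\ttouch data/raw/yob1900.txt data/raw/yob1901.txt\n'
} > Makefile

make data/raw/yob1900.txt    # recipe runs: neither file exists yet
rm data/raw/yob1900.txt
make data/raw/yob1900.txt    # missing target, so the recipe runs again
```

Asking for one missing target reruns the whole recipe, which restores its sibling as a side effect — the same thing that happens when deleting yob1976.txt triggers a fresh download of names.zip.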
And if I go back into FileZilla and reopen the report, I don't see Simon in here — but that's because I never modified my report. So I need to go in and modify family_report.Rmd. I'll generate a time plot for him — he's a male — and I'll also want to add Simon down here in the data. I also need to update everybody's ages a little: Joe's actually 13; John will be 10 next week, so I'll put 10; the next ones are 6, 4, 3, and 4; Martha's two; and Simon — let's call him 0.5. We'll quit out of this and then make family_report.html. And we have Simon. Oh my gosh, has Simon really become that much more popular? I'm shocked. So that's pretty cool — Simon is still kind of a rare name — and you can see that the data for Simon have been added to my plot. And this is, again, cool: this is what I thought about when I read the FiveThirtyEight article — where do my kids fit in here? Where do I fit in here? Where does my wife fit in here? So we've been able to first replicate their analysis and then, if you will, riff on it by converting our driver file into a makefile. And that's pretty cool. I didn't realize that Simon had gotten so popular as a name. Now, some common errors that we see frequently. We already saw this one because of my copying and pasting: if you get the error "Makefile: *** missing separator.  Stop.", this is generally because you used spaces instead of the tab key to indent your commands. Replace those spaces with a single tab, and you should be good to go. You might get an error like "No rule to make target 'family_reportRmd', needed by 'family_report.html'.  Stop." — see, it's missing a period — which generally means that you mistyped the name of the dependency. You might also mistype the name on the command line, say make family_reporthtml, again missing the period, and get "No rule to make target 'family_reporthtml'.  Stop."
This means that you mistyped the name of the target, in all likelihood, or that the rule doesn't exist in the makefile. So here is what our makefile looks like at this point. Again, each chunk represents a different rule, specifying the dependencies, what we're generating, and the command needed to go from those dependencies to those targets. One thing you might notice, looking at these dependency names, is that there's a lot of repetition, right? data/processed shows up frequently. So we might think about making this a little more DRY (don't repeat yourself). Because if, instead of putting things in data/processed, I wanted to put them in, say, data/r — to denote that they're output from R — then I'd have to update the path all the way through. While that might be a bit of a silly example, there are ways to create and use variables, as we've already seen with the YEARS and YOB_TXT variables. We can make our code a bit more DRY by defining two variables, RAW and PROCESSED. So we might say RAW = data/raw and PROCESSED = data/processed. Then we can use these much like we already did with YOB, inserting them with a dollar sign and parentheses — $(RAW), for example. So we could replace data/raw/yob with $(RAW)/yob, because maybe we'll move where the raw data live — maybe it becomes data_raw instead of a raw subdirectory within data. Okay. So if we look at our makefile, we can create the RAW and PROCESSED variables at the top, and then, down here where we have data/raw, we could write $(RAW) — and we could do the same up here, going all the way through and updating our code to use these variable names instead of the literal paths. We define our paths up top, and if we ever need to reorganize our project, we can use that.
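Here's the idea in miniature, with the same RAW/PROCESSED pattern but made-up file names (assuming GNU make) — a reorganization would only touch the two definitions at the top:

```shell
mkdir -p /tmp/paths-demo && cd /tmp/paths-demo
{
  echo 'RAW = data/raw'
  echo 'PROCESSED = data/processed'
  echo '$(PROCESSED)/all_names.csv: $(RAW)/input.csv'
  printf '\tmkdir -p $(PROCESSED)\n'
  printf '\tcp $(RAW)/input.csv $(PROCESSED)/all_names.csv\n'
} > Makefile

mkdir -p data/raw
echo 'name,count' > data/raw/input.csv
make data/processed/all_names.csv
cat data/processed/all_names.csv    # name,count
```

Because make expands $(PROCESSED) before matching targets, asking for data/processed/all_names.csv on the command line still finds the rule; changing PROCESSED to a new directory updates every rule that uses it.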
And again, we've already seen this a bit with the YEARS and YOB_TXT variables. There are also several built-in variables within make that we can use, and these get a bit confusing; I admit that I usually have to look them up, or keep a little cheat sheet by my computer. `$@` denotes the target of the current rule. `$^` is all the dependencies of the current rule. `$<` is the first dependency, so in this case code/concatenate_files.R would be the first dependency. And then there's `$*`, which is the stem matched in a pattern rule. So how would we write this rule instead? Instead of writing out code/concatenate_files.R in the command, we could use `$<`. You see, we've simplified the code, because the fear is that we might change the name of the dependency and then forget to change it down here in the command; this makes it a little tidier. At the same time, I try not to get too carried away with this, because it does obstruct the readability of my code. Because, again, like I told you, I frequently forget what these dollar-sign variables mean. But if you're looking at other people's Makefiles, you'll frequently see them, and the jargon that's used is "automatic variables", because they're built into make. Similarly, we can also use `$*` for pattern matching, where the percent sign in a target matches a pattern. This next one is a bit of a weird example: if you type `make print-YOB`, that `%` in the target `print-%` will match YOB, and then the command uses echo with `$*` to print `YOB = ` followed by the value of the YOB variable, because make substitutes YOB in wherever `$*` appears.
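A cheat-sheet version of those automatic variables, using a hypothetical rule in the style of this example (the script name and paths are illustrative):

```make
# $@  the target of the rule
# $^  all the dependencies of the rule
# $<  the first dependency
# $*  the stem matched by % in a pattern rule

data/processed/all_years.csv: code/concatenate_files.R data/raw/yob2015.txt
	# equivalent to: Rscript code/concatenate_files.R data/processed/all_years.csv
	Rscript $< $@
```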
And then it comes out and says, okay, here's the variable, and prints it out. So if we type `make print-YOB_TXT`, we see all of our yob .txt files. So again, I need to add this to my Makefile: nano Makefile. I generally put a rule like this at the top, because it's kind of a utility function, and I like to use it as a resource to help me as I'm building more complicated variables like this YOB_TXT. So: `print-%:`, then on the next line a tab, then `@echo '$* = $($*)'`. If I do `make print-YOB`, I get data/raw/yob1900 all the way up to 2015. And so I can then say, okay, that works; now let's add the .txt suffix, and now I see all my paths. So I like using this print rule to build more complex things like this YOB_TXT variable. Sometimes, if make says it can't find a certain file, I will use this to see what a variable is expanding to. Sometimes it's blank, and that tells me I've got a problem in how I'm building my pattern. But it is a bit of a more advanced move that you might not want to worry about just yet. Another type of rule we can write is called a phony rule. A phony rule is a rule that doesn't really have a target as a file. So here we're saying `.PHONY: clean` to declare it as a phony rule, and then we have a rule called clean that has no dependencies and doesn't actually produce anything, but deletes all the files we've created. So I'm going to go ahead and copy and paste this into my Makefile; I like to put this rule at the very bottom. This will burn down the project. So I save that. And if I look in my directory right now, I see that I've got my family_report.html, and if I `ls data/raw`, I've got a bunch of files there. But when I do `make clean`, it's complaining: I forgot to replace my spaces.
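Written out, the two utility rules described here look like the following; the files removed by clean are the ones from this example, so adjust the paths to whatever your project generates.

```make
# debugging helper: `make print-YOB_TXT` echoes the value of YOB_TXT
print-%:
	@echo '$* = $($*)'

# .PHONY tells make that clean is not a file to be built,
# so it always runs even if a file named "clean" exists
.PHONY: clean
clean:
	rm -f family_report.html
	rm -f data/raw/yob*.txt
```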
If nothing else, me forgetting this all these times will remind you to use a tab instead of spaces. `make clean`: we see that family_report.html is removed, and if we `ls data/raw`, those files are all removed too. We've burned everything to the ground. And what I love about make is now the ability to go `make family_report.html`, fire that off, and it works. Right? And we can again look in FileZilla, see what it looks like, and we've got the updated data; it all works, and it looks beautiful. There are some other options we can use with make. `make -n` with the name of a target will tell us what commands would be run to build the target, without running them. `make -d` with the name of a target will tell us what dependencies need to be fulfilled to build the target. `make -j` allows you to use multiple processors: you define a target, and make puts independent dependencies on different processors. Typing `make` alone at the command line will build the first rule in the Makefile, so I like to specify the actual target I want to run. Make knows to look for a file named Makefile, with a capital M, but if for some reason I name my Makefile something else, I can do `make -f MyMakefile`, and it will use MyMakefile in place of a file named Makefile. Some general design thoughts, so to speak. I think using scripts is better than putting raw command-line calls in the Makefile, because the Makefile isn't a dependency itself. Say the SSA website changed: I could change the URL here in the Makefile, but that isn't going to regenerate the yob .txt files, because there's no dependency saying so. But if I put the download commands into a script and then made that script a dependency, then if I updated the code, it would regenerate the yob .txt files, because the dependency would be newer than the targets. If you make a script, try to make it specific to one function. This will minimize the scope of the dependency tree.
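The point about scripts being better than inline commands can be sketched like this; code/download_data.sh is a hypothetical name for wherever you put the download calls.

```make
# The download logic lives in a script that is listed as a dependency.
# Editing the script makes it newer than the target, so a subsequent
# `make` rebuilds the target; an inline URL in the recipe would not.
data/raw/yob2015.txt: code/download_data.sh
	bash code/download_data.sh
```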
So if you change one script, you'll affect fewer downstream files. For example, if you're doing an analysis and you put all of your R code into one file that generates, say, five or ten different plots and runs all your statistical analyses, then if you update that file, everywhere that file is a dependency is going to rerun. But if you make a separate file for each figure in your paper, then updating one figure isn't going to cause you to regenerate all the other figures. And again, I like this because it keeps things more compartmentalized and easier to work with. Some other resources, in case you're interested: Software Carpentry has a lot of great material, and one of the tutorials they have that I really like is on make; that's basically where I learned to use make. The GNU Make manual is the reference for using make, and Stack Overflow is oftentimes pretty useful too. Make can be confusing, it can be frustrating. But again, all this stuff with the variables, the automatic variables or the variables you name, is extra; it's not that critical. If you write out the names of your dependencies and your targets longhand, you'll be in great shape. I find that it's when I try to go to the next layer and think about things like variables that I cause problems for myself. Alright, so we're not here to study babies and their names. We want to turn back to the Kozich analysis. In the project directory, there's already a file there called Makefile. And so what I want you to do is see if you can identify the variables. Can you see rules for the bash commands you put in the analysis driver bash file? There are some things already in there because, again, this came out of the template for a new project, and a lot of the things are pretty common. So what type of things need to be edited for this specific project? And what is missing?
And so I would encourage you to either adapt this file for the Kozich project based on what you learned from the FiveThirtyEight names tutorial, or delete the file and start over. And remember that running `make -n` allows you to do a dry run and see what would happen. Okay, and as you go through the Kozich Makefile, recall that when you're getting the fastq.gz files, you shouldn't delete the tar file; that could perhaps be your target for the download. Instead, put it in data/raw and make it the target, and then we can use the tar file as a dependency in the next step. Okay, and so you might ask, why isn't this ideal? And that's something for you to think about. As you go along, I want you to run `make -n` on various targets along the way and think about what all needs to be done. And then finally, if you remove your SILVA seed alignment and you run `make -n write.paper`, what is going to happen? So: nano Makefile. At the top here, you'll see some variables: REFS, FIGS, TABLES, PROC for processed, FINAL for the submission directory, and our helper rule for printing. There are also three things in here that I'm going to go ahead and remove. I'm going to comment them out, and I'd encourage you to do the same, because they seemed like a good idea at the time, but they just seem to cause problems. So I'm going to delete those: out of sight, out of mind. And as we look through here, we see the downloading of our reference files. Here, we're using the RDP training set; up here, we've got our SILVA reference alignment; then how to run the data through mothur. We have our BASIC_STEM variable, because the file names get quite long, so we can use BASIC_STEM to replace that if we want. We also have get_good_seqs.batch, along with its dependencies. And here is get_shared_otus.batch, with notes on what you need to update, which we've already updated when we were building our analysis driver bash file. And this part, then, is getting the error rate.
And there's also a part in here for figures, and then also for building our manuscript. So, excellent. We need to add a few things. We need to get the data in here. So I'm going to exit out, and I'm going to do nano analysis_driver.bash, and we want to get this chunk here for our StabilityWMetaG.tar file. nano Makefile... I forgot to save when I quit before, so I'll redo that. And if we come down to the top of part two, I'll put it in here. The suggestion was to make data/raw/StabilityWMetaG.tar our target; it won't have any dependencies. And instead of removing StabilityWMetaG.tar, we're going to move it to data/raw. So what's going to happen is: it's going to download StabilityWMetaG.tar to our project directory, it's going to untar it and throw the fastq data into data/raw, and it's going to then move our tar file into data/raw. And so, if we run it again, make will see that the tar file is there. So I'm going to copy that name, because I always forget names within about three seconds. The other thing that we want to bring in is our ordination figure. So nano analysis_driver.bash, and down here is the chunk that constructs our NMDS .png file. We'll copy that into our Makefile; that was down in part three. The formatting gets kind of wonky, but that's good. All right. And so for the output, we need to know the target and the dependencies, right, and then this is the command we're going to run. One dependency is going to be code/plot_nmds.R. The other is going to be this horrifically long shared-file name. And we saw up above that we can replace everything through precluster with our BASIC_STEM variable. I need to double-check what that was called because, like I said, I'm forgetting. And so if I move up: BASIC_STEM, so $(BASIC_STEM). And we can then also replace this horribly long name.
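Assembled as a Makefile rule, the download step described here might look like the sketch below. The URL is a placeholder, since the actual download address isn't shown in this walkthrough, and the exact download tool is an assumption.

```make
# Keep the tar file in data/raw as the target, so make can see that the
# download already happened; later rules can list it as a dependency.
data/raw/StabilityWMetaG.tar:
	wget http://www.example.org/StabilityWMetaG.tar
	tar xvf StabilityWMetaG.tar -C data/raw
	mv StabilityWMetaG.tar data/raw/
```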
That's the other advantage of the variables: if we've got these really long file names, we can simplify them a bit by using variables for the stems. Okay, so let's get that working. Our dependency was this, so we can take that out. The other thing we could do is reverse the order of these dependencies, so that instead of having to write out the long name in the command, we could use the automatic variable. I'm not totally sold on the automatic variables, as I mentioned, because I frequently forget what they are. I'm also forgetting what the name of our target is, so: `ls results/figures`, and there it is, nmds_figure.png. So it's results/figures/nmds_figure.png; I go back to my Makefile and put that in. Down in the manuscript section, we need to add the dependencies. If you recall, our .Rmd file inserts a copy of our ordination figure, and I can copy and paste the name of that file, just because I always forget these things: results/figures/nmds_figure.png. Also, we brought a data file into one of our code chunks, so let's go back to our manuscript.Rmd in the submission directory and grab the name of that shared file, because it's a dependency for making the manuscript. So we fly back down and pop that in, and again, we want the backslash there to continue the line. And recall that we can replace all this jazz with the stub, the BASIC_STEM. So these are some extra dependencies for rendering our manuscript. Also, down here in write.paper, where we want to make sure that we have all the things we need in order to submit the manuscript, we're going to go ahead and delete these files, and we're going to add in results/figures/nmds_figure.png, with a backslash; see, I forgot it. And so we should be good. So back at the command line, if we do `make -n write.paper`, make complains about a circular dependency on submission/manuscript.Rmd, and says the dependency is dropped. Again, that's not a big deal. If we look at our Makefile, again, go to the end.
So to run write.paper, we need all these .Rmd files, and it uses that percent sign to match any of the extensions. So it's trying to build manuscript.Rmd, which it can't do, because manuscript.Rmd has itself as a dependency. That's a circular rule, and that's not allowed, so it's not going to build that, which is fine by me. But it is going to build a markdown file, a TeX file, and our manuscript PDF file. Okay. And so again, we see the commands it's going to run, which are the commands for write.paper. So we do `make write.paper`; it runs through, it builds all of this, and everything is good. If we then do `make -n write.paper`, we see there's nothing to be done; it's also still complaining about that circular dependency. And if we do `ls -lth submission`, we now see that our PDF, TeX, and md files are all newer than our R Markdown, BibTeX, and CSL files. So if our .Rmd were newer, make would rerun and regenerate those. Okay. So there are a couple of things that we could do to make this better. We could add a rule to download mothur, and we could make mothur a dependency of all of these rules. Or, like I said, we could change the reference database we used. Another thing, to prove to ourselves that this works: I'm going to look at data/raw, and we see all those .gz files. I don't have a clean rule here; you could make a clean rule. Instead, I'm going to do `rm data/raw/*.gz data/raw/*.tar`, and `ls data/raw` shows it's pretty empty. So if we went ahead and ran `make write.paper`, it should work. Right. So before we do that, you know what, I'm going to go ahead and add that mothur rule from my analysis_driver.bash to my Makefile. And so that's this first line here. So exit out, nano Makefile, and I'll put that at the very top. And again, the formatting gets kind of funky.
And the dependency, I'm sorry, the target is going to be code/mothur. There's no dependency. We need to tab these lines over. And again, what we could do is make code/mothur a dependency, not there but here, and we could do that for all of these rules as well. So what do we do now? Well, let's commit this; we've done a few things to modify our project. We do `git status`: we've modified our Makefile, and our md and PDF. So I go ahead and `git add Makefile submission/manuscript.md submission/manuscript.pdf`, and also `git rm analysis_driver.bash`. Then `git commit`, with the message "Transfer from analysis driver script to Makefile". So that's committed; `git status`, `git push`, enter our credentials. Excellent. So now we're going to burn this thing down. I'm going to do `rm -rf` on the Kozich directory. Oh, that was scary. So now we want to prove to ourselves that we can clone it. We go to GitHub, to pschloss's Kozich_ReAnalysis_AEM_2013 repository, and I'm going to copy the clone URL. `git clone`, cd into the Kozich directory, `ls`, `ls submission`: we've got our manuscript files. But if we `ls data/raw`, there's nothing there. Right. So what we're going to do, and this is awesome, is run `make write.paper`. And before you hit enter, make sure you're in tmux, because this is going to take a little while to run. We hit enter, and it complains: no rule to make target for the mothur stability data. Ah, because we forgot a rule. But you know what? We deleted it; we deleted our analysis_driver.bash file. So what we want to do is go back and use our history. I perhaps got a bit aggressive in deleting that. I'll go to my previous commits, before we had transferred from the analysis driver script to the Makefile, and I see where I deleted my analysis_driver.bash. Somewhere in here there's the get_mds_data.batch chunk, and we can copy those lines. I'm not going to worry about get_subs_data.batch, because we didn't use any of that data in our Makefile.
So I do nano Makefile and come down. I guess I should have done the make test before I did everything else, but that's okay. Maybe I'll put this with my figure and table generation. For this rule, the target is this long beast, and the dependency is our shared file, which is up here, right here. We can bring that down, this becomes our dependency, and then that is our command. Quit, and again we try `make write.paper`; it's firing away. To get out of tmux, we can do Ctrl-B D, and we're back to our directory. But if we `cd` into the Kozich directory, we can sit tight, knowing that we can go back in with `tmux attach` and watch it at any time. And sure enough, we see another error: mothur not found. That's because our Makefile doesn't know that mothur is in code/mothur. So this would be a great variable to make. So MOTHUR, we will do... code/mothur. Actually, I think we gave it the wrong directions all the way around; it's supposed to be code/mothur/mothur, because the executable is in the code/mothur directory. And we can replace these code/mothur calls with $(MOTHUR). I'm going to copy this to make it easier to paste throughout, and I'll go ahead and make it a dependency here too. So that looks good. I'm just looking for cases where we're calling the mothur executable. There's one; again, we can add this as a dependency. And I think there's one more. We can do a search, Ctrl-W in nano, and type mothur, and look for cases of the lowercase mothur call where it's not supposed to be, and replace it with our variable name. Looking good; it looks like we got all the cases. I think we've got it. So save that, quit. Now we'll do `make write.paper`, and we'll just watch it rip. So it looks like we got an error here, where it doesn't like my use of wildcards in tar.
So I'm going to go ahead and enable these wildcards in my tar command: nano Makefile, Ctrl-W to search for trainset, save this, and `make write.paper`. And what it told me to do: maybe I have to put the --wildcards option before the xvzf flags. I'm having a hard time seeing what's going on. So if I make that change... I'm still not sure what's going on. I'm going to delete the trainset files and try again. Actually, I'm just going to remove this wildcard business; it's getting kind of cute. So we delete that and try `make write.paper` again. This is real life, folks. Basically, what happened was that before, I had a `tar xvzf` where I'd used a wildcard to only pull out the files I wanted, and for some reason, the version of tar that's on Amazon doesn't like that. So I went in and removed that star at the end that you saw. But now it's complaining, because I have an `rmdir` to remove a directory that's not empty. So we need to modify that: instead of `rmdir`, I'm going to do `rm -rf`. Then make. So, I typed `make` alone, and the first rule ran, like we saw it would, but that's not what I want; I want `make write.paper`. In data/mothur there are a bunch of log files, so let's get to this: "unable to find any gz files in your directory". Is that because it hasn't downloaded them? I think that's what it is. If we do `ls data/raw`, we don't have any .gz files. So we don't have our dependencies straight, and we need to tell make what to do. Basically, we set the tar download up as a rule with a target, but we never used that target as a dependency anywhere. So we need to add it as a dependency. We come down here, add it as a dependency, save, and `make write.paper`. And now it's going to download the data. So, just because we write a rule doesn't mean it gets called. We have to use the target of that rule as a dependency elsewhere, or make it explicitly, for it to get used. This is going to take a while to run, but I'm confident that it's going to work this time around.
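The fix described here, wiring the tar target in as a dependency, can be sketched like this; the downstream target and batch-file names are shortened, illustrative versions of the real ones.

```make
# Before the fix, this rule never mentioned the tar file, so make never
# ran the download rule. Listing the tar target as a dependency means
# `make write.paper` pulls the raw data in before running mothur.
data/mothur/stability.trim.contigs.fasta: code/get_good_seqs.batch data/raw/StabilityWMetaG.tar
	code/mothur/mothur code/get_good_seqs.batch
```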
Well, after a few hiccups, that finally worked through, and we got to the end. We can do `ls -lth` and see that our files in submission have been generated. I can go into FileZilla, go back up to the top level of my Kozich reanalysis project, and open submission/manuscript.pdf. And similar to what we had in the previous tutorial, we see that we've got our paragraph in here, our plot is embedded, and it all looks great. So there were a few hiccups along the way, of course, in getting that to work. But I think that also underscores the value of doing the make clean and then `make write.paper` again, to make sure our ducks are in a row, everything works, and it's reproducible for us. If you're seeing this tutorial, or if you started the tutorial series, after April 25, you'll probably have a set of code from the new template that won't have many of the problems I ran into. I've gone back and updated the code that comes with the template so that it works a little better here on AWS than what we experienced. Regardless, I still think it's really valuable for people to see how we troubleshoot what's going on with our code and why we get things that don't work. And ultimately, getting it to work for us is making it more reproducible for others, right? If it's reproducible for us, it's going to be reproducible for others as well. So let's go ahead and do `git status`, and we've got our updated files. Something we might do is `git diff --word-diff submission/manuscript.md`. And we see that the only difference between this run and the previous run is small, about 17 sequences on average, and this is probably coming through because of the randomness in our chimera checking and classification, and a few other things. So we'll go ahead; again, we did the `git status`, and now we'll do `git add Makefile`.
Something weird is happening here with the data/references README; I think it's been written over, and we didn't change anything there. So I'm going to check that out and revert to the old version: `git checkout -- ` followed by the name of that file, then `git status`. Then we can do `git add Makefile results/figures/nmds_figure.png submission/manuscript.md submission/manuscript.pdf submission/manuscript.tex`; those have all been staged. `git commit`, and what should we call this? What did we do? We modified the Makefile to get the workflow to work. Great. And now we can push and enter our credentials. Excellent. So again, what we've done is demonstrated, to a pretty good degree, that we can burn down our directory, we can clone it from GitHub, we can run `make write.paper`, and voila, we can recreate it. So I'm going to go ahead and exit out of tmux, exit out of Amazon, and then go into the AWS console and stop our instance. If you have the time and the interest, I'd encourage you to go back and do a small experiment. That would be to go back into your Makefile and change which dataset you're processing. If you recall from when we pulled the data off of the mothur website to get that tar file, there were two options: one was data from samples that were also sequenced with shotgun metagenomics, and the other was without the shotgun metagenomic data. So go ahead and modify your Makefile, trigger it to rerun the rule you modified, and see what happens. Of course, that's going to take an hour or so to run, and at the end, you can run `git diff --word-diff submission/manuscript.md` and see what changed in those sentences.
As a parting thought, I'd really like you to think about the philosophy that we just exhibited, and had a little struggle with, but got to work: make clean, make. That process of burning down our directory, of deleting it and pulling it down again, or deleting all the processed files we made, and then rerunning everything to make sure it's reproducible, is really a great philosophy to work by. If you approach a project knowing that you are going to intentionally destroy your project and rebuild it, you will change how you approach the project: how you approach your documentation, how you approach your scripting and your coding and your automation. And that's going to be really invaluable. Whether or not you use make, that philosophy is really important. We haven't done anything fundamentally new here with make. The main difference, of course, is that make keeps track of timestamps and dependencies; we could have just rerun the whole script like we did previously. But we had make, and so it was a little more sophisticated. And as other projects get bigger and more involved, we perhaps don't really want to burn down the whole project and start over every time. Also, don't worry about all the variables and fancy make features; get it to work and then refactor from there. As I mentioned, some of the variables, like the automatic variables, aren't easy to read, and if it's not easy to read, it's not going to be ultimately reproducible or transparent to other people who come along. I'd also like you to think about this exercise: the analysis and writing of the original Kozich paper was done using what you might call a more traditional method, where we had a bunch of scripts running things. We didn't really publish the scripts. We tried to be transparent in our writing of the manuscript.
But as we saw from the number of reads that got through, we weren't sure whether that was the number of raw reads or processed reads, out of millions of sequences, right? I did then go back and rewrote the paper to make it reproducible using the methods we've been talking about in this tutorial series. So look at our GitHub repository and see how the full project was structured in terms of documentation and organization; think about how we used automation with make, and how versioning was used to track our progress in the rewrite. How does the rewritten manuscript compare to the published version? What could be improved in the documentation of the rewritten manuscript? And what would be the next steps toward reproducibility that you feel comfortable taking? You might start small with make. In general, when I write a Makefile, I'm not repurposing an analysis driver script. I start with one rule, then I add another rule when I'm ready to do the next step, and I keep adding rules as I do subsequent steps. So what we've done here is a little bit artificial, because we've refactored previous code to generate what we've done, both with the baby names example and with the Kozich analysis. Think instead about starting with a blank file, where you slowly add rules to build up your analysis. That's another way of approaching it that might make it a little easier to understand what's going on. Well, that was a lot of material, but it was really important. Next to version control and literate programming, make is one of the three most important tools in my reproducible research toolbox. The ability to run make clean and then make write.paper is a great feeling, maybe just a little bit better than running git init. Last year, I wrote a paper in mBio about the use of preprints among microbiologists. It was a commentary that had some data in it and a few plots.
That manuscript was written using the same methods I'm outlining in this tutorial series. Over the past two years, as you may know, preprints have really grown in popularity, and during the time that the manuscript was in review, while I was making my edits, a lot had changed in terms of the number of preprints being posted. Before my final submission, I was able to use `make write.paper` to refresh the project and have bleeding-edge data in the paper. I did this all with minimal interaction with the code. It was pretty awesome, and it underscored the value of using tools that facilitate the reproducibility of our analyses. In a couple of months, I'm going to be giving a talk at the Microbe meeting on preprints among microbiologists, and I look forward to refreshing the project again, using `make write.paper` to update my figures and the statistics I report about the use of preprints among microbiologists. In the next tutorials, we'll be discussing tools that we've already been using to help foster collaboration. In the next tutorial, we'll use a process called GitFlow to collaborate with ourselves. In the subsequent tutorial, we'll discuss the use of pull requests and other resources hosted on GitHub to foster collaboration with others. I look forward to talking to you then.