Hey, have you ever used make to keep track of the dependencies in a data analysis pipeline? It's my go-to tool for ensuring the reproducibility of my analyses. If you've been following along with past episodes of Code Club, you've seen that I've been using make to describe recipes for creating intermediate files in my analysis. What we've done so far has been pretty basic. Stay tuned, though, because in today's episode we're going to take it up a notch.

Hey folks, I'm Pat Schloss and this is Code Club. In each episode of Code Club, I present various concepts that I use in my own research to improve the reproducibility of my analyses. Please be sure to subscribe to the channel and hit that bell icon so you know when the next episode is released.

Two big concepts in making data analysis reproducible are keeping your code DRY and automation. Don't know what DRY code is? Don't Repeat Yourself. In the last episode, we created a shell script to create amplicon sequence variants, or ASVs, for different regions and distance thresholds. The script was smart enough to figure out the region and the threshold from the name of the file we asked it to produce. That file is also called the target. But the problem is that we may end up wanting to run that code 50 times. Are we going to copy the same make recipe 50 times, changing the target each time?
No, that wouldn't be DRY. Today we'll see how we can create the names of the 50, or however many, target files and then tell the same recipe to create each of those 50 files. But to pull this off, we'll first have to learn a few concepts in make. We'll learn how to create variables. Then we'll learn how to use functions in make to do things like extract the path from the file names that are stored in the variables. Then we'll also learn how to loop over a vector of values. Finally, we'll see how we can use this list with the recipe we created in the last episode to automatically create all 50 files with a simple call to make from the command line. But don't worry, we'll go slow, and in the end all these new tools will allow us to show off the automation powers of make and its ability to keep track of the dependencies for the 50 files. That's where its power for reproducible analysis will really shine.

Hopefully you've already gathered from this intro that even if you don't know what an ASV or a 16S rRNA gene is, you'll still get a lot out of this video. Never heard of make either? No problem. Give this episode a view, and I'm sure you'll find yourself thinking it needs to be the framework for your next data analysis pipeline.

Alrighty, let's go into our project root directory. It's green.
We're in good shape. I'm going to go ahead and fire up Atom, and we're only going to be working with the Makefile today, so I'll go ahead and open that up. At the very top, I'm going to put in a special target called print-%. So that's going to be the target, and as we've seen before, if I were to say print- and then something, that something is going to be the value of the percent sign.

Now, something we haven't seen before is using echo in a make recipe. So I do @echo and then, in quotes, $*=$($*). Don't worry about what all this means just yet. We've seen before that the name of the target is $@. Well, if I want the value of what's in this percent sign, what's matching that wildcard, that is $*. So the first $* is saying give me the name that matched the wildcard, and then $($*) is saying give me the value of the variable with that name. To demonstrate that, let me do test=howdy. Okay, and so now if I come back and do make print-test, it should say test=howdy. Right, howdy.

We're not going to keep test=howdy. Instead, we're going to come back down towards the bottom of our workflow here. All of these are our various recipes, and you'll see here that I have a recipe for building my ESVs for the different regions, and so here the percent sign is matching those different regions. Then down here we have a similar one for the ASVs, where we have the percent sign to represent the region and the 01 to represent the threshold of 0.01. Okay, so we're going to turn both of these into variables, and we're going to generate all possible values using those variables. That's what we're going to do today.

Where I'm going to start is with this target, data/%/rrnDB.esv.count_tibble. That percent is doing the work of the four different regions that I have in my data set. Right, I can copy this four times and for each I
can give a different region: v34, v45, and v19. I can put these on the same line separated by spaces, and I can call these ESV_TIBBLES, equal to that. So now if I do make print-ESV_TIBBLES, I get those four different tibbles. Now, if I wanted to add a fifth region, I would concatenate on another path. This certainly works, but it gets a little bit tedious.

The way I would run this is that I could replace this target here with a dollar sign and then, inside parentheses, the name of the variable, so $(ESV_TIBBLES). This then represents all four of those ESV tibbles, right? And so if I were to grab, say, this one and do make -n on it, it's complaining about code/get_esv.R; it should be get_esvs.R. So save that. Okay, so then it knows to run those. But the problem that we see is that if we look back at our rule here, it is dependent on this file, which still has a percent sign in it, right? It was matching the percent sign that was in our target. So those are two problems. Again, if I have to add another region, then I have to add another target, and that can get kind of tedious. And we need a way to get the region, or the directory name, from our file.

So to get this to work, let's first deal with this problem of the percent sign not representing the region that we're interested in, okay? To do that, there is a rule, or target rather, called .SECONDEXPANSION, and it doesn't take any prerequisites. I'm actually going to cut this and put it back up here at the very top with my .SECONDARY rule. What this allows us to do is use some special functions in our list of prerequisites, right? So what we're going to do is replace the data/%/ with a special function called dir, and that's going to get the directory of our target. You'll notice that down here, when we used the special variable $@, we used a single dollar sign.
Well, in the list of prerequisites, if we're going to use these variables or these functions, we need to use a double dollar sign, and that's why we're using that .SECONDEXPANSION call, okay? So again, what this will do is extract the directory information from my ESV tibble. It's basically going to take this value out and then append rrnDB.count_tibble to the end of it.

All right, so let's give this another shot. We now see that if I do make data/v34/rrnDB.esv.count_tibble, sure enough, it gets the directory right and we're in good shape. Okay. So again, what we've seen here is that we can use special functions that come with make to extract different parts of the path, and here we're getting the directory path from the name of our target, which is great. The other thing we're seeing is that if we're using these functions or variables in the list of prerequisites, we need that double dollar sign, and, as I put at the very top, we need the .SECONDEXPANSION target to appear before the rules that rely on the second expansion.

Basically, what's happening is kind of like what we saw previously for that print rule: the $@ gives you the name of the target, and then dir takes that name and gives you its directory, right? So it's kind of doing two layers of expansion. If that doesn't make sense to you, then just remember: if it's in the prerequisites, you need two dollar signs; if it's down here in the recipe for the rule, you need a single dollar sign. Okay, great.
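The rule described above can be sketched as a small Makefile. This is a minimal illustration, not the project's actual file: the variable name ESV_TIBBLES and the file names are reconstructed from the spoken transcript, the recipes are stand-ins that only touch files so the sketch runs in an empty directory, and the script prerequisite mentioned in the video is left out for the same reason. Recipe lines must begin with a tab.

```make
# Sketch only: names and paths are assumptions reconstructed from the transcript.
ESV_TIBBLES = data/v4/rrnDB.esv.count_tibble data/v34/rrnDB.esv.count_tibble

# Must appear before the rules whose prerequisites use $$.
.SECONDEXPANSION :

# First expansion: each $$ collapses to $, leaving $(dir $@)rrnDB.count_tibble.
# Second expansion (once per target): $@ is that target's name, so for
# data/v34/rrnDB.esv.count_tibble the prerequisite is data/v34/rrnDB.count_tibble.
$(ESV_TIBBLES) : $$(dir $$@)rrnDB.count_tibble
	@echo "building $@ from $<"
	touch $@

# Stand-in for the upstream rule that makes each region's count tibble.
data/%/rrnDB.count_tibble :
	mkdir -p $(dir $@)
	touch $@
```

Running make -n data/v34/rrnDB.esv.count_tibble against this sketch shows the commands for the v34 directory only, which is the behavior the double dollar sign is there to provide. In the recipe itself, $@ and $< keep their single dollar sign.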
So that solved the problem of the path in our prerequisites. Now let's deal with this beast where we've got four targets, and we can imagine that down the road maybe we would want a fifth or sixth or seventh or eighth target. So what I'm going to do is create ESV_TIBBLES_TEMP as a way to try something out, and I'm also going to create something called REGIONS, which is going to be a variable, like ESV_TIBBLES, that contains the names of my different regions: v4, v34, v45, v19. Again, I can do make print-REGIONS and it'll output my four regions. So that looks great.

Next we're going to use the foreach function. It's like dir in that it's a function, and we call it with a dollar sign, parentheses, foreach. The syntax for foreach is the variable name, comma, the vector, like REGIONS, and then, after another comma, what to do with each value. Okay, so again, that's the general syntax. So we'll do ESV_TIBBLES_TEMP=$(foreach, and then I'm going to use a capital R to indicate the region, and the vector that we're going to iterate over is $(REGIONS). Then we need to think about what we want to do with that, right? Again, we're taking those four values of the regions, and each time we go through this loop, through these steps, R takes a new value: the first time through R has the value v4, the second time v34, the third time v45, the fourth time v19. What we're going to do then is build a path. So we can do data/$R/ and then this, right, rrnDB.esv.count_tibble, and that should work.

Now, something to note is that I could certainly put R in parentheses after the dollar sign. The dollar sign uses either what's in the parentheses or the single character that follows it. I'll leave it like this.
That way it's clearer as you're thinking about how you might design and use your own variables. So let me go ahead and copy this variable name, and now if I do make print-ESV_TIBBLES_TEMP, I see that I get my four paths back, right? This is really nice, because if I were to add another region, say v35, I change that one thing, adding v35 to the REGIONS vector, and bam, now I have the target for v35. So that's excellent, right? So I'm going to use that definition for ESV_TIBBLES, and I'll go ahead and get rid of this thing up here. You know what, I'm going to leave that there for now. This looks good. And again, I can do make -n on this v35 one and, let's see, did I remove that? There we go. I think I forgot to save the file or something. You can see that if we want to add a region, then we can run everything through. Actually running this would probably throw an error, because you might recall that extract_regions.sh has an if statement in there that looks for the region to figure out which coordinates to use to extract that region from the gene. So I'm going to go ahead for now and remove that v35 and save it. But of course, if I do v19 instead, I see the commands that it's going to run to build that for me.

The other thing that we talked about previously was that we can create these phony rules, right? So I can add esvs as a target, and the prerequisite for esvs is going to be $(ESV_TIBBLES), right? I can save that, and if I do make -n esvs, this shows me all the commands I need to run to build those ESV files. So that's really nice, right?
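The pieces covered so far (the print-% helper, a REGIONS variable, foreach, and a phony target) fit together roughly as in the sketch below. The variable and target spellings and the file names are assumptions reconstructed from the spoken transcript, the .PHONY declaration is an addition the video does not confirm, and the recipe is a stand-in that only touches files so the example runs in an empty directory. Recipe lines must begin with a tab.

```make
# Print the value of any variable, e.g. `make print-REGIONS`.
# $* is the text matched by %, and $($*) is the value of the variable with that name.
print-% :
	@echo '$*=$($*)'

REGIONS = v4 v34 v45 v19

# $(foreach var,list,text) expands text once per word in list.
# $R and $(R) are equivalent here because R is a single character.
ESV_TIBBLES = $(foreach R,$(REGIONS),data/$R/rrnDB.esv.count_tibble)

# A convenience goal that is not a file (the .PHONY line is an assumption).
.PHONY : esvs
esvs : $(ESV_TIBBLES)

# Stand-in recipe so the sketch runs without the real pipeline scripts.
$(ESV_TIBBLES) :
	mkdir -p $(dir $@)
	touch $@
```

With this file, make print-REGIONS prints REGIONS=v4 v34 v45 v19, and make -n esvs lists the commands for all four regions. Adding a region then means editing only the REGIONS line.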
I can generate these four files by running a simple make command. Great. I'm going to hold off on running this for now, and I want to turn my attention to the ASVs. We'll skip over this processed file, and I'm actually going to move it down below, because in a future episode we're going to think about how we can combine both ASV and ESV data. But again, that's for another day.

So this all looks good, but in this case, with the ASVs, I have both the region and the threshold. So I'm going to create another variable that I'll call THRESHOLDS. I'm going to use values of, say, 001, so that's going to be one in a thousand; for something like a full-length sequence, that's about one and a half base differences. I'll do 002, 003, 004, 005. Let's do 008, 01, 015, 02. Let's do 025, 03, 04, 05. Right, so I've got one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen different thresholds. I don't know that I need all of these. I might need more, I might need different ones, but I've got them there. The nice thing is that instead of having to make, like I said earlier, 50 different targets or so, or 60 or whatever it is, you'll see that I can do the same type of thing we did with the ESVs, but for the ASVs, to build all those targets for the different thresholds and the different regions.

But we first have to figure out how to make ASV_TIBBLES, right? So what we want to do is create ASV_TIBBLES, and for now I'm going to copy this down and we'll say equals that. So we've got our regions, but I need a second for loop in here. So we'll do $(foreach and then T, comma, $(THRESHOLDS). Good. Then inside these parentheses, I'm going to go ahead and bring this back up in, and instead of esv I'm going to do $T. And I want to make sure that I've got the right parentheses.
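The nested loop being typed at this point can be sketched as follows. The threshold values are a best reconstruction of the spoken list, and the ASV file name pattern (rrnDB.<threshold>.count_tibble) is an assumption, as are the variable names. The count-asv-tibbles target is a hypothetical helper, not something from the video: it uses make's words function to check the result instead of counting parentheses by eye. Recipe lines must begin with a tab.

```make
REGIONS = v4 v34 v45 v19
THRESHOLDS = 001 002 003 004 005 008 01 015 02 025 03 04 05

# Outer loop over regions, inner loop over thresholds:
# every region is paired with every threshold.
ASV_TIBBLES = $(foreach R,$(REGIONS),$(foreach T,$(THRESHOLDS),data/$R/rrnDB.$T.count_tibble))

# Hypothetical sanity check: 4 regions x 13 thresholds should give 52 paths.
.PHONY : count-asv-tibbles
count-asv-tibbles :
	@echo $(words $(ASV_TIBBLES))
```

If a parenthesis is missing, make reports an unterminated function call or variable reference when it reads the file, so a successful run of make count-asv-tibbles that prints 52 is a reasonable sign the nesting is right.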
So I've got one, two open; one still open; two open, three open; two open, three open, two, one, zero. So I think I've got the right number of parentheses. I can test this now by doing make print-ASV_TIBBLES, and that worked, right? You can see that we've got v4 with the 001, and we've got v19 out to 05. Okay, so awesome.

So again, we can bring that in here as a replacement for our target, so anything contained in this vector will get built using these prerequisites and this rule. We also see, of course, that we have the path, and so like we did above, we can do $$(dir $$@), and we're going to get rid of that extra forward slash. We'll do the same thing right over here for the count tibble. Let's test this: if we do make -n on this v19 count tibble file, we see the three things that need to get run to build that file. Excellent.

Now again, I can create, say, asvs as a phony target with $(ASV_TIBBLES) as the prerequisite. If I do make -n asvs, we see everything that gets run. And the cool thing about make is that it's going to get the unique seqs and get the ASVs for those different regions, and then it will go ahead and generate the count tibbles for the different thresholds. If it needs to build a distance matrix, it does that once, because that's the prerequisite for everything else. make is so good because it keeps track of all of our prerequisites for us, and I can build out all of these output files with a very simple command from the command prompt.

Excellent. So what I'm going to do is create another variable called EASV_TIBBLES, and that's going to be all of the exact and amplicon sequence variants. We'll say it equals $(ESV_TIBBLES), and then we can add on this other variable, $(ASV_TIBBLES). So we're concatenating two vectors, two variables, together, right? I'm going to go ahead and remove these phony targets for now
and I can do easv as a phony target, and the prerequisite is $(EASV_TIBBLES). Good. And again, if I do make -n easv, I see everything that's going to run to build out all those targets, our ESVs as well as all of the ASVs. Again, it keeps track of the prerequisites and it only runs the commands it needs to build those prerequisites, which is really, really slick.

I'm actually going to move these variables to the top of my Makefile. I like to have those up top because, say I added a region, then I'd want that region to be available throughout the whole script. Same for the thresholds and everything else. So that all looks good, and I'm excited about this. I'm going to go ahead and get rid of this comment for now. This is good. And then again, like I said, next time we'll come back and think about how we can pool all those tables together using these various dependencies.

Excellent. So I'm going to go ahead and remove that -n and build out my EASVs. It's probably going to take a while to run. I'll do some editing to speed things up a bit, but while this is running, why don't you go ahead and like this video, subscribe to the channel, and click that bell icon so you know when the next episode is released, so you can hear the exciting conclusion of what we do with all these count tibble files.

All right, that took about five hours to run on my computer. Not trivial, but at the same time, we did all of that with one simple command, make easv, right? It went back through our whole pipeline, figured out what needed to be done, and then did it. So, two things to note. First, if you dig into the documentation for make, there is an argument, I believe it's -j, where you can put in the number of jobs to run at once so the work is parallelized across your processors. The other thing is that what I've just shown you with make easv is really powerful for those of you who are working on a high performance computer cluster, or for something like what I just did, right, where it was a job that was
going to take five hours to run. The reason it's good for people on a high performance computer cluster is that they generally don't want you running commands directly from the prompt like we have been, because I'm working on my laptop, right? So you're typically submitting what are called jobs, where you perhaps have a command like make easv. You fire it off, and then it runs in the background, effectively, and lets you know when it's done, right? So again, it's useful for those cases.

What I would encourage you to do is look back at our Makefile. There are some other targets that we have made through here. Think about, you know, how could you make a target or set of targets to create the distance matrices, right? We don't have to do this, because we're using that wildcard percent sign. You might also think ahead to this rule where we created the consolidated count tibble file. How could you perhaps simplify the list of arguments by using a variable, right? So give those a look and think about them. In the next episode, we're going to work on this rule here to think about how we can coalesce all of our count tibble files, from our ESVs as well as our ASVs. So keep practicing. Please tell your friends about Code Club, and we'll see you next time for the next episode.
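As a recap of the final step in this episode, here is a minimal sketch of concatenating the two lists and building everything through one phony target. The names (EASV_TIBBLES, easv), the file name patterns, and the shortened THRESHOLDS list are assumptions for illustration, and the recipe is a stand-in that only touches files; the real pipeline runs the project's scripts instead. Recipe lines must begin with a tab.

```make
REGIONS = v4 v34 v45 v19
# Shortened list to keep the sketch small.
THRESHOLDS = 01 03

ESV_TIBBLES = $(foreach R,$(REGIONS),data/$R/rrnDB.esv.count_tibble)
ASV_TIBBLES = $(foreach R,$(REGIONS),$(foreach T,$(THRESHOLDS),data/$R/rrnDB.$T.count_tibble))

# Two lists are concatenated by writing them side by side with a space between.
EASV_TIBBLES = $(ESV_TIBBLES) $(ASV_TIBBLES)

# One goal that pulls in every ESV and ASV file (the .PHONY line is an assumption).
.PHONY : easv
easv : $(EASV_TIBBLES)

# Stand-in recipe so the sketch runs in an empty directory.
$(EASV_TIBBLES) :
	mkdir -p $(dir $@)
	touch $@
```

A dry run with make -n easv previews the commands, and make -j 4 easv asks make to run up to four recipes at a time. On a cluster, that same one-line command is what would go into a submitted job script.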