Interacting with the prompt at the command line or in programs like R is helpful for figuring out what you want your code to do and for troubleshooting your code. But if you spend too much time working from the prompt, your reproducibility will suffer. The solution is to document all of your steps in a script. We've seen how to do this previously using R and bash. Today, we'll revisit bash scripting and learn how to run commands from the mothur software package at the command line.

Before we get going too far, I wanted to let you know I'm Pat Schloss. I'm a professor at the University of Michigan and the creator of the mothur software package. In each episode of Code Club, I present concepts that I use in my own research to improve the reproducibility of my analyses. Please be sure to subscribe to the channel and hit that bell icon so you know when the next episode is released.

In the last episode, I discussed how most algorithms for identifying amplicon sequence variants, or ASVs, allow for some variation between the sequences that get clustered together into the same ASV. This is part of how the algorithms denoise the data at the same time as they cluster the sequences together. As we've seen, some of that noise is really signal, since a single genome can have multiple versions of the 16S rRNA gene. If ASVs truly represented exact sequences, then they would split a single genome, really a single cell, into multiple ASVs.

One of the more popular algorithms for identifying ASVs is DADA2. DADA2 has a parameter called omega that affects how stringent the ASVs are; it is used to determine whether an ASV should be split, merged, or left alone. The default value is 10^-40. If omega did what it is actually claimed to do, it would probably be more appropriate for it to be on the order of 10^-5, or maybe even 10^-10.
Regardless, by fiddling with the omega parameter, you can modulate the stringency of the ASVs. It should also be noted that the other algorithms for generating ASVs, including UNOISE, Deblur, and mothur's pre.cluster, all have some parameter that impacts the stringency of the ASVs. People don't like to acknowledge that an ASV is like an operational taxonomic unit, or OTU, but it really is.

Starting in today's episode, we're going to use our reference sequences to generate ASVs and OTUs to round out our analysis. Because we don't have the abundance data needed to generate ASVs by those other approaches, we'll cluster sequences together based on their dissimilarity to each other, using values ranging from 0.1% to 5%. Generally we use 3% as the cutoff for OTUs. Today, I'll show you three ways you can run mothur: in the interactive environment, using a script file, and running mothur commands directly from the command prompt or within a bash script. I'll also show you how mothur's "current" keyword works when clustering sequences together. Now, I know that not everyone uses mothur and not everyone has heard of mothur. That's fine. Honest. My feelings aren't hurt. If you stick around, though, you might get some ideas and more practice with creating your own bash scripts and makefile rules.

I went ahead and created a new issue, issue 34, to create ASVs. As I mentioned in the last episode and in the introduction, when we actually generate ASVs from data collected from microbial communities by amplicon sequencing, all the algorithms allow for a little bit of slop, a little bit of play. So the idea that they are exact sequences is really not correct. I don't want to get into the whole discussion or debate over what these things are. In some ways, that's what we're trying to do here: we're trying to bring data to the debate.
Anyway, for my purposes, I'm going to trust that the genome sequences are the true sequences, the exact sequence variants, and that amplicon sequence variants are not necessarily exact. They might be exact, but more than likely not. And because all of these algorithms allow for a little bit of play, I'm going to create ASVs based on different distance thresholds. I know that's not exactly how they're generated, but it's the best I can do to simulate ASVs using the data we have, the genome sequence data. The advantage of using genome sequence data is that we know the true answer; we don't have to worry about sequencing error, or at least not as much.

If I were doing this in mothur with real data, I would use the pre.cluster command, which clusters sequences that are within a handful of bases of a more abundant sequence. But because it's all relative to a more abundant sequence (we tend to trust the more abundant sequences more), it's abundance based, and we don't have abundances here. In some ways we do, but an ASV might be really abundant in one genome of one species and rare in another, so it's just too messy. What we're going to do instead is simulate ASVs using different distance thresholds. You could think of a distance threshold of one in a thousand as allowing one to two nucleotides of difference for a full-length sequence. Then we'll go all the way up to a distance of five percent. Typically for OTUs we use a distance of three percent.

So we're going to have to do some reworking, some refactoring, of our code. We talked about some of that in the last episode, and as I mentioned, I've created issue 34 for it.
We're going to convert the count_unique_seqs.sh script to get_unique_seqs.sh, because there are a few files we need, the unique align file and the count table, for calculating distances and then, of course, for getting our clusters and generating an output table. This will probably take a couple of episodes to work all the way through. We're going to start with converting this script, and then we'll get into using mothur; I'll show you three different ways that we can run mothur code.

OK, so first we want to convert count_unique_seqs.sh to get_unique_seqs.sh. We're going to pull out some code, change its name, and then update the make rules. So I'll come back over to my project root directory and do a git branch and git checkout for an issue 34 branch. Again, we've got a separate issue branch here. It's green; we're good to go. I'll open up Atom. We're mainly going to be working in Atom and the terminal for the next couple of episodes, so I'll minimize some of these files and directories.

As I mentioned, we want to convert count_unique_seqs.sh to get_unique_seqs.sh, so I'll do git mv code/count_unique_seqs.sh code/get_unique_seqs.sh. Using git mv allows us to keep track of that change, and this is what the git status output looks like. OK, so we'll look at get_unique_seqs.sh, and that all looks good. There's some funky formatting up at the top that I'm not going to worry about. What we want to do is pull out the code that converts the count table to a tibble; I'm going to comment that out for now. And that looks good. I'm going to do my testing with the v4 region, because the other regions tend to be much larger data sets, since there aren't as many duplicated sequences as we've seen with the v4 region. OK, great. I'll create a target variable that equals that, and then what I want to do is run some of these commands.
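What git mv buys us over a plain mv can be sketched in a throwaway repository (paths mirror the ones discussed, but the scratch setup is mine): git records the change as a rename rather than a deletion plus an untracked file, so history follows the script.

```shell
# Build a scratch repo so the sketch is self-contained
scratch=$(mktemp -d)
cd "$scratch"
git init -q
mkdir code
echo '#!/usr/bin/env bash' > code/count_unique_seqs.sh
git add code/count_unique_seqs.sh
git -c user.name=cc -c user.email=cc@example.com commit -qm "add script"

# The rename, tracked as a rename ("R" in the status output)
git mv code/count_unique_seqs.sh code/get_unique_seqs.sh
git status --short   # R  code/count_unique_seqs.sh -> code/get_unique_seqs.sh
```

A plain mv followed by git add usually gets detected as a rename too, but git mv stages both sides of the move in one step, which is harder to get wrong.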
Just to see what things look like for the v4 region, I'll do ls -lth data/v4. We see that we updated a fair number of things in the last episode, but we now have these temp files, the temp align and temp groups files. You'll recall that's because, in those sed commands from many episodes ago, we parsed the header to get a unique identifier for each of the ASVs. Then we had the unique.seqs and count.seqs commands, which I'll run here; we'll talk more about mothur in a few moments. I forgot to copy and paste this line, so this is running again with v4. It's much quicker, though generating that count file is a little slow.

OK, that took a minute or so to run. Let's look at what's in data/v4. We've got a bunch of temp files: a temp count table and a temp unique align. The temp unique align and temp count table are the files I want to save for subsequent steps. We see we had already moved the temp unique align to unique.align; previously we had gotten rid of the stub's temp count table, and we want to keep that as stub.count_table. I think if I copy and paste these over (sometimes I don't hit the C of my copy as hard as I think I do) and look at v4, this looks cleaned up. This all looks really good.

For get_unique_seqs.sh, let me go to my makefile. We need to pull this rule apart. Let's see, where did we do that? Right here. Again, this rule isn't going to create the ESV count tibble anymore; that will get its own rule, and this one is going to call get_unique_seqs.sh. You'll note that this rule is probably from a point before we started incorporating those automatic variables, but as you've seen me do before, we could replace the script name with $< to get the name of the script.
Again, this is a place where it's helpful to have the automatic variables, because if I change the name of the script, I have to remember to change it here as well. And this line can come out, because we're not using that R script in this rule anymore. But I do want to create a rule whose output is data/v4/rrnDB.esv.count_tibble, the ESV count tibble. The dependencies are going to be the R script as well as the count table file, which it uses as input. I'm going to change the name of that script to getesv.R, so we can again do a git mv (I already forgot the name of that script; convert, right?) to rename the convert script to code/getesv.R. And that is good.

All right, so this script gets the ESVs. It takes an input file and an output file: the input is going to be the count table, and the output is going to be the ESV count tibble. We specify both in our makefile, and again we can use the automatic variables. I'll use $^, because $^ gives me all of the dependencies, and then I'll add $@, which is the output file. So if I do make -n data/v4/rrnDB.esv.count_tibble, the -n shows me the commands it's going to run. I'll go ahead and run it for real. This is going to take a few minutes, so I'll edit it out.

While it's running for you, go ahead and make sure you subscribe to the channel. Only about a third of the people who watch these videos are actually subscribed, so if you're part of those other two thirds, please subscribe. It makes me feel good. Click the bell icon so you know when the next episode is released.

So this went through. Let's do ls -lth on data/v4, and we see we've got our unique align, our count table, and the previous ESV count tibble file for v4. Again, we could go back and do this for all the other regions.
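The rule being described can be sketched like this; the file names are my reconstruction of this repo's layout, and the recipe assumes the R script is executable and takes the input and then the output as positional arguments.

```make
# Sketch (names reconstructed from the discussion, not copied from the repo).
# $^ expands to every prerequisite (script first, then the count table),
# and $@ is the target, so the recipe effectively runs:
#   code/getesv.R data/v4/rrnDB.count_table data/v4/rrnDB.esv.count_tibble
# (the recipe line must be indented with a tab)
data/v4/rrnDB.esv.count_tibble : code/getesv.R data/v4/rrnDB.count_table
	$^ $@
```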
So I think we've done a good job of extracting the ESV steps from the overall pipeline we want to use to generate those clusters. OK. So, again, what we have to work with for generating the clusters are this rrnDB count table and the rrnDB unique align file.

Now I'll show you three different ways that we can run mothur. My copy is at code/mothur/mothur. Typically, a lot of people will run mothur in interactive mode like this; when I teach workshops on mothur, I primarily teach it from this environment, where you've got a prompt and you can enter your commands. So I could do something like dist.seqs, with fasta=data/v4/rrnDB.unique.align and cutoff=0.05. This will run, and I see the output, and I've got to wait to do anything else. This will take a few moments to run through, but you get the idea that you run one command at a time, and it's interactive: it's going to give me a prompt here, waiting for me to do the next step. OK, that completes.

Now I can do the next step, which might be cluster. I could say column=current, or I could copy and paste the name of the distance file, and then say count=data/v4/rrnDB.count_table and cutoff=0.01, and it'll run this command. We could keep going; we could do a make.shared step. But you get the idea that you can work interactively, like you might in RStudio or R, running one command at a time, waiting at the prompt to tell mothur what to do next. So we had two commands here, and I'm going to copy them, because we all know how much I love typos. I'll copy them and quit out of mothur.

So that's one way to do it. For the second way, I'm going to create a text file and remove all of this stuff except for the actual mothur commands, and save it.
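Here is roughly what that interactive session looks like, as a sketch; I can't run mothur here, and the file names follow the conventions used in this project, so treat it as illustrative rather than verbatim output.

```
code/mothur/mothur      # start interactive mode; you get a "mothur >" prompt

mothur > dist.seqs(fasta=data/v4/rrnDB.unique.align, cutoff=0.05)
mothur > cluster(column=current, count=data/v4/rrnDB.count_table, cutoff=0.01)
mothur > quit()
```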
And I'm going to call it, say, asv.batch, and save that. Then I can come back to my prompt and type code/mothur/mothur asv.batch. Running it this way from the command line, I've given mothur a batch file that contains all of the commands I want to run, and you can see that it runs everything. This works really well if you've got a pipeline of commands that you're always going to run. Say your lab always sequences the v4 region: you could take the MiSeq SOP that's on the mothur wiki, put it into a batch file, and run it, and all you might need to change is the name of the input file. You don't even have to go into mothur to run things. And sure enough, you get the output here.

A third approach we've already seen in one of the scripts I have here. That would be to run code/mothur/mothur and then put, in quotes and starting with a pound sign (#), the list of your commands. So I can copy my commands from up here and separate them with semicolons. You can see now that I've got two commands here and that I end with a closing quote, and that tells mothur to run those two commands from the command line. It'll give you the same output.

These approaches vary in their level of interactivity. When I'm still trying to figure out what processes or steps I want for a particular analysis, I'll go into interactive mode. If I've got a whole bunch of commands and it's a fairly generic analysis, I'll typically use a batch file. If it's one or two commands that I'm trying to run, I generally do it from the command line using the quotes and the pound sign. The other nice thing about running it from the command line is that I can use shell variables.
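The three invocation styles, side by side, as a sketch (again, the mothur binary isn't available here, so these lines are illustrative; the paths follow this project's conventions):

```
# 1. Interactive: start mothur, then type commands at its prompt
code/mothur/mothur

# 2. Batch file: one mothur command per line in asv.batch
code/mothur/mothur asv.batch

# 3. Command line: quoted, starts with #, commands separated by semicolons
code/mothur/mothur "#dist.seqs(fasta=data/v4/rrnDB.unique.align, cutoff=0.05); cluster(column=current, count=data/v4/rrnDB.count_table, cutoff=0.01)"
```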
We've started incorporating some of that into our bash scripts that call mothur. If I look at this script, you'll see that I ran mothur from the command line, with the pound sign and the quotes, but I'm able to use my temp align and temp groups variables to run mothur. So those, again, are the three different ways of running mothur: you can go into mothur and give commands at the prompt, you can give mothur a script, if you will, of all the commands you want it to run, or you can run it directly from the command line. I'm going to primarily use the command-line approach as we go forward, and in fact you've seen me do this already.

So, what I want to do next: I'll remove that asv.batch file, because I'm not going to use it, and I'll touch code/get_asvs.sh. I'll come over to Atom, and I'm going to copy some of the code from get_unique_seqs.sh. I'll copy some of this header; I'll have to change it later. But we want to think now about how we want to run this script. What is the target? I'm going to give it the region, and the output I want will be a count tibble, like we saw with the ESVs. So before, this was our target, but instead of "esv" I'm going to put a distance threshold into the name. So I might put in data/v4/rrnDB.01.count_tibble, which would say: give me distances at a 1% threshold. We can then parse this name to get the different arguments for running mothur.

So what we're going to want to do is something like dist.seqs, where we give it column equals something with a cutoff of 0.05, and then cluster with a cutoff of, say, 0.01, where we'll say column=current and count equals something.
So we've got those two commands, and then we're going to do make.shared. I guess, yeah, that should work. Good. So now we need to fill in the arguments.

One of the other things I forgot to mention about running these commands in mothur: if I cycle back through my history here, I don't necessarily need to give cluster my column argument. I could say column=current and run that, and it will know to use the current distance file that it has in memory; you see that it's using that as my distance matrix. Alternatively, I don't even need to give it the column argument, the distance matrix argument, at all, because it knows that cluster is supposed to use a distance matrix. I can run this, and it'll take a few moments, but sure enough, you see it's still using that distance matrix and everything runs well. So coming back to my script here, I know that I can give it the column, but I don't necessarily need to. Sometimes being explicit with arguments is helpful. But you'll see that make.shared doesn't need any arguments, because it will take the output from cluster and the count file that I used here as an argument as its current inputs.

Now, something that occurs to me is that I'm going to run this script a bunch of times, for many different distance thresholds, maybe from 0.1% up to 5% with many increments in between. So I don't want to generate this distance matrix over and over and over again. What I'm going to do instead is make another script that I'll call get_distances.sh: touch code/get_distances.sh. If I return to Atom, let's copy that into here. Here, the output is going to be rrnDB.unique.dist, so I'm going to substitute dist for align. I'm going to call this variable align, and I don't need that temp file.
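The "current" mechanism can be sketched like this: whenever a command produces or consumes a file of a given type, mothur remembers it, so later commands can either name it explicitly as current or omit the argument entirely. The session below is illustrative, with the same hypothetical file names as before.

```
mothur > dist.seqs(fasta=data/v4/rrnDB.unique.align, cutoff=0.05)

# explicit: name the remembered distance matrix as "current"
mothur > cluster(column=current, count=data/v4/rrnDB.count_table, cutoff=0.01)

# implicit: column omitted; the current distance matrix is used anyway
mothur > cluster(count=data/v4/rrnDB.count_table, cutoff=0.01)

# make.shared can run with no arguments, picking up the current list and count files
mothur > make.shared()
```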
And here, I'm only going to run dist.seqs, and I'll give it my align file, so I can put in column equals the align variable. That should be good. Now I'm going to create a target for this in my makefile, and I'll come down here after all the ESV stuff. So that's going to be the target; the prerequisite is going to be the unique align file; and ahead of that I'm going to need code/get_distances.sh. I also need to set mothur as a dependency. We'll then use the automatic variables, and the target, because we're going to give the script the target. And I need to remember to make this executable. Hopefully you remember from many episodes ago that if we create a bash script, we need to make it executable; otherwise we get an error about it not knowing what's going on.

So I'll do make -n on that, and you see that the code it's going to run is code/get_distances.sh, and then it gives it the target. I'll go ahead and run that. And it's complaining: in my code, I used column rather than fasta in dist.seqs. And now it's working and everything is good. So get_distances.sh is in good shape, and I'll document it: it takes in a fasta file and calculates a distance file in mothur's column format, and the default is to use a cutoff of 0.05. The input is the target, and that will be the distance matrix.

So we're good with get_distances.sh; we can close that. Now we come back to get_asvs.sh, and we're not going to run dist.seqs here; we're going to run cluster and then make.shared. So let's create a few variables. Again, I'm going to set a target variable equal to that. Then my stub: I don't need the temp in the stub, I don't believe, which is good. And I'm going to need a column, or I'll say distances equals that, count equals something, and cutoff equals something else.
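The get_distances rule being described can be sketched as follows; the paths are my reconstruction of this repo's layout, and the chmod line is the "make it executable" step mentioned above.

```make
# Sketch (names reconstructed from the discussion).
# $< is the first prerequisite (the script) and $@ is the target, so the
# recipe effectively runs: code/get_distances.sh data/v4/rrnDB.unique.dist
# (the recipe line must be indented with a tab)
data/v4/rrnDB.unique.dist : code/get_distances.sh \
		data/v4/rrnDB.unique.align \
		code/mothur/mothur
	$< $@
```

Before the first run, the script needs the executable bit: chmod +x code/get_distances.sh.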
And yeah, so we can then say: let's see, our distances are going to be the target with unique.dist on it, and our count is going to be the count table, which is good. Our cutoff we're going to process a little differently. So let me set threshold as a variable. I'm going to copy this code, which you'll recall used sed to pull out the path from the project root up to the rrnDB part. I actually want the stuff after rrnDB, which is going to consist of a period, then a series of numbers, and then the count_tibble ending, and we replace all of that with whatever's in the parentheses, the capture group. Let me double-check that that works: if I echo the threshold variable, we get 01. So then my cutoff is going to be 0.$THRESHOLD, and echoing the cutoff gives 0.01. Excellent.

So now I need to fill in all of my mothur arguments: this is going to be my distances and my cutoff, my count is going to be the count variable, and my cutoff is going to be the cutoff variable. I'll put that on a separate line, and we'll go ahead and run this. Again, this is going to take a few moments to run, especially if it complains. I need to actually load my arguments, so let me copy and paste the whole thing. And it's not happy. Why? It can't open the file. If we look at what it's actually putting in, we see it's got 01.count_tibble.unique.dist in the name, so wherever I'm making my distances is wrong: I didn't want the target, I wanted the stub. Bonus points to you if you caught that. So let's copy and paste this in; that looks good. And this will take a few moments: the cluster step runs pretty quickly, but the make.shared part takes a bit longer. While that's running, I'm going to copy these output file names, because I'm not really going to need them for subsequent steps, and we'll add that down here.
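The name parsing above can be sketched in bash like this; the exact sed pattern in the script may differ, and the target file name is hypothetical, but the idea is the same: pull the digits out of the target, and derive the stub and the cutoff from them.

```shell
# Hypothetical target name; the goal is to pull "01" out of it and
# turn that into the cutoff 0.01.
target="data/v4/rrnDB.01.count_tibble"

# stub: drop the ".<digits>.count_tibble" tail
stub=$(echo "$target" | sed -E 's/\.[0-9]+\.count_tibble//')

# threshold: keep only the digits between "rrnDB." and ".count_tibble"
threshold=$(echo "$target" | sed -E 's/.*rrnDB\.([0-9]+)\.count_tibble/\1/')

cutoff="0.${threshold}"

echo "$stub"       # data/v4/rrnDB
echo "$threshold"  # 01
echo "$cutoff"     # 0.01
```

This is also why appending .unique.dist to the target instead of the stub fails: the target still carries the 01.count_tibble part, so the distance file name comes out mangled.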
This is part of what I call our garbage collection. Garbage collection in programming, I think, refers to how a language treats certain variables and when those variables get tossed; here we're going to toss files. I'm going to replace the literal stub in these names with the stub variable. All right. We'll see if any other temporary files get created that we'll want to delete, but I kind of doubt it.

And while that's running, I'll come back up and modify my header here. The old description, taking in a fasta file and such, is out of date. It now takes in a distance file, a count file, and a cutoff value, and generates a count tibble. Let's see. Yeah. We'll say: to indicate the number of times each ASV appears in each genome. OK? So the input is the target, and this is going to be a file name in the format of, say, data/v4/rrnDB.01.count_tibble (sometimes Atom adds all sorts of crazy stuff when I type this). This file name is also the output, and the 01 in this example is the threshold. So that looks good.

And this has run, and it hasn't created any other files that we'll want to get rid of just yet. So it creates this opti_mcc shared file, and I want to save that, because in the next episode we're going to come back and convert this shared file into our tibble. We're going to effectively copy or move it to be data/v4/rrnDB.01.count_tibble, but we need to write some R code to do that conversion of the shared file into a tibble. So this is good for now.

Let's go ahead to our makefile. We'll say the target is data/v4/rrnDB.01.count_tibble. You know, we haven't created the count tibble yet, but make doesn't know that. And so the recipe will run code/get_asvs.sh, and the dependencies for this are going to be, again, our distance file as well as our rrnDB count table, as well as code/mothur/mothur; I'll put the mothur files on their own line, and that looks good.
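The garbage-collection step can be sketched in bash; the opti_mcc intermediate names are my assumption about what mothur's cluster step leaves behind, and the scratch directory just makes the sketch self-contained and runnable.

```shell
# Stand-ins for the real intermediates, created in a scratch directory
scratch=$(mktemp -d)
stub="$scratch/rrnDB.01"
touch "$stub.opti_mcc.list" "$stub.opti_mcc.steps" "$stub.opti_mcc.sensspec"

# Garbage collection: toss the intermediates we won't need downstream
# (in the real script, the .shared file is kept for the tibble conversion)
rm -f "$stub.opti_mcc.list" "$stub.opti_mcc.steps" "$stub.opti_mcc.sensspec"

ls -A "$scratch"   # nothing left
```

Building the names from the stub variable, rather than hard-coding them, means the same cleanup works for every region and threshold the script is run on.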
Something that stands out is that I've been hard-coding v4, but I can replace that with a percent sign, so I can get this output file for any of the regions. And we are then going to pass, let's see, the first prerequisite as well as the target. And again, we need to make this file executable. Let's see if we want that. Huh: get_asv, get_asvs. All right, so that's good. And so I'll go ahead and do make data/v4/rrnDB.01.count_tibble. It's going to run get_distances as well as get_asvs, probably because I just saved get_distances. So we'll run this. It's going to take a few moments; hopefully there won't be any bugs or problems, and I'll do some editing and we'll see in just a second.

OK, so that ran through. As it was running, I was reminded that, just as I called the new script get_asvs.sh, I have getesv.R, and that inconsistency is just going to drive me nuts. So I'll go ahead and do a git mv to rename it to code/get_esv.R. And I'll need to update my makefile, because that will cause all sorts of problems if I don't. In the get_esv rule, again, I use the automatic variable, so just because I changed the name here, I don't have to change it in the recipe. That's winning. Again, the downside of using the automatic variables is that the code isn't quite as readable. Then, in get_unique_seqs.sh, I had that line commented out; I'm going to go ahead and remove it and save. And I think we're in good shape.

So I'm going to stop here. I'll commit everything in a few moments, but what I want to leave you with is thinking about a few things. First of all, I've got v4 here; this needs to be a percent sign, and that's a percent sign. Good. So now we can run this for any region, but we can only use it for one distance threshold: 01. I could run this and do, say, 001, but for every threshold I'm going to have to have a separate rule.
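The percent-sign idea can be sketched as a make pattern rule; the paths are my reconstruction, and note that only the region is generalized, which is exactly why every threshold still needs its own rule for now.

```make
# Sketch: % matches the region (v4 and the others), so one rule covers
# every region. The threshold (01) is still hard-coded in the target name.
# (the recipe line must be indented with a tab)
data/%/rrnDB.01.count_tibble : code/get_asvs.sh \
		data/%/rrnDB.unique.dist \
		data/%/rrnDB.count_table \
		code/mothur/mothur
	$< $@
```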
And that's just horribly tedious. So in a future episode (probably not the next one, because next we're going to convert our shared file into this count tibble file, but the one after that), we're going to talk about how we can generate object names, variable names, target names, whatever, in make, so that I don't have to have a bazillion rules for every distance threshold we want to run. And that's going to be really powerful. It's a little bit of an esoteric, higher-level step in thinking about makefiles, but I think you'll see it's pretty cool, and I'll show you an approach that makes it a lot easier than it could be to generate these target names.

All right. So we've done a lot today. Hopefully you learned something new about using mothur. I do teach mothur workshops, so if you're interested in taking a workshop in the future (I don't want to be shilling for myself), those workshops are posted on the mothur website if you want to learn more about them and about how we use mothur. So again, thanks for watching. I really do get a lot out of seeing people's compliments and questions and seeing the views that we get, so please keep them coming. Be sure to tell your friends about what we're doing over here at Code Club. So keep practicing, and I'll see you next time for another episode of Code Club.