There's a popular saying attributed to Phil Karlton: there are only two hard things in computer science, cache invalidation and naming things. Don't worry, I don't have a clue what cache invalidation is and we're not going to talk about it, but I can't tell you how right he is about naming things. Throughout the past 30 or so episodes, I've been working on a project that I've since realized I didn't name correctly. In today's episode, I'm going to refactor my code to reflect a better name, and we'll talk about how our reproducible practices of using make and version control can help us out. But before we get going too far, I wanted to let you know that I'm Pat Schloss and this is Code Club, where we learn about reproducible practices to improve our data analysis skills. Please be sure to subscribe to the channel and hit that bell icon so you know when the next episode is released. Alrighty, let's get going.

For the past 32 episodes or so, I've been investigating what I've been calling amplicon sequence variants, or ASVs. ASVs are all the rage in microbial ecology right now. People want to use each 16S rRNA gene sequence to represent a different bacterial group. As we've shown, there's a significant risk of splitting a single cell into multiple groups because a genome can have multiple different ASVs. Furthermore, we've also shown that an ASV can appear in multiple species. Not good, eh? To do this analysis, I've been using data obtained from the rrnDB, which provides the 16S rRNA gene sequences for thousands of bacterial genomes. Because we trust those genome sequences, we've been treating each sequence as a distinct ASV. The problem, though, is that in actual application, like if I'm sequencing 16S genes from a soil sample, I don't have as much confidence in those data as I do in the genome sequences.
So when people generate ASVs using popular tools like DADA2, UNOISE, Deblur, or mothur's own pre.cluster, they allow for a bit of slop to clump together sequences that are within a few bases of each other, because they know there's some sequencing error. Who cares, right? Well, there's conflicting jargon in the field. Some call these things amplicon sequence variants, or ASVs, as we have been, and others use exact sequence variants, or ESVs. I think most people use the terms interchangeably because they haven't thought too deeply about what the names imply, and they haven't really dug into the methods. Considering that the methods allow for some slop because of residual sequencing error, I will start referring to sequences from the clean genomic data that we've been working with as ESVs, because they don't allow for any slop. The other data, which has some slop, we'll call ASVs. In the next few episodes, we'll go on to look at how we can include some of that slop in those units, which, again, I'll start calling ASVs. We'll talk about how we'll calculate the ASVs in the next episode or two, but to get ready for that, we need to replace ASV with ESV throughout the analysis.

I'm sure people are closing the video in droves right now and running away from YouTube. Hold on! Although there's a lot of find all, replace all to make this change, we'll also see how we can use make and git to help us refactor our code. We'll see how we can convince make to keep our intermediate and secondary files, how we can use patterns to tidy up our rules in make, and how we can use phony rules in make to build a bunch of targets all at once. We'll also use git to protect ourselves in case we get overzealous in changing or deleting anything. Please check out the blog post that accompanies this video, where you'll find instructions on catching up, reference notes, and links to supplemental material. The link to the blog post for today's episode is below in the notes.
As you see, I've already created an issue for today's work: we're going to refactor the project to indicate that we've been working with ESVs rather than ASVs, and then we want to use the refactored code to rebuild everything, regenerating all the data and all the analysis. In future episodes, of course, we'll come back and work with ASVs. Okay, so we're going to create a branch for our issue. The other benefit of having these issue branches is that if I royally screw everything up, I can burn the branch to the ground and I haven't lost anything. Of course, we can do that with anything in git, even without those issue branches, but it compartmentalizes our changes a little bit more.

I'll go ahead and use Atom to open up where we are for today. Within Atom, I'm going to do some brute-force changes across the project, and we'll do that with Find in Project. If you do Shift-Command-F, that will open up this tab, and you'll know you're there because it says Find in Project. In the project, I'm going to search for ASV, all caps, and I'm going to click on these letters over here to match the case. I'm going to replace that with ESV. And I'm going to put in exclamation point, star, dot md (!*.md), so this should change everything except for the markdown documents. Those I don't really care about because I'm going to end up deleting them or regenerating them anyway. I can do Find All, and it says there are 60 results in five files. If I click on these little triangles, or pointers, I see that those changes are in my exploratory files and my README file. So I'll go ahead and change those. And then I'm going to do asv to esv in lowercase: Find All, Replace All. Of course, doing this one by one would just be horribly painful. I think there are also some in my R scripts. So if I search in my *.R files, maybe I'll remove the case sensitivity.
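For anyone who is not using Atom, the same project-wide replacement can be approximated at the command line. This is only a sketch under stated assumptions: it works in a throwaway directory, the file names are hypothetical stand-ins rather than the files in the episode's repository, and it uses grep and sed instead of the editor's Find in Project.

```shell
# Throwaway project so the sketch is safe to run (all file names are hypothetical).
demo=$(mktemp -d)
mkdir -p "$demo/code" "$demo/exploratory"
printf 'count_asv <- nrow(ASV_table)\n' > "$demo/code/example.R"
printf '# ASV notes\n' > "$demo/exploratory/notes.md"

# List every file that mentions ASV or asv, skipping rendered markdown (*.md),
# then rewrite each match through a temporary copy.
grep -rl --exclude='*.md' -e 'ASV' -e 'asv' "$demo" | while IFS= read -r f; do
  sed -e 's/ASV/ESV/g' -e 's/asv/esv/g' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

# The R file now uses the new name; the .md file was left alone.
cat "$demo/code/example.R"
```

Because the project is under version control, running git status or git diff afterwards shows exactly which files the replacement touched, which plays the same role as Atom's color change for modified files.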
It doesn't seem to have found anything in an R file. Maybe if I do code/*.R? Nope, it doesn't have it there. Okay. I don't quite believe that, but I guess maybe it had already done it. Oh yeah, because I said don't change things in a markdown file, but change everything else. So we see here that convert_count_table_to_tibble.R already changed ASV to ESV, and count_unique_seqs.sh did it as well. One of the nice things about Atom is that if things are under version control, it turns files that have changed this kind of brownish mustard color. So that is good. I want to double check my README file, because in here I had said "how many ESVs", and I want to say "ESVs or ASVs", kind of getting ready for looking at ASVs. I think this all looks pretty good. I've got some extra text here that I think I'll get rid of; I'm not sure what that was even about. I'll save that. Oh, and I changed the title to "utility of ESVs and ASVs". Okay, so I'll save that and close it.

One of the things that I'd also like to do: if I look at data/v19, I see that I've got these rrnDB.count_tibble files. I'd like these to be rrnDB.esv.count_tibble, because down the road we'll also have ASVs, and I'd like to be able to keep two separate files. Maybe we'll wind up joining them together, but for now I want to work with those as ESVs. I think I set those names in my Makefile. We create the file in convert_count_table_to_tibble.R, and this takes an input file name and an output file name from our Makefile. So if we look at, where was it, convert, here. Let's see. This could be esv.count_tibble, and that's in processed. And we'll do the same thing here: esv, and I'll do some copying and pasting. That's the pooled file, and here are the individual files for the individual regions. So I'll do esv.count_tibble, and that all looks good. Perhaps we also need to modify count_unique_seqs.sh, so let me double check that. And so that takes in the target.
So it's taking the name from the file, and I think that is in good shape. I think everything is good there. All right, so we'll save that and we'll save our Makefile. Now, there's a lot of stuff that we need to clean up. Again, if I do git status, I see I've modified all these files. Something I notice is that some of our exploratory files have ASV in their names rather than ESV. I can change those by doing git mv. By using git with mv, we're keeping track of the old version of the file. So git mv this to this, but I'm going to change the A to an E. That was the coverage one. We also want to change the taxa overlap one and turn that second ASV into an ESV. And then there's one more here, this commonness and dominance one, and we'll change that to an E. I think we're good, and we can see that git is now keeping track of the renaming. Very good.

What I'm going to do is go ahead and commit these changes. We can always revise the commit, but what the commit allows us to do is roll back any changes to this point if, going forward, we just totally botch everything up. So I will go ahead and do git add on everything in exploratory, my Makefile, my README.Rmd file, and my code. So that's all staged. Then I'll say git commit -m "Rename files from ASV to ESV", or maybe not files, but "Rename ASVs to ESVs". And I forgot the closing quote, so I'll put that in. And as we see, my inability to type aside, this issue 33 tag is green, so everything is good to go and we're in good shape.

All right. Now what I want to do is start burning things down, start deleting things. Let's go ahead and remove the data that's in the variable region directories, the things that start with rrnDB. So I'll remove all that. If I look at data/v4, say, it's empty. And if I look at data/processed, I want to get rid of that also.
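The rename-then-commit checkpoint described a moment ago can be sketched as follows. The repository here is a scratch one created in a temporary directory, and the file name and the identity passed with -c are hypothetical placeholders, not the ones from the episode.

```shell
# Scratch repository; the file name below is a hypothetical stand-in.
repo=$(mktemp -d)
cd "$repo"
git init -q
printf 'placeholder\n' > asv_species_coverage.Rmd
git add asv_species_coverage.Rmd
git -c user.name='Demo' -c user.email='demo@example.com' \
  commit -q -m "Add exploratory file"

# git mv renames the file on disk and stages the rename in one step.
git mv asv_species_coverage.Rmd esv_species_coverage.Rmd

# git status should now report the change as a rename, not a delete plus an add.
git status --short

# Commit so there is a known-good point to roll back to before deleting anything.
git -c user.name='Demo' -c user.email='demo@example.com' \
  commit -q -m "Rename ASVs to ESVs"
```

Committing before the destructive cleanup is the design choice that matters here: anything removed by mistake afterwards can be recovered from this commit.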
So we'll remove the rrnDB files in data/processed, and I also want to get rid of all the exploratory stuff. You might say, well, maybe we should keep that stuff under version control to keep updating the changes. But we effectively are, because we're going to be regenerating a lot of that stuff. I guess those files that went from ASV to ESV will be gone, but that's cool. So what we can then do is rm exploratory/*.md, so that's anything that ends with .md, the rendered files. I'm also going to do rm -rf, so that's recursive and force, on anything in exploratory that ends in _files. That's all the images that were there. If we look at git status, we see a lot of stuff got tossed. And again, this is why I really like using git status, because I notice that I accidentally removed my README file. So I can do git restore exploratory/README, and my tab autocomplete isn't working because the file isn't there, right, .md. And let's make sure that everything looks right.

So all these things I've just deleted we should be able to regenerate using make. If I open up my Makefile, you'll recall that whenever I made one of these new rules I got a little bit uneasy, because I'm basically repeating the same chunk over and over again but changing the names of the files. At the top of this I have a little reminder of how you write a rule in a Makefile. You have a target, the file you want. So say I want this SILVA SEED reference alignment in data/references; that's the target, that's what we want the computer to build. Then after the colon we have all the prerequisites.
Then on the next line, after a tab, we have the instructions for going from the prerequisites to the target. We call that the recipe. As I scan through here, you'll notice that in various places I have these percent signs. Up here we used those percent signs to represent the different variable regions that we wanted, so the percent sign is kind of like a wildcard, a special kind of wildcard in make. So what I'm thinking about doing is using %.md as the target, which basically matches anything that ends in .md. Then, as a dependency, I'm going to require that same pattern but ending in .Rmd. That will take our R markdown document and render it to the markdown document. I also want these two files as prerequisites, which I'm going to copy and paste up here. You may notice that this first rule doesn't require the genome ID taxonomy file. If I over-specify the prerequisites, it's not a big deal. The other thing you'll notice is at the end of these lines where there are a lot of prerequisites or the names are quite long: if you put a backslash there, that allows you to continue the line.
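The pattern rule just described can be tried out in isolation. What follows is a minimal sketch, not the Makefile from the episode: the file names are hypothetical, and cp stands in for the R call that renders the document. The recipe uses $<, which is make's automatic variable for the first prerequisite, and $@, which is the target. The Makefile is written from the shell so that the leading tab on the recipe line is guaranteed to be a real tab.

```shell
# Throwaway directory; every file name below is a hypothetical stand-in.
demo=$(mktemp -d)
cd "$demo"
touch shared_counts.tsv report_a.Rmd report_b.Rmd

tab=$(printf '\t')   # recipe lines must begin with a literal tab character

# Unquoted heredoc: \$ keeps make's automatic variables away from the shell.
cat > Makefile <<EOF
# One pattern rule replaces a separate rule per report.
# % matches the same stem in the target and in the first prerequisite.
%.md: %.Rmd shared_counts.tsv
${tab}cp \$< \$@
EOF

# -n prints the recipe make would run without actually running it.
plan=$(make -n report_a.md)
echo "$plan"
```

Asking for report_b.md instead would reuse the same rule with a different stem, which is why one pattern rule can replace a run of near-identical rules.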
Then what we will do is I will copy this R line. We just want one tab, and we're going to render not that file; instead we're going to use one of the automatic variables, which is $<, the less-than sign next to the dollar sign. What that means is take this first prerequisite, whatever it ends up being, and shove it in there. So we'll go ahead and save that. What I'll then do is go ahead and delete all this text and save it. You might be saying, well, we've also got this rule here for creating a README.md file. What make will do is look for the most specific target first and use that rule, and for all of our other exploratory files it'll come back and use this pattern rule. We might need to revise this again in the future, but we'll see.

Something else that stands out to me is that we don't want the plain rrnDB file name; we want rrnDB.esv.count_tibble, so that it matches this target here. And of course we're then going to want to change all of our R markdown documents to also have esv. So I'll go ahead and open these up. I guess I could have done a find all, replace all, but it shouldn't be too bad to go ahead and modify these. Of course, we're going to rerun everything using make, and if we run into any errors it will definitely tell us. In a way this refactoring is good because it's forcing us to make sure that our work is reproducible. We should be getting the same results that we had initially after we've refactored everything. The refactoring should not change the results, because we're not really changing any of the code; we're changing what we're calling things.

Okay, good. So we've seen how we can simplify that long list of six or so different rules we had for generating those exploratory data files, and we are in good shape. Let's go ahead and think about how we would go about regenerating, say, one of these files. I'm going to copy the name of this first exploratory file that I deleted, and I'll do make -n,
a space, and then the target. That -n is going to tell us everything that make needs to run to regenerate all this stuff. One of the things I notice at the end of the output is that it then goes back and deletes all of the dependencies it had to create. Again, remember that make was created to compile source code into a final executable, and when you're creating that final executable there's no reason to keep all the prerequisites, or the intermediate targets, or what they call secondary targets. For our data analysis, we'd like to keep those around. So what we can do is, at the very top of our Makefile, put .SECONDARY with a colon. This is a special target that's built into make, and this one line, just .SECONDARY:, will tell make to keep all of the secondary files. So now if I redo that make -n, we see that we no longer have that rm line. We ran the same command, all we changed is that one line, and that rm goes away. So this would generate all of the files we need to then generate this output file.

Okay. One other thing that I want to show you about make before I do that is that it's possible to create what are called phony rules. These are rules that don't really have a file as a target. What you'll sometimes see when you're compiling code is you'll say make all or make install. You're not creating something called install or all. Or sometimes you'll do make clean; you're not creating something called clean. You're following the rule called all or clean. I'm going to create one called exploratory. The target is going to be called exploratory, and then after the colon is going to be all these .md files I deleted. And maybe what I'll do, because I'm a little lazy, I'm a lot lazy, is I'll do git status and I'll pipe that
to grep, and I will then pull out all the things that end in .md. And okay, the backslash forces the dot to be a literal period rather than a metacharacter; don't worry if that's over your head. I'm going to copy this in here and get rid of all that "deleted:" stuff. I'll go ahead and tab these all in a notch, and I'll put a backslash at the end of each line. One too many there. Save that. Now if I do make exploratory, it says nothing to be done for exploratory, the directory, and, oops, it's running. I didn't want it to run. Ah, so if I do make -n exploratory, it will then list all these things. One of the concerns, though, is that exploratory is also the name of a directory, as we saw. So what I'd like to do is use another special target called .PHONY. That tells make that exploratory is a phony target; it's not a real file to be generated. And so now if I do make -n exploratory, it will list everything needed to regenerate all these things.

I think I hit Control-C quickly enough, before all this stuff got very far, that I didn't cause any problems. I think it got into v19, so I'll look at ls data/v19, and I see that I still have some stuff there. I'll go ahead and remove that rrnDB stuff. And again, if I do make -n exploratory, it says it doesn't have a rule to make, ah, this asv species coverage file needed by exploratory. So let me see. Ah, when I copied and pasted it, I left in the asv. So I'll save that, look at this again, and now the error messages go away. I think it'll work. So I'll go ahead and do make exploratory, and this is going to take a while to run. If you're doing this alongside me, that's awesome. I'm going to do some creative editing so that we don't just stare at a blank screen or watch the scroll for a few minutes while it's running. If you're looking for something to do, instead of going and reading the web or checking your Instagram, Facebook, or Twitter accounts, go ahead and like this video and subscribe. Be sure you click on that bell icon so you know when
the next video is released. And we'll be back in just a couple of seconds through the wonders of editing.

So it appears that we ran into a bit of a problem processing the v19 data. If I do ls -lth data/v19, I see that there is a count_tibble file, but the esv.count_tibble file doesn't exist. All right, so let's double check where we generated that. That's there, rrnDB.esv.count_tibble, and it was coming out of code/count_unique_seqs.sh. So let's pull up code/count_unique_seqs.sh. It brings in the target, it generates the unique file, and I think our stub might not be happy. So let's try running the code line by line. Now that I think about it a little bit more, I don't want the target to be .esv.unique.align; I want the tibble file to be the one with .esv in its name. So this, where I run convert_count_table_to_tibble.R, should output esv.count_tibble. I think this should work now. I'm going to go ahead and give it another run, and we'll do make exploratory, nope, without that forward slash, and we'll see how things go.

All right, so it looks like we ran through and got to rendering the exploratory data analysis files. There were no other errors further up that I can see, but it hit this first one, and it's complaining that the esv.count_tibble file doesn't exist in data/processed. So let me look and see what's in data/processed, and I see, sure enough, rrnDB.count_tibble. Let me go back to my code. This was generated in combine_count_tibble_files, and I see, sure enough, we had rrnDB.count_tibble; that should be esv.count_tibble. Let me double check in our Makefile that we have that. So that's right there. I'll go ahead and remove data/processed/rrnDB.count_tibble, and then if I do make -n exploratory, again without that slash, we see that I'll go ahead and combine it and then
I'll re-render it. So again, fits and starts, but because we have make, it allows us to figure out where our dependencies aren't quite where we need them to be and how we need to modify our code to get it all to work smoothly.

All right, so we hit another snag here in generating the first of the exploratory files. I'm going to go ahead and open up that R markdown document. Let's see, that was 10-05, so I guess it wasn't the first one. I'm kind of wondering what happened with these others, that they went through. Okay, well, if there was a problem here, maybe there's a problem with my code. So I'm going to start by doing something a little bit different: I'm going to run R from the terminal here. We could do it in RStudio, but for these types of things, where I'm trying to debug things in my Makefile, it's sometimes easier for me to do it in the terminal, and it's just good to know how to do it. So I'll go ahead and copy and paste those lines in. Let me look at metadata_esv to make sure that all looks good. One thing I notice, sure enough, is that this region column is supposed to be a region, but it's actually text; it's the file path, I think. If I look at metadata_esv and select region, I have the paths. That tells me that something didn't get parsed right when we pooled the data. So let's look back here, and we see, sure enough, we parse out the region here in these parentheses, so I need to have rrnDB.esv.count_tibble there. Let me quit out of R. We see all the stuff we've changed. If I do make -n exploratory, we see everything that needs to get rerun, so we'll go ahead and do that, and hopefully this does the trick.

Well, wonderful. It looks like everything finally went through without a hitch, no errors, and we're in good shape. So that makes me very happy. Let's go ahead and do git status to see where we're at. I'm mainly looking at these deleted lines to understand what we might have deleted or gotten rid of
that we didn't intend to. And those all look pretty good. One thing that occurred to me as I was running all this, that I don't know that I did update, is my README file. So if I do, not git, make README.md, that needed to be re-rendered. Okay, so everything looks good and it looks like everything's been updated. Again, the value of having version control is that if we've accidentally deleted something, like we saw before with one of those README files, it was easy enough to do git restore and bring it back to life, so to speak.

So I will go ahead and add all this stuff. Oh, I guess I didn't add the code, or, I thought I did. Oh, I guess because it didn't like one of the arguments, it rejected all of them. So I'll do git add Makefile README.md and code. Good, everything's added. I'm going to add this to the previous commit because I think everything looks great. So I'll do git commit --amend. This brings us into nano, and I'm going to say "Rename ASVs to ESVs and regenerate all data". Okay, save that, exit, and we're good. I need to amend that again because I want to say "Closes #33". And then we will do git checkout master and git merge with the issue 33 branch. Because of the way that timestamps work with git, when you merge a branch like we just did, that actually screws up all of our dependencies. So if we want to double check that everything is right, we would want to rerun make exploratory now. I'm not going to subject you to watching that or sitting through that again. Everything will work. So I'll go ahead and do git push, and life should be good from here on in.

So again, not a big concept that we covered in today's episode. I was changing the name from ASV to ESV throughout the project, and that gave us a good opportunity to delete all the intermediate files and then to see how we can use make to regenerate those intermediate files, make sure that they don't get deleted, and make sure that
we've got all our dependencies set up correctly, so that if we rerun it then everything works. Maybe in a future episode what I should do is accidentally delete my whole project directory and show you how I can do a git clone, where I get a copy of the repository from GitHub, and then do make exploratory again and regenerate everything. So again, this combination of version control and make is really instrumental to me for doing a reproducible data analysis, and I hope you find the same thing. Keep practicing these concepts. Don't be afraid to make large overhauls of your project; again, using tools like make and git will make it a lot easier and a lot safer. Please tell your friends about these Code Club episodes. If this is your first one, welcome! Feel free to go back and check out previous episodes. We're going to keep going now, talking about amplicon sequence variants for real, and how we might generate those using the genome data that we've gotten from the rrnDB. So we'll see you next time for another episode of Code Club.
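As a recap of the two special targets from this episode, here is a small sketch that can be run in a scratch directory. It rests on assumptions: the file names and the two-step chain (raw to counts to md) are hypothetical stand-ins for the real pipeline, cp stands in for the real commands, and the Makefile is written from the shell so the recipe tabs are real tabs.

```shell
# Throwaway directory; names and the two-step chain are hypothetical stand-ins.
demo=$(mktemp -d)
cd "$demo"
mkdir exploratory            # a directory that shares the phony target's name
touch a.raw b.raw

tab=$(printf '\t')           # recipe lines must begin with a literal tab

cat > Makefile <<EOF
# Keep every intermediate file instead of deleting it after the build.
.SECONDARY:

# "exploratory" names a task, not a file, even though a directory has that name.
.PHONY: exploratory
exploratory: a.md b.md

%.counts: %.raw
${tab}cp \$< \$@

%.md: %.counts
${tab}cp \$< \$@
EOF

# Dry run: list what would be built, without building it.
plan=$(make -n exploratory)
echo "$plan"
```

Deleting the .SECONDARY: line and repeating the dry run should bring back a trailing rm of the intermediate .counts files, which is the behavior noticed in the episode.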