Pete and Repeat, my two dear friends, were sitting in a boat. Pete fell out. Who was left? Pete and Repeat, my two dear friends, were sitting in a boat. Pete fell out. Who was left? Repeat and Pete were sitting in a boat. Repeat fell out. Who was left? Oops, I made a mistake.

What does this annoying joke have to do with programming? Well, keep watching and I'll tell you.

Hey folks, my name is Pat Schloss and this is Code Club. What I just illustrated with that joke, which is just so annoying, is the value of DRY. DRY is a principle of programming practice, and it's an acronym for "don't repeat yourself." As I illustrated by repeating that joke, when you repeat yourself you are prone to introducing problems, which in my case killed the joke, although the joke was pretty much dead already. Anyway, in programming we want to minimize the amount of repetition in our code, so we want to keep our code DRY.

If you've been following along in recent episodes of Code Club, you know that I've been working through how we might use a new R package developed by my research group, called mikropml, to identify biomarkers within a person's fecal microbiota that we can use to distinguish people with normal colons from people with severe polyps or screen-relevant neoplasias (SRNs) in their colons. As we've been going through, I've been illustrating things using an L2-regularized logistic regression model.

We started with this run_genus_split.R script. This script runs once for each seed of a random number generator. We call it 100 times, which allows us to do 100 80/20 splits: 80% of the data is used for training and validation, and the held-out 20% is used to test the resulting model. We do that 100 times so we get a robust feel for how the model is performing and to make sure it's not overfit. We built this using genus-level relative abundance data. That's coming in here in the select statement with the taxonomy and the other relevant columns. We go through and build out the rest of the model, and then we write the data out to an RDS file, which is a binary file created by R.

In a more recent episode we went back and made basically the same exact script, only we added the column fit_result. The FIT result is a measure of the amount of blood in a person's stool sample. So as you can see, these two scripts are basically identical. How identical? Well, it's kind of hard to see from this vantage point, so we'll use a couple of handy functions from the command line that come to us from a Linux type of environment. I want to look in the code directory at my two R scripts that end in split.R. Again, we've got run_genus_fit_split.R and run_genus_split.R. I can then use a handy-dandy function called diff and run diff code/*split.R.

What this output shows is a little bit confusing, but you'll see that the code is output in chunks separated by three hyphens. We have these first two lines that correspond to run_genus_fit_split.R, and then these two lines below the hyphens that correspond to run_genus_split.R. This is telling us that these two lines are different between the two files. As you can see, in the first line the only thing that's different is the variable name, and in the second line the only thing that's different is fit_result. Similarly, if you come down further into the file, diff tells us that there's this line that's different, and again that's the difference in the variable name. And down here, again, the only thing that's different is the variable name. So we have two scripts, and there's really only one meaningful word that's different between the two R scripts.
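For anyone who would rather stay inside R than use the command line, the same "which lines differ?" question can be roughed out with base R. This is only a sketch under stated assumptions: the two stand-in scripts below are made-up three-line files, not the real scripts from the episode, and the set comparison is a simplification that ignores line order and repeated lines, so it is not a replacement for diff.

```r
# Two tiny stand-in scripts (hypothetical contents, not the episode's files).
genus_lines <- c(
  "srn_genus_data <- composite %>%",
  "  select(group, taxonomy, rel_abund, srn)",
  "saveRDS(model, file = output_file)"
)
fit_lines <- c(
  "srn_genus_fit_data <- composite %>%",
  "  select(group, taxonomy, rel_abund, fit_result, srn)",
  "saveRDS(model, file = output_file)"
)

# Write them to temporary files so the example reads real files from disk.
genus_file <- tempfile(fileext = ".R")
fit_file <- tempfile(fileext = ".R")
writeLines(genus_lines, genus_file)
writeLines(fit_lines, fit_file)

a <- readLines(genus_file)
b <- readLines(fit_file)

# Lines found in one file but not the other, plus the lines they share.
only_in_genus <- setdiff(a, b)
only_in_fit <- setdiff(b, a)
shared <- intersect(a, b)

print(only_in_genus)
print(only_in_fit)
```

The lines that land in shared are the duplication, and the lines that do not are the candidates to pull out into an argument.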
So as we think about adding more features to our model, or perhaps using a different modeling approach, it seems we're going to keep propagating different R scripts to carry out the analysis we want. That is not DRY. Again, DRY stands for "don't repeat yourself," and as I've illustrated, I'm clearly repeating myself. If I look back at my individual R scripts, I've got about 32 lines of code in each of them, and there's only one meaningfully different line among those 32 lines between the two R scripts. Again, that is not DRY.

The reason that's a problem is that if I keep copying and propagating this script, I'm very likely to introduce bugs as I go. If you've watched any of my past episodes, you know that I am very prone to introducing typos and doing silly things. So as much as possible, I would like to have a single script that I can feed arguments to, where the argument is the thing that changes across the different executions. Maybe I'll have an argument for running it using just the genus-level relative abundance data. Perhaps I'll have another argument for feeding in that same data but also the FIT result. Or perhaps we'll also do it using a random forest, but to do random forest we'll also have to give it different hyperparameters.

As you go through, you begin to see where all this replication and duplication could cause problems. Perhaps I want to use one set of hyperparameters for all my logistic regression models, but I'm copying and pasting between different files and, lo and behold, I've changed the parameters between my different runs of this script. Say I have five different versions of this script for building a logistic regression model, but I accidentally introduced changes into the hyperparameters. That could cause a big problem, right?
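The copy-and-paste hazard described above can be pictured with a small base R sketch. Everything here is hypothetical: describe_run() is an illustrative helper, not a mikropml function, and the lambda values are placeholders rather than the ones used in the episode. The point is only that the shared settings live in one place and the thing that varies comes in as an argument.

```r
# Shared settings defined exactly once (values are illustrative only).
# alpha = 0 is the glmnet setting for a pure L2 (ridge) penalty.
l2_hyperparameters <- list(alpha = 0, lambda = c(0.1, 1, 10))

# One function describes a run; only `features` changes between calls.
describe_run <- function(features, hyperparameters = l2_hyperparameters) {
  list(features = features, hyperparameters = hyperparameters)
}

genus_run <- describe_run("genus")
genus_fit_run <- describe_run(c("genus", "fit_result"))

# Because nothing was copied and pasted, the two runs cannot drift apart.
same_settings <- identical(
  genus_run$hyperparameters,
  genus_fit_run$hyperparameters
)
print(same_settings)
```

With five pasted copies of a script, that identical() check is something you would have to verify by eye in every file.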
So if I had a single script to run the logistic regression, and I only feed it the things that are going to change, I will then have DRY code.

We often see this problem within a single script, and I have heard it called spaghetti code: somebody might have a few hundred lines of code, and they're repeating the same chunk of code multiple times. I've seen this in my own lab, where somebody was reading data coming out of a 96-well plate reader, and for each plate they had the same chunk replicated. If there were five plates, they had that chunk repeated five times, and each time they had a subtle tweak to how that plate was being read, because the plates didn't all have the same layout. Instead of creating a single function that was called five times, they were repeating that code five times. If they introduced a bug into one of those chunks, they would need to update that chunk as well as all the others, and it just got really messy.

So again, we want to dry out our code as much as possible. As I mentioned, we can use things like arguments, functions, and script files to help dry out our code. As we've already seen in this R script that I've been talking about, we source code/genus_process.R. That is an R script that does a bunch of preprocessing of our data to bring together our relative abundance data and our metadata, so we don't have to replicate that in all of our different R scripts. So we've got some elements of DRY coding already, but the point of today's episode is to see if we can't go further in drying out our code, so we can then replicate what we're currently doing with those logistic regression models and do it with random forest as well.

As I've already shown you with the output of the diff function here in my RStudio terminal window, there's not much different between these two scripts. What I'd like to do is consolidate these two R scripts into a single run_split.R script that can be fed information about what is going on in the select line, so that I only have one R script. So let's go back to our source window and see how we might go about doing that.

To start, I'm going to rename my run_genus_split.R. I can check the box next to run_genus_split.R, click Rename, and call it run_split.R. So now I've got run_split.R.

My plan here is to have a separate R script for each different type of model that I'm going to build. That script will then become an argument that I get through this commandArgs function. Currently we're getting the output file name, which is where we then also get the seed. We're going to bring in another argument, and that will be the type of model: the name of the R script that is going to bring in all this extra goodness to keep this run_split.R script as DRY as possible. That R script is going to contain some version of this select line, because that's the only thing at this point that's varying between our different R scripts.
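As an aside on the existing argument handling mentioned above, pulling the seed out of the output file name might look something like the sketch below. The helper name extract_seed and the exact file name pattern (an underscore, the seed, then .Rds) are assumptions made for illustration. The script in the episode may do this differently.

```r
# Hypothetical helper: recover the seed from a name like
# "processed_data/l2_genus_1.Rds" (pattern assumed: "_<seed>.Rds").
extract_seed <- function(output_file) {
  # The greedy ".*_" runs up to the last underscore, so extra words such as
  # "fit" earlier in the file name do not get in the way.
  as.numeric(sub(".*_(\\d+)\\.Rds$", "\\1", output_file))
}

seed_genus <- extract_seed("processed_data/l2_genus_1.Rds")
seed_fit <- extract_seed("processed_data/l2_genus_fit_42.Rds")

print(seed_genus)
print(seed_fit)
```

Because the seed travels inside the file name, the script needs no separate seed argument, which is one less thing to keep in sync.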
So I'll have a variable called feature_script, which for now I'll set to code/l2_genus.R. I haven't created that yet, remember. Then we'll call source on feature_script. That way, whatever value comes to us from commandArgs will be fed into source, and it will replace this line 14.

I'll go ahead and copy that line, create a new R script, and plop that select function call in there. I'm going to create a new function that I'll call feature_select. This will be a function, and the argument coming into it will be x. That x is basically the composite data frame, or whatever is upstream of this select function call. The function will then run select, but we need to pipe x into the select function, and I'll get rid of that final pipe. Again, each different type of model that we create will run its own version of feature_select. I will save this into code as l2_genus.R.

Back in run_split.R, instead of this select line, what we'll now do is call feature_select. Okay, let's run everything upstream and hope for the best. I'm going to put in an output file name for testing purposes, so I'll set output_file to processed_data/l2_genus_1.Rds. Again, the seed line extracts the seed from that name, so 1. Then we have feature_script and we'll source that, and now we run this. It goes through without an error, and we can then look at srn_genus_data and we get, again, all of the bacterial taxonomic names. We're in good shape.

So now we're set up to be DRY. Again, I have srn_genus_data, and I can probably remove "genus" from everywhere in my script here.
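Here is a minimal sketch of the pattern just described: a per-model script that defines feature_select(), and a caller that pulls it in with source(). The column names and the toy composite data frame are assumptions for illustration, and the script is written to a temporary file only so that the example is self-contained. In the episode it lives at code/l2_genus.R.

```r
library(dplyr)

# Stand-in for code/l2_genus.R, written to a temporary file for this demo.
feature_script <- tempfile(fileext = ".R")
writeLines(c(
  "feature_select <- function(x) {",
  "  x %>% select(group, taxonomy, rel_abund, srn)",
  "}"
), feature_script)

# What the consolidated script would do: load whichever version of
# feature_select() the chosen model script defines.
source(feature_script)

# Toy composite data frame (made-up values and column names).
composite <- data.frame(
  group = c("s1", "s2"),
  taxonomy = c("Bacteroides", "Blautia"),
  rel_abund = c(0.30, 0.05),
  fit_result = c(0, 120),
  srn = c(FALSE, TRUE)
)

srn_data <- feature_select(composite)
print(colnames(srn_data))
```

A script for the genus plus FIT model would define the same function name but keep fit_result in the select() call, so the calling script never has to change.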
Let me see where I have genus and remove it: genus from the preprocess step, genus from the data, and so on. All these things we're doing make our code more DRY. So we've removed genus from all of the variable names. That was the other thing that the diff showed was different between our scripts.

So now run_split.R works. What we need to do next is get feature_script from the command line. To do that, I'm going to go back to my Makefile. We'll look at this chunk of code up here in the Makefile where we were building these files. I'll put in here code/l2_genus.R, so that it is a dependency of the rule. We're already passing in $@, which is the target name, and we can also feed code/l2_genus.R in as an input to run_split.R, not run_genus_split.R. So I'll also remove run_genus_split.R here.

Now, if we go back to run_split.R, we want to get feature_script from our inputs. I'm going to set output_file to be args[1], and my feature_script will be args[2], and the result of commandArgs will be saved as args. We should now be good to go ahead and make that target that I had put in here to test things. If I come back to my terminal, I can run make processed_data/l2_genus_1.Rds. Wonderful, that ran through without any errors or problems.

Now what I want to do is create an l2_genus_fit.R script, which will hopefully replace this run_genus_fit_split.R file. We can take very much the same idea that we had here but add fit_result, so we can replace that line that we had in run_genus_fit_split.R. That way we can use run_split.R as a single R script to run all the modeling for both types of models. Again, to repeat: we'll create another R script, I'll save this as l2_genus_fit.R, and I will grab this code and plop it in here. As we saw, we want this select line, and we'll get rid of that final pipe and save it.
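The argument handling described above for run_split.R might be sketched as follows. The parse_args() wrapper, the usage message, and the fallback values for interactive testing are my own additions for illustration. The episode simply assigns args[1] and args[2] directly.

```r
# Hypothetical wrapper around the two positional arguments.
parse_args <- function(args) {
  if (length(args) < 2) {
    stop("usage: Rscript code/run_split.R <output_file> <feature_script>")
  }
  list(output_file = args[1], feature_script = args[2])
}

# trailingOnly = TRUE drops R's own arguments and keeps only ours.
args <- commandArgs(trailingOnly = TRUE)

# Fallback for interactive testing only (example values, not defaults
# that the real script is known to have).
if (length(args) == 0) {
  args <- c("processed_data/l2_genus_1.Rds", "code/l2_genus.R")
}

inputs <- parse_args(args)
print(inputs$output_file)
print(inputs$feature_script)
```

In the Makefile recipe, the target name ($@) would fill the first slot and the per-model script the second, so make and the R script agree on the order.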
So now we've got l2_genus_fit.R, and we no longer need this run_genus_fit_split.R. I'll come back to my code directory, select run_genus_fit_split.R, and delete it. So that's good.

Now we need to update our Makefile as we saw earlier. We'll go back in here and remove the genus_fit from the script name. Alongside genus_process.R, we need to add code/l2_genus_fit.R as a dependency, the recipe will call run_split.R, and we'll add the specific R script for this modeling approach to the recipe for building the target. Let's save that, and now let's test it by coming back to the terminal and, instead of doing _genus_1, doing _genus_fit_1. We'll run that, and everything should be good. That ran without any issue, so we're in good shape to go forward.

It might seem like we haven't gained a whole lot, because we've actually gained a file to keep track of. But we've encapsulated the part that varies from run to run, or from model to model, in its own separate R script. We're focusing on the thing that varies between the different executions or the different types of models that we're creating, and run_split.R doesn't have to change, because it loads the information that's changing from these other R scripts. So this script remains DRY, and we're encapsulating away the parts that change.

Now that we have the code working with our L2-regularized logistic regression using genus-level relative abundances and the FIT result, I want to see if we can't get it to work with random forest as well. This is going to introduce a few more complications, because not only do we need to use that feature selection function, but we will also need to tell run_split.R what modeling approach we want and what hyperparameters we want. The hyperparameters for L2 logistic regression are different from those for random forest.

Let's start simple by copying what we've done already with L2 for random forest. I'll create an R script that I will save into code, and I will call it rf_genus.R. We're going to grab this feature_select function, because that is the same between L2 and random forest. So when we run this for the random forest, we're going to get this same feature selection.

Now what we want to think about are the hyperparameters. What we have down here in our code is test_hp. I don't know why we had called it test_hp. I think I'll call this hyperparameter, and of course we need that down here: hyperparameters = hyperparameter. Let's just make our indenting look nice. These hyperparameters are what was used for the L2 model, so I'm going to cut this out and put it up here as a variable in my l2_genus.R and in my l2_genus_fit.R.

Of course, I now need a hyperparameter variable for my rf_genus.R, but these aren't the hyperparameters I want. The hyperparameter I want is mtry. If you're not sure what hyperparameters are available for the modeling approach you want, you can call get_hyperparams_list and give it the name of your data frame, so mine is called srn_data, and then I'll put in "rf". I guess I hadn't loaded that. Now I have 8, 17, and 34. This is based on the total number of features in the data frame.

For demonstration purposes of getting things DRY, I'll use these three mtry values. But know that when you're doing this for real with your own research, you might want to futz with those mtry values, kind of like we did in a previous episode for logistic regression. You basically want your test AUC to be the highest, so you pick the mtry value that gives you the highest test AUC. I'm going to grab that 8, 17, 34, and I will say mtry equals those. And maybe I'll also put in here 100.
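To make the per-model hyperparameter lists concrete, here is a rough sketch of what the two lists might hold and how many tuning combinations each implies. The lambda values are placeholders I made up, and the mtry values are the ones mentioned above. In the episode each list is simply named hyperparameter inside its own script, and the two are given distinct names here only so they can sit side by side.

```r
# Illustrative L2 settings (lambda values are made up for this sketch).
l2_hyperparameter <- list(alpha = 0, lambda = c(0.1, 1, 10))

# Random forest settings using the mtry values mentioned in the episode.
rf_hyperparameter <- list(mtry = c(8, 17, 34, 100))

# expand.grid() accepts a list and returns every combination of values,
# which gives a feel for how many candidate models each approach tunes over.
l2_grid <- expand.grid(l2_hyperparameter)
rf_grid <- expand.grid(rf_hyperparameter)

print(nrow(l2_grid))
print(nrow(rf_grid))
```

Keeping each list in its own model script means a change to the random forest grid never touches the logistic regression settings.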
So these need to be separated by commas. Again, when this R script is loaded by source, the hyperparameter comes in. The other thing that we'd like to say is what the modeling approach is. I'll say approach equals "rf", which is the modeling name for random forest, and then down here where I have method, I can say approach. And then glmnet will be the approach for my genus and genus-fit scripts, so we'll go ahead and copy that over for L2. So now we have our approach, our hyperparameters, and our feature selection encapsulated out of this file.

We now need to update our Makefile to build out our rf_genus Rds files. I'm going to copy this l2_genus Rds rule down here, and I'll replace the l2 with rf, rf, rf. I think the first thing I'll do is rerun the previous logistic regression models I made, and then I'll try it with this random forest set of scripts. So that ran well. I'm now going to change that l2 to rf, and I don't know how long this will take. It might take a while, but we'll let that run, and hopefully everything goes smoothly.

That took a long time to run, but I'm happy to see it made it through without any problems. I will probably save running the rest of the hundred splits until I can move everything up to my high-performance computer and let it run there. That's one of the advantages of working with an HPC: I can run it up there, forget about it as a job, and be notified when the job is completed, and I can use my computer down here for doing better things like, I don't know, editing this video or responding to y'all's very kind comments down below.

As you become more proficient in identifying those duplicated blocks of code, you will also become more proficient in preventing them. But please don't stress out about duplicated blocks of code if this is the first time you've thought about it. Should I be duplicating my code? Should I be repeating myself?
No, you shouldn't be, but the most important thing is to get your code to work. If that means you have the same code chunk five times, cool. That is a place to start from: it works, and it gives you the right answer. But look ahead and realize that it's going to be really hard to maintain or add to as your analysis gets more mature. So identify that you've got those five blocks of code, and as you go along you will find strategies to prevent those duplicated blocks.

In my own code here, I've baked in DRY principles without even thinking much about it. For example, I run this script a hundred times for a hundred different seeds. I obviously don't have the same script copied a hundred times with a different seed in each. So feeding in an argument from the command line is one strategy for keeping the code DRY.

Another way you can help keep your code DRY is through the use of variables. If you attach a data frame to a variable name, then you don't have to repeat this whole pipeline, from composite down through select, every time you want to do something with that data. I have that now saved as srn_data, and then I can use srn_data as input to the different steps of my pipeline. That also helps keep code DRY.

Something else to think about is how you're creating file names. For example, I could have used the glue function to create the name of my output file. That would be something like processed_data/l2_genus_ followed by 1 through 100 and then .Rds. I could have hard-coded that down here at the end. What I decided to do instead is to put all of that back up in my Makefile, so that in the script I don't have to worry about the path or that special file name that tells me, from looking at it, what type of model was created.
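For comparison, the hard-coded alternative mentioned above might look like the sketch below. It uses base R's sprintf() rather than the glue package to stay dependency-free, and the processed_data/l2_genus_<seed>.Rds pattern is assumed from the file names used elsewhere in the episode.

```r
# One name per seed, built inside the script instead of in the Makefile.
seeds <- 1:100
output_files <- sprintf("processed_data/l2_genus_%d.Rds", seeds)

print(head(output_files, 3))
print(length(output_files))
```

The cost of this version is that the path and the model label are now baked into the R script, which is exactly what moving the names into the Makefile avoids.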
So I keep the creation of the file name back in the Makefile, and now I've got the file name as a variable that I can also extract the seed from, because I know that all of my output files will have the same format. Again, that is another way of keeping your code DRY.

A final way that we talked about today is taking the part that varies between different copies of the script and pulling it out into its own R script. That way I can look at each of the three R scripts that I have so far to modify the set of variables that I need to modify to run each type of modeling framework. With these individual R scripts for the different modeling frameworks I'm using, it gets very easy to add different features, change the hyperparameters, or change the approach without having to come back in and change run_split.R.

If you are working within a Makefile environment, one of the other great things about having these separate scripts with the parameters for each modeling framework has to do with dependencies. The file run_split.R is a dependency of everything. If I go in and update run_split.R for logistic regression and I've already run random forest, well, I'm going to have to rerun everything for random forest. But if run_split.R doesn't change across all of these different modeling frameworks, and I have to go in and, say, change the hyperparameters in l2_genus.R, that's not going to trigger make to rebuild the random forest models. It's only going to affect the targets where l2_genus.R is called in the Makefile. So keeping things DRY also makes things more efficient in how they're run.

Something that I will leave for you as a challenge from today's episode is to see if you can create rf_genus_fit.R. Think about what you would need to do. What would the R script look like to build a random forest model using genus and FIT?
Okay, see if you can do it, and let me know down in the comments if you run into any problems.

The second thing that I will give you as an assignment is to look at your latest chunk of code. Do you see elements in there that you are repeating? What strategies could you use in your R code to limit the repetition across your R script? Again, when I'm working with people who are just getting their feet wet in programming with R, I find that they might have an R script with a few hundred lines of code, and there's the same block or the same variable getting repeated many times. Do you see evidence of that? Then go in and see if you can use some of the strategies I've discussed today to dry it out.

So keep practicing with this. Again, I don't expect you to get things perfect the first time. I certainly don't get things perfect the first time, as I've shown you here many, many times. But we want to identify problems, work to solve them, and along the way learn to do things better. Till next time, keep practicing, and we'll see you for another episode of Code Club.