In today's episode, we're going to discuss two concepts: functions, and how to use those functions multiple times with different inputs. The map functions in the purrr package make it easier for us to do these types of operations. We've reached a point in our analysis where we need to repeat a chunk of code a hundred or maybe a thousand times. I'm Pat Schloss, and this is Code Club, where we learn about reproducible practices to improve our data analysis skills. Please be sure to subscribe to the channel and smash that bell icon, so you know when the next episode is released. Over the past few episodes, we've been working in an issue branch where the goal was to calculate the percentage of amplicon sequence variants, or ASVs, that are shared across taxa within the same taxonomic rank. We started out by using all of the genomes in the rrnDB. Then we found a way to perform the analysis by randomly selecting the same number of genomes from each species and then removing those species that didn't have the desired number of genomes. Of course, if we picked five genomes from each species and some species had dozens or even hundreds of genomes, we could get lucky or unlucky in which genomes we picked. Therefore, what we have left for ourselves today is to figure out how to repeat that analysis a large number of times and report the average and confidence interval over those iterations. To do this, we'll need to do a few things. First, we'll need to make a function. We've seen numerous functions over the past episodes. Everything from select to c is a function in R. If there's a string followed by an open parenthesis, it's a function. Well, we can write our own functions too. We will convert what we did in the last episode into a function where we tell it how many genomes we want to sample from each species. The second thing we need to do is to rerun that function many times. We'll do this using a map function, which comes to us from the purrr package.
By the way, have you ever noticed that if you want to have a really successful and awesome R package, it's got to have some gratuitous number of Rs in its name? Anyway, once we've run our function 100 or 1000 times, we'll output the data in a figure and a table like we have in the past few episodes. Over these 100 or 1000 iterations, we'll be able to calculate the average as well as a 95 percent confidence interval. This is going to be great. Even if you're only watching this video to learn more about R and don't know what a 16S rRNA gene is, I'm sure you'll get a lot out of today's episode. There are a lot of places where you might need to use a map function other than analyzing 16S rRNA genes. Please be sure to check out the blog post that accompanies this video, where you'll find instructions on catching up, reference notes, and links to supplemental material. The link to the blog post for today's episode is below in the notes. I've gone ahead and added a comment to my issue, noting that we need to be sure that we repeat the analysis, say, 100 or 1000 times. I think we're going to stick with 100 because this might take a long time to run. Maybe if we're doing it for a paper, if there is a paper in the future, we'll do it a thousand times, but for the exploratory analysis that we're doing, we're going to do it 100 times, I think. We'll be able to output the mean overlap between different taxa as well as the 95 percent confidence interval for each taxonomic rank to give us a sense of how much variation there was. I'm in my issue branch, and I will go ahead and open up RStudio. You'll see that I'm in my project root directory. If I go to Files, I can come down to this Rmd file that we've been working on for the last few episodes and open it up. Again, this is the R Markdown document that we've been working on. You can learn all about it, and how we got to this point, in the previous episodes.
But briefly, what we do is we read in some metadata about the taxonomy of the different genomes in our dataset, as well as the amplicon sequence variants for each genome. We know how many times each amplicon sequence variant appears in each genome. We create metadata_asv, which merges those two data frames together. Then we went about using slice_sample, which involves a random process, so we set the seed for the random number generator to some arbitrary number, which happens to be my birthday. Here we created a threshold. We drew three genomes from each species. Then you can see that the threshold is used here and throughout to calculate and generate our output. I'll go ahead and run all this. I can do that by clicking on those green triangles. That all looks pretty good. You can see the output down here of the table of the overlap at the species level for our four different regions, as well as the plot that looks crummy here, where everything is crushed together on the x-axis. Someone in one of the previous episodes asked about how we can change that. You can change that with a theme setting, but I'm going to skip that for now because when this gets embedded into our markdown document, it's a larger figure, things are more spread out, and we don't have to worry about that so much. I'm trying not to worry about theming and appearances while we're trying to learn these other concepts as we go through our exploratory data analysis. What we want to do is repeat this analysis a hundred times and look at how those lines change over, say, a hundred iterations. Is there much variation? Are we getting lucky or unlucky in which genomes we pick? As I mentioned in the introduction, what we want to do is create what's called a function, and then we're going to use the map functions to iterate it a large number of times. I'm going to go ahead, just for the beginning, and create a new R script, where I will talk about how we create functions.
The general syntax is that we have my_function, which is going to be the name of the function. Somebody did this for select or ggplot or any of these other functions we love to use. We have the assignment arrow to the left, and then the function keyword, so we use the function keyword to define a function. Then inside the parentheses for the function, you list the arguments, right? So argument one, argument two, and so forth. And then we have curly braces. These are the ones that are to the right of the P, above your square brackets. And then inside the curly braces, you have what I call the special sauce, right? This is where the sausage gets made. And then we return the final value, okay? So this is the general idea of a function. And I'm going to go ahead and put this all as a comment, because I don't want to run it. That's the general template. So let's create a function. I'm going to create a function that calculates the square root. Now, there's already a square root function in R, but again, this will just help us to illustrate some of the concepts of creating functions and the map function. So I'll create a function called square_root, right? That's the name of my function. I'll then say function, and then I'll have x as the argument. And I've got my curly braces. I'm going to then have sq_rt, and that is going to be x to the one half; I'll say 0.5. And then we will return sq_rt, okay? So this is a very simple function, you might say. And if I highlight this and run it, I can then do square_root(25), and I get five, right? It does what we want. Now, this is a very simple function. Some might look at this and say, man, you've got a lot of code in here for very little action, right? So instead of all this, I could instead just do x to the 0.5. And if I change it like that and rerun everything, and then come down and do square_root(25), I'll get five.
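What I typed out can be sketched like this — the general template as a comment, and then the square_root example:

```r
# General template for defining a function in R (left as a comment
# because the placeholder names aren't runnable as-is):
# my_function <- function(argument1, argument2) {
#   ...special sauce...
#   return(final_value)
# }

# A simple example: calculate the square root of x.
# R already has sqrt(); this just illustrates the syntax.
square_root <- function(x) {
  sq_rt <- x^0.5
  return(sq_rt)
}

square_root(25) # returns 5
```

The explicit version assigns the result to sq_rt and returns it by name; the one-liner shortcut relies on the function returning the last value it calculates.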
And so what's happening is that a function will return the last value that it calculates, right? So the last value that's calculated is the result of x to the one-half power. And that's the same thing that's going on here. But what I had before and have now commented out is a bit more explicit. And I really encourage you, if you're just getting going, to go with the explicit version over the shortcuts, right? And so I'm gonna try to remember to use the explicit version when we talk about our function. Anytime you change a function, you need to rerun it to reload the function, otherwise the changes won't take. And we can then say 25 or 26, and we get the expected result, okay? So this is how we make a function. Functions can be really complex, they can be really simple, but this is the general idea of how you make a function. There's a lot more to be said about functions, but for now, this will do. Okay, so the next thing we might wanna do is repeat this calculation over a large number of values. And we can do that using the map function. With the map function, we give it a vector of values and the name of a function, right? And so what we could say is, let's give one to 100 as our vector of numbers. And the function that we're gonna give it is square_root. And then in parentheses, we'll put a period. We've seen before, with inner_join, that the period means to take all the information that's flowing in and plop it in where that period is. It's the same idea here. We're gonna take that vector of numbers, and each number we're gonna plop in where that period is and evaluate our square_root function, right? Ah, why isn't it happy? Ah, it's not happy because I screwed up the syntax. There's a funny little wrinkle: we need to include a tilde before the name of our function. Then, when we do that, what we get out is some output that's a little bit weird, right? You may have seen this in your previous work.
I'm gonna reduce this down to five so that the output is a little bit simpler. And what the output is, we see with these double square brackets, is what's called a list. I don't use lists a lot in my work. I generally work in terms of data frames and vectors, but this is a list. The default output from the map function is a list. There are special map functions that will allow us to output vectors as well as data frames, and we'll talk a little bit about those today. So you could do map_dbl, which means the output should be a vector of doubles. And so now when we run this, we get a vector like we'd expect, okay? So many of you are saying, Pat, Pat, Pat, you could have just done square_root(1:5) and gotten the same result. And that is exactly true, but what I'm trying to show here with map_dbl and the map function is how we can take this function and basically repeat it over many values in a vector, okay? So this is the gist of functions and the map function. Now let's go see how we'd apply it to this stuff we've been working on all along. So I'm gonna go ahead and close that R script. The first thing I need to do is convert this code into a function, right? And so the first thing I need to do is name the function. I will say get_subsample_result and the function keyword. And then I'm gonna give it a threshold. The threshold is going to be the parameter that I pass in to my function. And I'm gonna start with an open curly brace, and all this other stuff, that's the special sauce as I called it, down through my line 88 here, before we create the plot. I'm gonna highlight it and hit the tab key so it bumps over, and then I'll include that closing curly brace. And I'm gonna output the overlap data frame, so I will then do return(overlap_data). That's gonna be good. And so now if I highlight this, I've got the function loaded.
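The map syntax I fumbled with looks like this once the tilde is in place — a minimal sketch using the square_root function from earlier:

```r
library(purrr)

square_root <- function(x) {
  sq_rt <- x^0.5
  return(sq_rt)
}

# map() calls the function once per element of the vector; the tilde
# creates an anonymous function, and the period is where each element
# gets plopped in. The result comes back as a list.
map(1:5, ~ square_root(.))

# map_dbl() does the same thing but returns a vector of doubles.
map_dbl(1:5, ~ square_root(.)) # 1.000000 1.414214 1.732051 2.000000 2.236068
```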
And again, my function is called get_subsample_result, and I can use it, right? So I can say get_subsample_result, let's do five. And maybe what I could do is go ahead and pipe that into my plot. So I'll save this here and run this to build the plot, and we see a subtle change where the lines come down a little bit. If I change it to, say, one — I think it takes a few seconds to run — the curve should end a little higher, right? So V4 ends up at about 15 percent, right? Excellent. So we have our function for calculating the data for any threshold we wanna give it. What we now wanna do is run this a large number of times, right? And what we will do is take our function name and use a special map function, map_dfr. It's gonna output a data frame — that's the DF — and the R is for rows. We could do map_dfc, where the C is for columns, to add columns to a data frame; I'm not sure when I've ever used that. map_dfr will add rows. So each output — if I call get_subsample_result with, say, five, then this will output the data frame for a threshold of five, right? And if I run it again, it'll generate another data frame. So what map_dfr is gonna do is take the output of these iterations and concatenate them together row-wise to make a larger data frame. Okay, so we'll do map_dfr. And then for our x, let's say we do one to 100. And then the function is that tilde, remember, get_subsample_result. And I'm not gonna do the period here, because I'm not interested in subsampling from one to 100 genomes. I wanna sample the same number of genomes, say, 100 times. So I'm gonna do threshold = 3, right? And one other thing that occurs to me is that I have no way of knowing which iteration each chunk of rows is from. So what I can do is add the .id parameter and give it the column name for that column of iterations.
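A minimal sketch of what this looks like. The real get_subsample_result() contains the subsampling code from the chunk above; here a toy stand-in just returns a one-row data frame so the map_dfr() call is runnable on its own:

```r
library(purrr)
library(dplyr) # map_dfr() row-binds with dplyr::bind_rows() under the hood

# Toy stand-in for the real function, which subsamples `threshold`
# genomes per species and computes the overlap for each region and rank.
get_subsample_result <- function(threshold) {
  data.frame(region = "v4", rank = "species", overlap = runif(1))
}

# Run the analysis 100 times at a fixed threshold. Because we pass
# threshold = 3 explicitly (no period), the 1:100 vector only controls
# how many times the function runs. map_dfr() concatenates the results
# row-wise, and .id adds a column recording each row's iteration.
subsample_iterations <- map_dfr(1:100,
                                ~ get_subsample_result(threshold = 3),
                                .id = "iterations")
```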
And so this will create a column that has numbers from one to 100, indicating which iteration the data came from, okay? And so I'm gonna call this subsample_iterations, right? Whatever — it's going over the line, and that just kind of gives me a little bit of the chills. So we'll run this, and it's gonna run that threshold operation that we just created as a function 100 times. And it's gonna output it as a larger data frame called subsample_iterations. So we'll run this — it might take a minute or two — and then we'll come back and look at the output. So that ran, and it took maybe about three minutes. Something that occurred to me is that generally when I'm testing things like this and making sure things work right, I wouldn't do the full number of iterations. I might just do three or four iterations to make sure things work, but it's okay. So if we look at the output of subsample_iterations, we see that we now have a column of iterations, and we have a region, rank, and overlap, and that all looks pretty good. What we could do next is take our subsample_iterations and think about how we wanna do the summary data. We wanna group by our region and our rank, and then we're gonna do summarize, and we can take the mean of overlap, right, and then we can do the confidence interval. I like to use the observed data to get my 95 percent confidence interval rather than using something like a standard deviation or standard error. So I'll do lci for the lower confidence interval, and I use the quantile function to calculate that quantile — I give it overlap, and then I say prob = 0.025, so that's the bottom of the interval. Then the upper confidence interval will be quantile of overlap with prob = 0.975. The difference between 0.975 and 0.025 is 0.95, which gives us our 95 percent confidence interval, right? And so if we run all this, we then get our confidence intervals for each rank and each region. Then we remember to drop the groups, and I'm gonna assign this to summary_overlap_data, right?
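The summary step I just described can be sketched like this, with a toy stand-in for subsample_iterations (the column names are taken from the walkthrough; I spell out probs, the quantile() argument's full name, though prob works too through R's partial matching):

```r
library(dplyr)

# Toy stand-in for the subsample_iterations data frame built above
subsample_iterations <- data.frame(
  region = rep(c("v4", "v1-v9"), each = 4),
  rank = "species",
  overlap = c(0.020, 0.030, 0.040, 0.030, 0.010, 0.012, 0.013, 0.011)
)

# Group by region and rank, then summarize with the mean and an
# empirical 95% confidence interval taken from the observed quantiles
summary_overlap_data <- subsample_iterations %>%
  group_by(region, rank) %>%
  summarize(mean_overlap = mean(overlap),
            lci = quantile(overlap, probs = 0.025), # lower bound
            uci = quantile(overlap, probs = 0.975), # upper bound
            .groups = "drop")                       # drop the grouping
```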
And then this summary_overlap_data I can use as input to my ggplot, right? The x is gonna be the rank, the y is gonna be mean_overlap, group by region, color by region. So if I plot this, I get my lines, right? I don't have my confidence intervals yet, and we see — I think we used a threshold of three — that the average for V4 turns out to be about 3 percent, and for V1-V9 it turns out to be like one and a quarter percent. Okay, I'd like to get my confidence intervals on there. So I could do geom_errorbar, and for this, we need to give it a ymin and a ymax. My ymin is gonna be the lci and my ymax is gonna be the uci, and this gives us the hats, so to speak, on our confidence intervals. I don't really like the way those hats look, so what I sometimes use instead of geom_errorbar is geom_linerange, and this gives a plain line showing the confidence interval. And truthfully, there's not a lot of variation in the data there, okay? So again, this is the whole flow for how we would subsample our genomes — in this case, three genomes per species — doing it a hundred times and then making the plot. Again, if we're testing things out, maybe I'll do 10 iterations, and let's try a threshold of one and see what happens. And so again, by running all this — I think I can run the whole chunk, right? Run the current chunk — it will run everything, but using a threshold of one, and regenerate the plot using a threshold of one. If it took three minutes before, I guess maybe it'll take 30 seconds or so this time around. So sure enough, it took about 30 seconds to run, I think. And we see that for one genome per species, we have higher levels, similar to what we had before. And you can barely even see those line ranges for the confidence interval. It's very tight. One thing I'm not really crazy about here is the line range showing up in my legend.
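The plotting code looks roughly like this — a sketch assuming the summary_overlap_data columns built above, with made-up numbers so the block runs on its own:

```r
library(ggplot2)

# Made-up toy data, just to make the sketch self-contained
summary_overlap_data <- data.frame(
  rank = rep(c("species", "genus"), times = 2),
  mean_overlap = c(0.030, 0.100, 0.012, 0.060),
  lci = c(0.028, 0.095, 0.011, 0.057),
  uci = c(0.032, 0.105, 0.013, 0.063),
  region = rep(c("v4", "v1-v9"), each = 2)
)

# geom_linerange() draws the 95% CI as a plain vertical line, without
# the "hats" that geom_errorbar() adds. show.legend = FALSE (discussed
# next) keeps the vertical bar out of the legend key but not the plot.
p <- ggplot(summary_overlap_data,
            aes(x = rank, y = mean_overlap, group = region, color = region)) +
  geom_line() +
  geom_linerange(aes(ymin = lci, ymax = uci), show.legend = FALSE)
```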
And so you can do show.legend = FALSE on geom_linerange, and that way it'll get rid of that vertical bar in our legend, but it's still there on our plot. Okay, so I've been thinking about it, and I think this idea of requiring three or four or five genomes per species is probably not the right idea. I think if we're gonna do this large number of iterations, then why not pick one genome per species, see how often the ASVs for that genome are shared across other species, and then do the iterations a hundred times. I don't think we need to do a thousand, because that confidence interval is so tight. So I'm gonna stick with a threshold of one. I'm gonna do a hundred iterations, and that is what I'm gonna roll with. So again, this will take a few minutes to run. When we come back, we'll talk a little bit more about the output, and we'll then go ahead and render the document and close out the issue. So it ran through — I think it's a little bit slower than using the higher thresholds, probably because it's including more species, since every species present has at least one genome. Two things I wanna point out. First, I wanna make sure that you saw that my setting of the seed was outside of the function. Also, we don't need this threshold = 3 argument anymore. If we had set the seed inside the function, then every time we ran the function, it would start from the same seed for the random number generator, and you'd get the same output every time you ran the function. That's not what we want. We want a different value coming off the random number generator each time we run it, and this is how we set that up. The other thing I noticed was down here at the end, where we output the table: we want summary_overlap_data, not straight-up overlap_data. And so then this gives us our output with the region, the rank, the mean, and the lower and upper confidence limits.
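The seed-placement point is worth a tiny demonstration. This sketch (with a made-up seed) shows why calling set.seed() inside a function defeats the purpose of repeated random sampling:

```r
# Setting the seed INSIDE the function resets the random number
# generator on every call, so every "random" draw is identical:
bad_sample <- function() {
  set.seed(19760620) # made-up seed, just for illustration
  sample(1:1000, 1)
}
bad_sample() == bad_sample() # TRUE -- the same draw every time

# Instead, set the seed ONCE, outside the function. Each call then
# continues along the random number stream and gives a fresh draw,
# while the overall analysis remains reproducible from the seed:
set.seed(19760620)
good_sample <- function() {
  sample(1:1000, 1)
}
```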
And remember, we can say digits = 2 to give two digits to the right of the decimal point. I might even do one to clean up the output a little bit. So that's wonderful. We can change our conclusions. We have now controlled for uneven sampling of different species, so we can get rid of that. Subregions are less specific at the species level — really at every level — than full-length sequences. Still, full length has a 3.6; that was right. The subregions are actually a little bit higher than we saw before. I think those numbers were from using all of the data without controlling for uneven sampling. So, between 10.2 and 15.0 — probably want a hyphen in there for the subregions. Okay, so we'll go ahead and save that. At some point, I will show you how you can code in this number, so instead of having to write 3.6, it would take it directly from our summary_overlap_data, but we'll save that for another episode. So I'll go ahead and save this. I will then quit out of RStudio, and then we will go ahead and close our issue. So I'll do git status. We've updated that. So we'll then do git add on that exploratory 2020-10-21 Rmd file. And that's good. And git commit. We'll say "randomize subsampling 100 times using threshold of one, closes #31". You'll recall that this is an issue we've been working on for a few episodes now. I'll do git checkout master, then git merge the issue 31 branch. And because we've now updated our timestamps, we're gonna need to go ahead and rebuild this markdown file. So I'll go ahead and run make on that. This is gonna take a few minutes, so I'll go ahead and run it. And that actually says it's up to date, which I find very surprising — I think it's because the merge brought over the version that I had on the branch. So I'm gonna remove it and then force make to remake it, and that'll go. And while this runs — it'll take a few minutes — I'll once again talk to you about what we've done today.
We talked about making functions and using the map set of functions to iterate our function over multiple values, or to repeat our analysis multiple times as we did with our data. Now, there are many ways that you can use the map function. One place I've used it with microbiome data is where I want to do, say, a Wilcoxon test for every OTU. I can then map over all of my OTUs, running that Kruskal-Wallis or Wilcoxon test or whatever you wanna do. That requires a little bit more sophistication in our use of map that frankly always trips me up, but you can use the map function for it. Back in the bad old days, before we had these map functions from purrr, I would use things like the apply functions, or I would use for loops in R. For loops have kind of a bad rap. They tend to be much slower because you're effectively using R to do the loops, whereas the purrr functions, like the map functions, are written in C. So they run about as fast as you can get them. If you can use those map functions, you're gonna be running the analysis about as fast as you can. That being said, there are probably places within the function that we created today where we could have sped things up a little bit. At the same time, as I've talked about in previous episodes, there's a trade-off between developer time and execution time. I could spend maybe another hour trying to optimize this to make it faster, but it's probably only gonna save me a total of like five minutes of my life. So I'd be spending an hour to save five minutes, and it's probably not worth it, right? So again, these are all things that you can think about and places where you might optimize. But again, the focus of today was on developing functions and then using the map functions to iterate a function over multiple values, or to repeat that same function many times with different values coming off the random number generator. So think about how you might use the map function in your work.
If you see a for loop somewhere in the analysis that you're doing, think: could that for loop perhaps be converted to a map function? Perhaps the for loop works just great for you, and so it would be a great opportunity to practice using the map function, because you know what the result should be, and you can think about how to convert that for loop into a map function to perhaps be a little bit more efficient in your analysis. So once this is done, I'll go ahead and push it up to GitHub. Please be sure you keep practicing and engaging with this material. If this is the first episode you've seen, I hope you've enjoyed it. I really appreciate all the feedback people have given, as well as all the views and the thumbs up the videos have been getting. Feel free to go back and look at the older videos, which will perhaps give this one a little bit more context. In each of these videos, I try to go over one or two concepts to move this general analysis along. So like I said, keep practicing, and we'll see you next time for another episode of Code Club.