I can't tell you how many manuscripts I've written where I've sat down, writing along pleasantly, feeling like, ah, we've got this, we're so ready, we're going to submit this in a week. And then I get midway through the results section and think, crap, I don't have all the data I need. I should do this other type of analysis, or an analysis I did before had some problem in it, or I need to look at the data another way. Well, that's the point we've reached in our manuscript.

Hey folks, I'm Pat Schloss, and this is Code Club. In each episode of Code Club, I try to apply principles of reproducible research to an interesting biological question. As I mentioned in the intro, we've gotten to a point in the manuscript where we realized we forgot to do something. You'll recall, if you've seen the previous episodes, that back in November and December we were doing some exploratory data analyses where we would draw one genome from each species. We did that for a single iteration. But what would happen if we got lucky or unlucky in which genome we grabbed from each species? We need to repeat that 100 times and then report the median or the average of those 100 iterations. We touched on this in the last episode, where we talked about where we put our heavy lifting. We have another point today where we're going to do some heavy lifting in an R script, so that we can look at the propensity to lump ASVs from multiple species into the same OTU, or to split the ASVs from a single genome into multiple bins. If all this OTU and ASV talk doesn't make sense to you, don't worry; stick around and you'll see some cool applications of R.

Another thing I want to point out is that some of you have caught on to the method to my madness, so to speak. What I try to do as we work through this project is to show you all of the tools I use as I use them. And you're realizing that it's not just straight R: we're using bash, we're using make, we're using version control, we're using R, and we're tying it all together to do a cool project. Even with a project as simple as this one, you're still using a lot of tools and bringing them together. The other thing I want you to know is that, educationally, if you see the same concepts presented in different contexts, you'll learn the material that much better. When you see something in a new context, your brain says, I've seen this before, I think I know what I'm supposed to do, and then it tests that. You form a hypothesis, you test it, or you see me test it for you, and you make a prediction about what will happen. That process reinforces what you know, and you learn the material that much better. So if some of this material seems a bit redundant from episode to episode, well, first of all, this is real life; I'm writing a real paper, not making stuff up. And second, yes, that's part of the design of how I'm trying to teach people to program: I want it to be integrated across multiple tools, and I want it to feel redundant, because if it feels redundant, that means you're learning the stuff and you know what's going on. So good job, good review. All right, let's get back to our project.
We need to look at the propensity of lumping ASVs from multiple species into the same OTU, or of splitting a genome into multiple different bins. I'm at my shell window; I'm going to change to my project root directory and fire up RStudio. The other thing I'm going to do is return to my GitHub page and go to the exploratory directory. The analysis we're interested in from this exploratory data analysis is the lumping and splitting, so I'm going to open that up. You'll recall that this was where I presented faceting and how you can make faceted plots: we had these plots down here showing the different regions of the 16S gene, the different thresholds we used, and whether we were lumping species together into the same OTU or ASV, or splitting genomes apart, for different definitions of an ASV or an OTU.

What I'd like to do is steal some of this code, and we'll start by grabbing it like we did in the previous episode. I'm going to create a new R script file and paste that in. I'll go ahead and remove the here and knitr lines, and I'll keep my set.seed there for now. Then I'm going to grab this next block of code, which is what we used to grab one genome per species. We'll read in our metadata and the ASV data; I'm going to call this asv instead of easv. This mutate line I'm also going to put in here, where we change the esv threshold to zero, since an exact sequence variant corresponds to a distance of zero. And so we can read all this in. It can't find here(), so I need to remove that call too. Again, we're writing these paths relative to the project root directory. I suppose here() would probably work in this R script, but it's more useful in an R Markdown document. I'm going to be running my R script from the project root directory using make, so I don't need to worry about where I am in the directory tree the way we do with an R Markdown document. So I'm going to go ahead and put this into code/get_lump_split_rates.R, and we've got our script.

Again, what we're going to do is something we saw part of in the last episode, but we'll review it here. We're going to take the metadata, group it by species, grab one genome from each species, ungroup that, and then feed it into an inner_join with asv. If we run all that... I've got an extra pipe at the end, so I get that continuation plus sign in the console. If I hit Escape, I get back to a normal prompt, and I can get rid of that pipe. Now "metadata not found"; you know what, I probably forgot to run everything, so we'll do that, and we're good. And so we've got one genome per species, and we can see an example here where we've got one species but two ASVs for the v19 (full-length) region.
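To make that sampling step concrete, here is a minimal sketch of what the one-genome-per-species draw and join might look like. The file paths and the column names (species, genome_id, asv) are my stand-ins for illustration, not necessarily the project's actual names.

```r
# Minimal sketch of the "one genome per species" draw described above.
# Paths and column names are hypothetical stand-ins.
library(tidyverse)

set.seed(19760620)

metadata <- read_tsv("data/processed/genome_metadata.tsv")    # hypothetical path
asv <- read_tsv("data/processed/asv_assignments.tsv") %>%     # hypothetical path
  mutate(threshold = if_else(threshold == "esv", "0", threshold),  # ASVs = 0% distance
         threshold = as.numeric(threshold))

one_per_species <- metadata %>%
  group_by(species) %>%       # one group per named species
  slice_sample(n = 1) %>%     # keep a single randomly chosen genome per species
  ungroup() %>%
  inner_join(asv, by = "genome_id")  # attach the ASV/OTU assignments for those genomes
```

Each time the slice_sample() line reruns, a species with more than one genome can contribute a different genome, which is exactly the variation we want to average over.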
Okay, let's go back to our exploratory data analysis. We had splitting data and lumping data, and then a lumping-splitting data frame where we pulled it all together. I'm going to grab all of that code, and that piece too, paste it into my R script, and make sure it all works. Let's start with the splitting data. For now I'm going to remove the comments because they're making it harder to read what's going on; comments are good, comments are your friend, but here they're in the way. So we group by the region, the threshold, and the genome ID.

To measure splitting, we look at the number of ASVs for each genome; if that number is greater than one, we flag the genome as split. Then we group by the region and the threshold and do a summarize with f_split, which is the sum across the cases where the genome is split, divided by n(), a function that counts the number of rows in the group. I meant to run the whole code chunk; this might take a couple of seconds. And "esv is not found", yes, because I renamed that to asv. I'll fix that, run it again, give it a moment, and we should get the right result back. And we do: we've got our region, our threshold, and the fraction of genomes that are split into multiple bins at each threshold. As you increase the threshold, the likelihood of splitting a genome into multiple bins drops, which is what we see, and that is good.

I'm going to make this part of a function that I'll call get_rate_of_splitting; it doesn't need to take any arguments. We can indent the body and have it output this data frame. Now it's upset about something; what's it upset about? I forgot my opening curly brace up here; I must have accidentally deleted it, so let me fix that. We'll load that, and if I call get_rate_of_splitting(), that works. If I look at a threshold of 2%, the fraction split is 0.0501; if I run it again, I get 0.0503. So you do see a little bit of variation depending on which genomes are sampled from each species. Good, so we've got our rate-of-splitting function that we'll call in a moment.

Let's move to the degree of lumping. This is going to be very similar, except that instead of starting from that joined metadata/easv data frame, we'll pipe the output of our inner_join straight into the group_by. The code for measuring lumping and splitting is pretty similar, except that instead of grouping by genome, we group by the ASVs. Okay, that seems good, and it ends here. Let's run this code block and see that we get the results we expect. Sure enough, we do; it only shows the first 10 rows, but there are 104 rows total. And as you increase the threshold, you increase the lumping, which is what we see and is good. Let's turn this into a function too: the first one was get_rate_of_splitting, this one we'll call get_rate_of_lumping, and we wrap the code block in curly braces to make it a function. We can double-check that it works by calling get_rate_of_lumping(). This will take a moment, and we can compare the output to what we got before: here, v19 at a 2% threshold, we got 19.5% of the species lumped by a single OTU, and now we see 0.197. So again, the effects of randomization, which is why we need to do this a bunch of times.

Now we can call both of these using the map_dfr() function: we give it 1:3 and ~get_rate_of_splitting(), and I'll assign that to rate_of_splitting. Then we do the same thing for rate_of_lumping with get_rate_of_lumping().
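Here is roughly how I picture those two helpers and the map_dfr() call, as a sketch that continues from the block above. The exact grouping and column names are my assumptions, and the lumping helper is one plausible reading of the metric the manuscript describes (the fraction of species that share an ASV/OTU with at least one other species).

```r
# Sketch: wrap the calculations in argument-free functions so each call
# redraws one genome per species and recomputes the rates.
get_rate_of_splitting <- function() {
  metadata %>%
    group_by(species) %>%
    slice_sample(n = 1) %>%                  # fresh random genome per species on every call
    ungroup() %>%
    inner_join(asv, by = "genome_id") %>%
    group_by(region, threshold, genome_id) %>%
    summarize(n_asvs = n_distinct(asv), .groups = "drop") %>%
    mutate(is_split = n_asvs > 1) %>%        # "split" = one genome yields >1 ASV/OTU
    group_by(region, threshold) %>%
    summarize(f_split = sum(is_split) / n(), .groups = "drop")
}

get_rate_of_lumping <- function() {
  metadata %>%
    group_by(species) %>%
    slice_sample(n = 1) %>%
    ungroup() %>%
    inner_join(asv, by = "genome_id") %>%
    group_by(region, threshold, asv) %>%
    mutate(n_species = n_distinct(species)) %>%          # species sharing this ASV/OTU
    group_by(region, threshold, species) %>%
    summarize(is_lumped = any(n_species > 1), .groups = "drop") %>%
    group_by(region, threshold) %>%
    summarize(f_lumped = sum(is_lumped) / n(), .groups = "drop")
}

rate_of_splitting <- map_dfr(1:3, ~ get_rate_of_splitting())  # three quick test iterations
rate_of_lumping  <- map_dfr(1:3, ~ get_rate_of_lumping())
```

Because map_dfr() row-binds the result of each call, the output has one block of region/threshold rows per iteration, which is what we summarize next.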
This will take a couple of moments, maybe a minute or two, to run both of those lines. Once it's finished, we'll come back and group by our region and our threshold to get the median f_lumped or f_split, as well as the IQR, to see how much variation there was in the data. I'm a little impatient, so I'll go ahead and start writing that out while it runs. We'll do a group_by(region, threshold) and pipe that into summarize, where we'll say split_rate = median(f_split), and let's also do split_iqr = IQR(f_split), and then .groups = "drop". That will be good. It should be the same type of thing for lumping, so I'll put that in with f_lumped. Let me double-check the output: what did I call it in rate_of_lumping? f_lumped, yes, I knew it was f_lumped. Bam, bam, bam. Let me tack this summarize onto rate_of_splitting to make sure we get good output. Sure enough, we see the split rate for those three reps, and for all these values the IQRs are really small. You might think about plotting a confidence interval, but it's so small that it would be thinner than the line itself, so I'm not really concerned about the IQR values. Okay, so we get our rate of splitting and rate of lumping; those should work, so let's run them.

The next thing to worry about is joining those data frames together into the lumping-splitting data. I called it splitting data before, but now it's rate_of_splitting and rate_of_lumping, and we'll join those two data frames together. One thing I want to point out: normally, if we're joining two data frames on a single column, we would say something like by = "region". We did something like that up above with the inner_join; it's a little different there because the column names differ between the two data frames, but it's the same idea. Here there's no equals sign inside the by, because we're not matching a column with one name in one data frame to a column with a different name in the other. Instead, we're joining the two data frames on both columns, region and threshold, so the number of rows in the output should be the same as in the inputs. Okay, that's done. Good. I just ran out of things to say, so let's run this inner_join and see what it looks like.

Again, we have 104 rows: region, threshold, the split rate, split IQR, split rate, split IQR. And it occurs to me that I named the second pair split_rate when it should have been lump_rate, so I need to fix that and run it again; this one also needs to be lump_iqr. Let that run. Something I want to point out while it does: you noticed that I had the same column names in both data frames. What inner_join does in that case is append a .x or a .y when the same column appears in both data frames but you aren't joining on it: .x marks the columns from the left data frame and .y the ones from the right. Okay, that finished; let's see what our inner_join looks like and make sure it's good. We've got region, threshold, split_rate, split_iqr, lump_rate, lump_iqr. We're in good shape.

We could then output this. I don't need the pivot_longer: we used pivot_longer before to make the data tidy for generating the faceted plot, and I'm not concerned about doing that right now.
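As a sketch, the summarizing and joining we just walked through would look something like this, continuing with the object and column names used above; IQR() is base R's interquartile range function.

```r
# Collapse the replicate iterations to a median and IQR per region/threshold,
# then join the splitting and lumping summaries on both columns.
lump_split <- inner_join(
  rate_of_splitting %>%
    group_by(region, threshold) %>%
    summarize(split_rate = median(f_split),
              split_iqr  = IQR(f_split),
              .groups = "drop"),
  rate_of_lumping %>%
    group_by(region, threshold) %>%
    summarize(lump_rate = median(f_lumped),
              lump_iqr  = IQR(f_lumped),
              .groups = "drop"),
  by = c("region", "threshold")   # join on both columns, so no .x/.y suffixes appear
)
```

Because each summary has exactly one row per region/threshold combination, the joined data frame keeps the same number of rows as its inputs (104 here).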
In fact, I don't even need that, and I don't need to assign it to a variable; what I do need to do is write it out to a file. So I'll do write_tsv() and put it in data/processed, and I'm going to call the file lumped_split_rate.tsv. That will be good. Let me make sure this runs and does what I want: we can do nano data/processed/lumped_split_rate.tsv, and we see our values there. We're in good shape, so we can output that.

Now we're ready to scale it up. I didn't want to test this with 100 iterations, because that would take a long time, so instead we tested with three. I'm going to bump it up to 100. It was pretty quick to run, but it could still be faster, so we'll use the furrr package, which gives us future_map_dfr(). We need library(furrr), we need to get rid of that set.seed(), and we also need to say plan(), and I think it's plan(multicore). I don't use this package a whole lot, so I always have to look that up; it's perfectly fine to forget things, the key is to remember where to find them. So I'll look at the other script where I used furrr before, and you'll see I had library(furrr) there and, down towards the bottom, plan(multicore). Was that what I had? Yes; see, I'm learning. The more I use it, the better my hypothesis gets, and it was indeed plan(multicore). I also know that in here I need to say .options and something about the seed, but I forget exactly what. Where was that? Down here: .options = furrr_options(seed = 19760620), my birthday. I can put this all on multiple lines so it doesn't scroll off the right side of the screen, and that all looks good. I've got an extra equals sign in there, which I don't like, so I'll fix that and save. We do the same thing down below for get_rate_of_lumping, adding .options = furrr_options(seed = 19760620). It needs the seed because when it sends the work out to different processors, it needs to know what the seed is on each of those processors. If you did it the way we had it before, with set.seed() and no furrr_options(), it would give you a warning message, and I don't like warning messages. Okay, so you know what, I'm going to set this back to three iterations to make sure everything works fine, and save that.
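Here is a minimal sketch of that furrr swap. The seed is the one mentioned in the episode, and plan(multicore) assumes a Mac or Linux machine where forked workers are available; everything else mirrors the serial map_dfr() version.

```r
# Run the iterations in parallel with furrr instead of purrr's map_dfr().
library(furrr)

plan(multicore)   # one forked R worker per available core

rate_of_splitting <- future_map_dfr(
  1:100, ~ get_rate_of_splitting(),
  .options = furrr_options(seed = 19760620)   # parallel-safe, reproducible RNG
)

rate_of_lumping <- future_map_dfr(
  1:100, ~ get_rate_of_lumping(),
  .options = furrr_options(seed = 19760620)
)
```

While testing, you can drop 1:100 back down to 1:3 so the script finishes quickly, which is what I'm doing here before scaling it up.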
The other thing I'm going to do is open up my Makefile, so I'll do atom . and come toward the end of the file, where I have... where did I put it? Yes, the data/processed targets. What did I call that file? lumped_split_rate.tsv. The recipe runs code/get_lump_split_rates.R, and then we'll put in the dependencies, which are the same as for this rule above, so I can copy and paste them down. We should be in good shape, and now I can test this with three iterations by running make with that file name as the target.

I forgot to put in the shebang line, and I forgot to make the script executable. So I'll do chmod +x on code/get_lump_split_rates.R, and I need the shebang line, which I sometimes forget; it's #!/usr/bin/env Rscript --vanilla. What this line does is that bash sees it and fires up Rscript, which runs R, with the --vanilla flag, so it doesn't read any external startup files when launching R; it only uses what's in this file. Some people have their R environment set up so that all sorts of other things happen at startup; --vanilla ignores all of that, which makes it the most reproducible and portable way to work with these R scripts. So let's try that again. I'm noticing that I think I used the wrong automatic variable in the recipe. It doesn't really matter, because I'm not parsing out the individual arguments, but I'll go ahead and make it a $< here, and I can use $< in this other spot too. The $< automatic variable gives you the first prerequisite; the $^ gives you all of them.

Okay, that ran nicely. Let me look at data/processed and make sure I've got my lumped_split_rate.tsv. I've got my values here; they all look good, everything looks nice. Now we can scale it up, so I'll come back down, change both calls to 1:100, and save that. We'll fire this up and let it run; it might take a minute or two, and then we'll be able to insert this into our R Markdown document. Great, that finished. Again, we can do nano data/processed/lumped_split_rate.tsv and see our different regions and thresholds. It all looks good: as we increase the threshold, the split rate goes down, and the lump rate goes up, which is what we see, and our IQR values are quite low. I need to add this lumped_split_rate.tsv file as a dependency for the R Markdown document that we're rendering, and then we're ready to incorporate it into our manuscript.
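Before we move over to the manuscript, here is a tiny stand-alone script illustrating the shebang mechanics we just talked about; the message() line is only there so the file does something visible when you run it.

```r
#!/usr/bin/env Rscript --vanilla

# `env` finds Rscript on the PATH, and --vanilla keeps R from reading
# .Rprofile, .Renviron, or a saved workspace, so nothing outside this file
# affects the run. After `chmod +x`, make (or the shell) can execute the
# script directly, and because make runs from the project root, relative
# paths like data/processed/... resolve the same way every time.
message("working directory: ", getwd())
```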
So I'm going to go back to my submission directory and my manuscript.Rmd file. This middle paragraph, which we were in for the last episode, talks about what threshold you would need to get one OTU or ASV from each genome or each species. We did it for seven copies, but we still have this "XX percent" at the highest level of resolution, and I'm not sure what I meant by that: XX percent of the species shared a 16S rRNA gene sequence variant with another species. I think what I want to say is that at these thresholds, you got one copy 95% of the time... you know what, no, I don't, because this has different thresholds for different numbers of copies. So maybe what I should say is that at the highest level of resolution, meaning there's no difference between the sequences going into a bin, so they are true ASVs: even when considering ASVs, XX percent of species shared an ASV. Maybe that's what I meant; I'll worry about the wordsmithing later when I'm editing. The point is, at the level of ASVs, I want to know what percentage of the species shared a 16S sequence variant with another species. And then maybe I can say that at the commonly used 3% threshold, XX percent of the species shared an OTU. Good.

So we need to insert that information. We can say lump_split <- read_tsv() on that file, and as we learned last time, we want to make sure we set the column types. But I forgot that I need to rerun everything from the previous code chunks, so we'll get that going first. I think the region column needs to be read as character, basically what we had up above, so I'll do col_types = cols(region = col_character(), .default = col_double()). Now let's look at lump_split. By the way, if you're my age, when you were in college there was a song called "Lump" by The Presidents of the United States of America, so whenever I see "lump" in this study, that song plays in my head. Shows my age, huh?

All right. If we do lump_split %>% filter(region == 0)... nothing. Oh, not region, I meant threshold. There we go. Also, the sentence I put in doesn't allow for multiple regions, so maybe what I should say is: XX percent of the species shared an ASV when considering full-length gene sequences, and XX, XX, and XX percent when considering the V4, V3-V4, and V4-V5 regions, respectively. Good. Now, to fill those in, we want the lump rate at this level, so we can filter for threshold == 0 and region == "v19", pull(lump_rate), and multiply that by 100. Again, I use the console to help myself develop these variables. This is going to be lump_v19. Actually, to keep each of these as a one-liner, I'm going to do the filter(threshold == 0) up above so I don't need to include it down here. Then I can do the same for the other regions: v4, v34, and v45. We can put those down into the text, wrapped in format_pct(), and I'll copy that inline code for the three XXs down here and update the text so I'm not reporting v19 for everything, but v4, v34, and v45. Good.

Next sentence: at the commonly used 3% threshold, XX percent of the species shared an OTU when considering full-length sequences. I'll copy this sentence down; bam, bam, bam. We might later decide to clean up that text a bit, but it's okay for now. I'll repeat the filtering for the 3% cutoff and call these the lump3 versions. This is where the code gets kind of messy, but that's okay. I'll run these three lines, and it's unhappy for some reason. Oh, that's because I modified that. Okay, it's still unhappy; lump_split is there... oh, I misspelled threshold. That's one way to save characters: just don't type them all. If I do lump3_v19, I get 3.58. Good. Now I'll replace the values in that last sentence with the lump3 versions, and this inline chunk will, of course, be the same thing but with v19. Okay, good.

I'll go ahead and make the manuscript and see what this looks like. While this is running, please be sure to like the video. It helps other people find this content, and I want more people to find it because I think this information is useful and has value for helping to make all of our research more reproducible. We'll take a look at the manuscript, and we should have all the values in our results section filled in now. We've been working today in this middle section, and we see good stuff. Except it feels like I put in the wrong number here, because I've got the same value there as there. Let's see; my lump3_v19 code is wrong, it isn't filtering on 0.03. Okay.
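Pulling that together, the inline-value chunk in the R Markdown document might look roughly like this. The format_pct() helper is the one mentioned above, and the variable names, file path, and the 0.03 filter are my assumptions about how the fixed version ends up.

```r
# Read the summarized rates with explicit column types, then pull out the
# handful of numbers the results paragraph needs.
library(tidyverse)   # already loaded earlier in the R Markdown document

lump_split <- read_tsv("data/processed/lumped_split_rate.tsv",
                       col_types = cols(region = col_character(),
                                        .default = col_double()))

asv_lumped <- lump_split %>% filter(threshold == 0)   # exact ASVs (0% distance)

lump_v19 <- 100 * (asv_lumped %>% filter(region == "v19") %>% pull(lump_rate))
lump_v4  <- 100 * (asv_lumped %>% filter(region == "v4")  %>% pull(lump_rate))

lump3_v19 <- 100 * (lump_split %>%
                      filter(threshold == 0.03, region == "v19") %>%
                      pull(lump_rate))                # the commonly used 3% OTU cutoff

# In the manuscript text these appear as inline chunks, e.g. `r format_pct(lump_v19)`.
```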
So we'll re-render that, and we'll be in good shape. Oh, I didn't actually run it, though. So, again, thank you for spending time with me today to go through this process of taking a step in the analysis that we had done one time and replicating it 100 times, so that we can account for the variation due to which genome is randomly selected from each species. This is another example of doing the heavy lifting in an R script outside of the R Markdown document. When we started using R Markdown documents to write our papers, we put all the heavy lifting in the R Markdown document. That was a royal pain, because if we wanted to edit a word, bold or italicize something, or make any little modification, we had to rerun all of that heavy-lifting code, and none of its output actually changed. So why not encapsulate it in a separate file, like we are doing here? As I've mentioned before, the other advantage of moving these really intense computations outside of the R Markdown document is that their output can then be used for other tasks, like building plots of the data. I'm pretty sure we'll see the data we generated today again when we go about making plots showing the rates of lumping and splitting. Anyway, this is all in good shape. After I sign off with you, I'll go ahead and commit and push my changes to GitHub. Please tell your friends about Code Club, and keep practicing these skills. You see the same types of things, group_by, summarize, inner_join, read_tsv, write_tsv, over and over again, and the beauty of this is that we see them in different contexts. They're so versatile, and that versatility, the ability to use them and see them in different contexts, is really going to help you learn the material that much better. All right, I'll let you go. Have a great day, and we'll see you next time for another episode of Code Club.