 Hey folks, my name is Pat Schloss and I'm the host of Code Club over the past couple dozen episodes I don't know how many at this point. We've been working through a project looking at exact and Amplicon sequence variants within the field of microbial ecology and how they're used I'm using this as a platform to talk about different practices in reproducible research We've talked about a lot of things like using the command line using make using version control using our using literate documents like our markdown and throwing it all together to look at this question of what threshold Should we be using to bin sequences together as a unit of inference for microbial ecology? Now again, this might sound like it's way far too into the weeds for microbiologists That you know if you're outside of microbiology, you shouldn't be paying attention But trust me you'll learn a lot as you watch about how we can use Reproducible practices to go through a data analysis plan and of course if you're a microbiologist who's interested in microbial ecology You might learn something about Whether or not we should be using exact sequence variants or Amplicon sequence variants So the goal of today's episode is to determine the threshold that we might want to use to get one ASV per genome and we could use exact sequence variants So where an ASV where the radius is zero Or diameter is zero or an ASV where perhaps the diameter is five percent five percent variation between any of the sequences within the ASV So you'll recall that in the last couple of episodes. We've talked about how we could build out those ASVs And today we're going to take those on and again what i'm thinking about creating is a plot where the y-axis Will be the number of ASVs and the x-axis will be the threshold And then each line would represent the number of ASVs that we see per genome For those different thresholds and then we could do some faceting by the region So we don't get overwhelmed with lines on the plot And the line that I want to draw isn't because we've got you know thousands of genomes in here Is I don't want to show the average What I want to show is the 95 confidence interval or the 95th percentile And so, you know if I see a threshold of say 0.03 and the number of ASVs is Two then that means that 95 percent of the genomes had Two or fewer ASVs per genome. Okay That's another way to summarize the data what we're going to see in the next few episodes really Are different strategies for thinking about all these variables. We have so let's get into the R There's really nothing new here except Thinking through the problem and seeing a lot of the commands we've seen in previous episodes In different contexts. So I've got my issue here issue number 35 I'm going to come over to my terminal get check out Our get issue, you know, actually, let me show you this other way We can check out a branch get check out hyphen b issue 35 This then does both create the branch and then checks it out So we've been doing get branch issue 35 and then get check out issue 35 get check out hyphen b Makes the branch for you. So it's a little bit more convenient. Um, I think I do both Both approaches and I don't know why I pick one over the other. Uh, you'll see it's red And that's because I've already put in a template for today's episode making this as an exploratory file with an r markdown document Before I forget, let me go ahead and open this Or create the rule for this in my make file. So I'll do add them period So here I am in my make file and I will go ahead and add it here. Um, so I think I copied it, but forget Um, tab that in and I want md and I need to add the backslash. So now my make file is good to go Um, I'll go ahead and close out Um, adam and I'll fire up our studio And so if I open up my new, um document The title I've given it is determine the threshold needed to get one asv per genome should update my date here The date being the 24th it's Thanksgiving week and I want my chunk output to go to the console as I'm developing this because If you've watched the other episodes, you know, I really struggle with this And what I've roughed in here are The code from previous episodes to read in the metadata as well as the easv data So we'll come back to this And so this is the overview that I've copied from the github issue And I want to start thinking about how we're going to do this, right? So I'll put in my code chunk and Trying to think through the problem would be to get a one genome per Um species, so we'll have to figure out how to do that. We've done that in previous episodes We then will want to, um aggregate data by Species or by genome Because they'll be the same thing for our purposes And we will then want to count The number of Easvs per genome Four so we want to aggregate by by species genome By the region, right? So v4 v35 v19 whatever and by threshold I think that's it and then we want to we want to count the number of easvs per Per grouping, right? And then we'd like to plot the number of um or determine the 90th or 95th Uh percentile Uh for each grouping So each um for each region And threshold And then we want to plot The 95th percentile Um as a function of um the region And threshold Okay, uh, so let me make sure it covers my issues. So create a line plot of 95th percentile for each threshold Um, um at a different number of operands per genome Okay, so that's the other thing that we have to remember Is that these genomes have anywhere between like one and 25 operands So I want to add to this also number of operands The rn operand per genome So we want to count that and then Determine the 95th percentile for each region threshold And a number of rn's Right um and here again Number of rn's And then another plot that we'd like to make um, so plot the Um, and maybe I'll indent this to give my Comments some organization. The other thing I want to plot is the 95th percentile um For the number of um Let's say e asv's For um those genomes with let's say seven copies um of the Of the rn operand All right And again one of the struggles that we have is that we have too many variables, right? Um, we have the region we have We have the we have the species right and so we're kind of aggregating across all species We have number of copies per genome or per species We have the threshold that we're interested in and then um Yeah, and so we've got these three or four variables And one strategy that we're going to look at in this case is this 95th percentile And so this will allow us to aggregate across all of the genomes And because some of our genomes have um Some of our species have far more genomes than others like the coli had like 900 or something And others might only have one we want a balance for that So that's that's one thing that we're going to do is this percentile And i'm not actually sure that 95th is what we want. That might be too stringent. It might not be stringent enough We don't know. Um In some ways, it's also good to know what it what it is, right? What percentile should we be using if we want to get um a You know one a e asv e asv or whatever you want to call it um per region per threshold um And and so that's also useful information to know All right, so we'll start with these two uh plots and of course in future episodes We'll we'll kick this issue around a little bit more to get better insight. Okay, so this is the game plan i'm going to come back up to My code here i'm going to go ahead and start by loading my libraries And as that's loading i'm going to think about What do I want to modify from our previous code to get it to work? Um, and you know what if I run these two lines, which is reading in my metadata I see that I get the genome ID and all this taxonomic information What i'm really interested in however um, I don't want to change any of that stuff is that i'm going to do a select um and Let's see i'm going to select to get um the genome ID And the species okay, and i'm going to So this then will get us our two columns i'm going to then Group by uh the species And as we saw a few episodes ago We can then do slice sample n equals one that will then give us One genome per species And we can then ungroup That and this should all work and because i'm using that random um slice slice sample. I need to set my seed This is fairly self. I don't know Differential or self-serving. I don't know what you call it, but I use my birth date 1976 06 20 june 20th 1976 um Maybe use your birthday. Maybe use one whatever so if I run that and then generate my metadata That then gives me uh 4774 rows, which is the number of species in the data set and I've got my genome ID and my species What I can use then is I can then as we see down here We can do our inner join to join those together. So we now turn to easv And let me make sure that this is Everything I want So we'll read in The easv column the genome the count the region the threshold. That's all good I'm going to remove these filters and selects because I want all that data. I'm not only looking at ESVs in this episode. So I'll go ahead and save this to easv We can then do our metadata. Ah You update the variable name run that and now we have metadata That should be es easv, but Make sure it looks good. And again, we've got our genome ID species The easv for that species the number of time that easv shows up in that species for that region and for that threshold All right. Um, and so this again needs to be metadata easv So one thing that you'll notice is that my threshold column is a character And that's because I've got esv's in here as well as numerical values Um, perhaps to clean that up. What I can do is I can do a mutate um on threshold And I can then say um What I'm going to do a recode So recode and we'll say, um 0.001 Sorry zero 0.000 Um Equals esv I think that works. Actually, you know what I think recode is the opposite of what You expect it to be based on the rename function. So let's do this. Um, and Run all that and see if it works Um problem with mutate input threshold argument dot x is missing. Ah, yeah, so I want to add threshold That and that worked and we now see that our threshold is 0.000 And again the syntax for recode is the opposite of rename. Um, That's kind of annoying, but whatever. Um, we still have this as a character type And what I'll do is Then add to this mutate where I'll then say threshold equals as dot numeric and then inside that threshold We could have done as dot numeric parentheses recode and all that kind of nested but It gets kind of hard to read and so now if we look at metadata esv We should see that this threshold column is a double. So we're good to go. That'll be very helpful for plotting Our threshold across the x axis as a continuous variable. Okay, so we've got metadata esv and We've already done the get one genome per species part. Maybe I'll pull this back up here to where I did that So right here And aggregate the data by species genome region threshold all this stuff So metadata easv and we're going to pipe that into this And then we'll do group by and we're going to group by Our region our threshold And our what? genome ID We could also leave in the species, but That's the basically the same thing as genome ID. I don't know that my analysis really cares what the species name is So, um, you know, maybe we'll Yeah, I don't care. I'm going to use genome ID And I think that's all I want. I don't actually want I need to figure out the number of rns per genome because I don't have a column for that So I still need to figure that out. So count the number of asv's per grouping and the number Of rns Per genome. Okay, so that's actually a secondary thing So we'll group that right And now we're going to count and so we'll do summarize And we will summarize the number of rns Per genome and that's going to be some of count right? um And our number of easv is going to be the end function. Okay, and so we run all this We will see that takes a moment or two And we see the region the threshold the genome ID The number of rns and the number of easv's. Okay. I want to go ahead and undo my grouping Um, so I'll do the dot groups drop again, we run that and the output here takes a moment And we see the region threshold genome ID And then the number of rns In that genome as well as the number of easv's so this this um genome has five copies of the six ns gene and it has four Different versions of that gene in that genome, right? So the question is That is at a threshold of zero But if we increase to a broader threshold does that number drop, right? And so we could look at that for this particular genome by doing something like filter genome ID equals equals that And this then would show us The number of times this shows up In the v19 it's got all of them here, but you can see In my output here that as we expand the definition expand the diameter of how we define an ample consequence, right? The number of unique easv's drops. Okay, and so that's what we want to look at but again across all of the genomes all the species Okay So Now we want to know the 95 5th percentile for each region threshold and number of rrns So what we'll do here is a group by And we will group by again region threshold And n rns Right And we will then Summarize And we will then summarize The upper bound as What are we kind of do so we'll do um quantile? And we will use the n easv column in our probe I will say 0.95 and that's going to get us the 95th percentile So again, this takes a moment or two to run Uh, it's complaining Um, which is kind of surprising. So let's look at what we're outputting here And so we've got n easv's Um, ah s i'm missing the s at the end silly things Um, so again, we run all that we'll get the correct output All this is running be sure that you've liked and subscribed to the channel Um, that way you'll know when the next episode is released. Uh, this is Thanksgiving week or may or may not release one on friday But hey, if you're a subscriber You'll find out you'll be the first to know and so what we see now Is that we've got our threshold and our number of rnn copies as well as the 95th percentile And so we see for this case um, that's basically linear uh, the number of rnn copies increases With uh, the number of copies in the genome But it's not easy to look at that table. I'm now going to go ahead and call this easv's by threshold and region And run that uh, so that we now have that um variable uh stored Um, something I forgot to do was to drop my groups So rerun that and we will now move on to plotting the 95th percentile As a function of region threshold and the number of copies And so we'll pipe this in a ggplot And our aes our x is going to be The nrns And our y is going to be what? the the um Yeah, not the threshold or yeah, we want to plot the 95th percentile. So it's going to be the upper bound Right and then our Color is going to be a threshold All right, and then we're going to make this a geom line And I'm going to facet it by region. So I'll then do facet a wrap And that is going to be a tilde region and let's do Yeah, that's cool. So we'll go ahead and run this And um, we get kind of crazy results and the problem is that I put threshold in quotes And now the problem is that it's treating my threshold as a continuous variable rather than as a discrete Or categorical variable. So for this case, what I'll do is I'm going to do as dot character Threshold and you'll recall that up ahead up above. We had done it by We we'd recast it from a character to numeric that will actually help us with the next plot in this case Um, uh, we actually wanted it as a character, but that's that's fine What I'm going to do actually because now that I see I've plotted this I've got all of The thresholds and it's really overwhelming to see in the output. So what we need to do is a filter On our threshold. So filter threshold And we could do a bunch of things like threshold equals equals zero Threshold and threshold or or threshold equals equals 0.01 or threshold equals equals 0.02 But that gets really long and tedious and painful what we can do instead Is percent in percent and you'll see from my typo that I just had that it's really hard for me to remember Not to use the pipe operator when I start typing with percents So what we're going to do is define a vector of thresholds that we want to look at so I'm going to look at zero Uh 0.01 0.02 0.03 0.04 and 0.05. Okay, so we'll look at Six different regions. It's probably too many. Um We might also instead want to look between zero and 0.01 So we'll look at this plot look at what it looks like and then we'll come back and reassess What threshold levels we're looking at but this should clean up the output a little bit And um, it does look better and we see What so this red line is exact sequence variance, right? So that's an esv 95 percent of the genomes have Many many copies, right? So if you have seven operons, then 95 percent of the genomes are going to have Seven different copies of the esv in the genome. So that's a lot, right? And if we look at the sub regions, we see that that goes down And again, if you look at higher thresholds or the higher diameter for how we're defining an amplicon sequence variant We see that that gets muted as well Maybe what I'll do to make this output a little bit easier to look at That will be to say n row equals four And I will make that taller. So I'll maximize that window Make it look a little bit more attractive and maybe what I'll do is I will go ahead and get rid of These higher to the 0.04 and 0.05 It looks like we do want to look between at these kind of hundreds place levels because It seems like there's still stuff going on at these levels and Again, you could kind of contextualize this however you'd like. The other thing I'm going to do to make Give more room on the right here is this as character threshold. I'm going to pull that out and make a mutate So threshold Is as dot numeric or as character, right? And we'll pipe that in and I can then remove this as character bit And that will then instead of saying as character threshold as the variable name in the legend It's going to give us threshold. And so we can see that like an exact sequence variant for v4 Still is 95 of the genomes at a decent level of Number of copies in the genome is still giving us three or four different ESVs, right? So it's splitting one genome into multiple taxonomic bins. And again, that's at a 90% confidence 95 percentile And so again, this is good because it It shows us the number of copies within a genome and that 95th percentile By our different regions and our different thresholds We can flip this in the second question by looking at the 95th percentile for the number of ESVs for those genomes with a defined number of copies per genome. And so we could pick seven And so what we'll do is very similar to what we did above So ESVs by threshold region And we will then say I'm going to leave out the filter for now and the mutate. Let's go to the ggplot AES and our x is in this case going to be threshold And our y is going to be the upper bound And we are going to color by region And We want to because I want to look at say seven copies. I'm going to go ahead and filter n our ends Equals equals seven and then pipe that into my plot. So let's go ahead and run that and see what this looks like Ah, I forgot a geome geome line So look at that And again for some reason. I don't know why I keep putting the color in quotes We now see our thresholds on the x-axis and our region Is a different each region is a different color. And so we see in this case That like v19 Never gets below two even if we go out to five percent and so even like a five percent otu 95 of the genomes are Getting split into Two two different bins and so that's not so desirable We see like v4 For the most part it does come down to One genome for like a three and five percent. I'm not sure what's going on up here, but you kind of see that as you Come down and you expand your threshold You're collapsing these esvs or asvs Into a single one per genome per species in in our analysis And so all these calls to use really fine thresholds um, it's really Really fracturing genomes into multiple bins and that's not biological, right? There's no reason why one genome would one part of the genome would respond differently than another part of the genome In your analysis that doesn't really make ecological sense, right? You know, my hand doesn't have a different ecology than my right hand, right? It's kind of a same in type of analogy One thing we can think about Is again, if I drop this probe to 90 and then rerun my analyses And looking at my plots that you'll see at a 90 threshold This does kind of dampen a bit As well as this one, right and so at a 90 confidence our v19 Does drop to one or two copies at that 3 threshold. Whereas all the other regions really Crush crash out at about one and a half percent to one copy per genome. So again I'm not sure that 95 percent tile is where we want to be at 90 percent Seems a little bit more generous ultimately it's really going to determine or depend on What you're trying to do and what gives you the the most taxonomic resolution Or the for your particular data set and so you'll recall that You know a lot of the things in our database are genomes that have been sequenced And so we're skewed by things that have genomes that have been sequenced As well as You know, those tend to also be more clinically relevant than perhaps environmentally relevant. So It all depends and again, this is a threshold of seven If I go back to a 95th percent tile run that and Let's look at four say like a genome that had four copies We see that even with a high threshold if I look at genomes that only have four copies Per genome that we do get down to one EASV for the entire genome if use a threshold of two or three percent It's certainly a much higher threshold and broader threshold than the proponents of asv asv's or esv's push Which is much lower down here, right? My conclusions Would include saying so i'm gonna Maybe I'll leave this here and I'll say Analysis depends on Our threshold or our Our comfort with uncertainty So prob equals 90 percentile or 95 percentile And the number of rn copies per genome, right and Kind of Regardless of the number of operons Or um region We need a significantly higher threshold Than es esv's or even um traditional i'll put that in quotes because Traditionals like a couple years Traditional asv's are defined at Frankly Zero point zero three percent or zero point i'm sorry three percent Doesn't look so bad for this type of analysis. I'll correct all my typos Someday I'll learn to type slower And more precise so there's a lot more we could do with this type of analysis, right? I could have bundled this in a package and I could have made a bunch of plots Um, perhaps I could have looked at different probabilities. So I could have picked um a threshold um Or I yeah, I could have picked Um, I could put put my probabilities across the x-axis my And my number and then perhaps my threshold on the y-axis and plotted the threshold that gives me one asv for each level of Uncertainty right So there's lots we could do but I think this gives us a pretty good sense of What thresholds we need to collapse the diversity of esv's down to a single one For genomes and genomes of different numbers of copies per genome So I'll go ahead and save this. I'm gonna Close that and I'll go ahead and make it so I'll do make exploratory 2020 11 24 And this we want to do md not rmd We'll generate that I'll go ahead and commit the uh change to the issue Um merge it to the main branch and we'll be good to go So over the next several episodes, we're probably not going to learn a whole lot of new r concepts But we're going to look at this problem that I talked about today Um, perhaps in different contexts the flip side of using a broader threshold to get fewer esvs per genome Then is that we increase the probability that the asv that we see in one genome Is also seen in another genome or another species And so something that we'll want to think about in the subsequent episodes is how do we balance that? And and perhaps what is more important to us? All right, so I'll go ahead and commit this And we'll see you next time for another episode of code club Please be sure that you like and subscribe to the channel Tell your friends about the videos and be sure if you're in the us or if you're anywhere Have a great week rest of your week. This is Thanksgiving here in the us and we'll talk to you next time