 Welcome back for another episode of Code Club. I'm your host, Pat Schloss. One of the things I love about analyzing data is that I can use a small set of tools to answer an endless array of questions. For the previous episodes of Code Club, I've really tried to highlight on one primary concept per episode. We've covered a lot of ground. Be sure you go back and check them out if you're just joining us. Besides learning a new concept, in each of those episodes, we've been advancing a larger analysis looking at the ability of Amplicon Sequence Variance, or ASVs, to differentiate between different species of bacteria. Now, for today's episode, I'm going to revisit the content of many of those episodes and spend some time showing you how I think about problems. When I have a particularly challenging problem, or a problem that will require many steps, I like to create an outline to help me think about my strategy. It allows me to separate how I'm going to do things from what I'm going to do by focusing on the what. This outline can be written in code, but more often, it's written in normal English. We'll call this outline of steps pseudocode. It serves a similar purpose to an outline that we might use to write an abstract or a paper. It helps us focus our efforts, identify the parts that we're unsure of, and gives us a map for where we want to go. Across the previous episodes, we've already seen that a single bacterial genome can have multiple ASVs, and that depending on the species we're looking at, there may be as many ASVs as there are genome sequences for that species. To me, this says that ASVs pose a risk of unnaturally splitting a species and even a genome into multiple taxonomic groupings. In my book, that's not good. Today, I want to ask an equally important question. If I have an ASV, what's the probability that it's also found in another taxonomic group from the same rank? In other words, if I have an ASV from bacillus subtilis, what's the probability of finding it in bacillus serious? Of course, it's probably more likely to find bacillus subtilis ASV in a more closely related organism like bacillus serious than say Escherichia coli, which is much more distantly related. Perhaps we can control for relatedness in a future episode, but for today, I want to show you how I would go about answering this question for any two taxa from the same rank. Even if you're only watching this video to learn more about R and don't know what a 16S RNA gene is, I'm sure you'll get a lot out of today's episode. Please take the time to follow along on your own computer. If you haven't been following along but would like to, welcome. Please be sure to check out the blog post that accompanies this video where you'll find instructions on catching up, reference notes, and links to supplemental material. The link to the blog post for today's video is below in the notes. I've already gone in and created issue 31 to measure the specificity of ASVs for each taxonomic rank and region of the 16S RNA gene. We can imagine that the specificity might vary according to the region. And thinking back to some of the earlier episodes, we looked at the overall specificity of genomes and whether or not the same ASV was found in multiple genomes. And we saw that for the V4 region, there was less specificity of an ASV per genome but more specificity for the full rank. And so we want to kind of expand that analysis instead of looking at individual genomes. We want to look across all taxonomic levels. So we'll do that now. If I come to my terminal, I've already created an issue as well as a R Markdown document. And I will go ahead and open up R Studio. We see that we're in our working directory. If I look over at my files tab, I see my R Markdown document here and quantifying the overlap of ASVs between taxa. And so that's good. In this code chunk we've been using in most of our analyses lately. Again, what it does is it reads in the metadata that we have from the RRNDB about each genome. How it also contains our taxonomic information. And ASV then also reads in our information about which ASVs are found in which genomes as well as how many times they show up in each genome. And what we're really interested in is this metadata ASV. I'm going to go ahead and run this code chunk. That's good. And the question then is how often is the same ASV found in multiple taxa from the same rank? And I kind of already talked about this in the introductory remarks, but you'll recall that we have metadata ASV. And looking at this, we have many columns. We have the genome ID, the kingdom, the phylum, the class order family genus species, strain, the region, the ASV and the count. So I'm not so interested in the count, nor am I interested in like the strain or the genus. And what we really want to know is how often is that ASV found in different taxa from the same rank? So if we look at species, are there multiple species that have the same ASV? So again, how do we think through this? So I'll go ahead and create an outline and I'll use these pound signs. I know I haven't used a lot of comments in our code and I'm guilty of that, but what I'm going to show you today with our pseudocode is a very straightforward way to force yourself to have commented code, right? So for each taxonomic rank, right? So like phylum through species, I'm going to group by the region. What am I going to group by? I'm going to group by my ASVs. And I basically want to know how many taxa in that rank does the ASV appear in, right? That's the overall question, right? And so again, this is the goal. I've even got my favorite typos. This is what we're going to do, right? Okay, so we're going to group by our ASVs and then what I want to know is how many taxa in that rank does the ASV appear in? So that's kind of like a big question, but like maybe a sub step that we might think about is that we would want to know, yeah, so can we group by ASVs and then for each ASV, can we count the number of taxa, right? And then can we count the number of ASVs the total number of ASVs as well as count the number of ASVs that appear in more than one taxa. And then I want to get, find the ratio of the number of ASVs in more than one taxa to the total number of taxa. Or total number of ASVs, right? All right, and so that you can see we've taken that initial question and fleshed it out more, right? Of course, I don't want to just do it for each taxonomic rank. I want to also do it for each region and taxonomic rank. The other thing I'd like to do is to create a plot showing the specificity at each taxonomic rank. Taxonomic rank for each region, right? And so my x-axis is going to be the taxonomic rank. And when I say rank, I mean things like kingdom, phylum, class, order, family, genus species. Y is going to be the specificity or percent of ASVs found in more than one taxa. And my geome that I'm going to use is to do a line plot. And again, I'm going to have different lines for each region of the 16S gene, right? So again, this gives us an outline of where we go, where we want to go and what we want to do. And as I've already shown you with metadata ASV, it's not quite in the format that we want it, right? So let's go back and kind of think also about how we're going to get metadata ASV into the shape that we want it in, right? So metadata ASV is our input data. And we're going to want to focus on taxonomic levels, ranks from kingdom to species. And ASV and region, right? And so you're perhaps thinking, ah, we'll do that with the select function, right? So that's good. We will then kind of make our data tidy, make data frame tidy. We'll also want to remove lines from the data where we don't have a taxonomy, right? So if you remember, I like this Gamma Proteobacterium HDN1, like, let's see. So it doesn't have a genus or order or family name. So those are going to show up in our data frame as NA values. So we'll want to get rid of those. And we'll want to get rid of, we'll want to remove redundant lines, right? So we really only want to focus on unique pairings of our ASV and our taxa name for each rank, okay? And let me go ahead and kind of indent these. I don't know, I like having structure to my outline because it kind of tells me something about the flow and then this will flow into here, right? So this is my pseudocode. This is what I want to do and this is what we're going to do today using all the concepts we've covered in the past almost 30 episodes, right? So what we can do is we're going to pipe metadata ASV into this next line. And again, that's what I'm going to put here. And we're going to use select as we expected. And I'm, instead of listing out all the taxonomic levels that I want, I'm going to say the things I don't want. So I'll say minus genome ID, minus count. And was there something else? Minus strain, right? And so I can run these lines and I then get kingdom file and class order family. And if I scroll to the right here, genus species, region ASV, great. So that's the first step, right? And look at that. We already have a comment on what we want. Now we want to make our data tidy. So I'll do pivot longer. And the calls, I'm going to join on. Again, I'm going to do what I don't want to join on because that's shorter. Again, we've talked about this before if you want to be positive or negative in what you keep or what you remove. So we're going to join everything except for the ASV and the region. And the names too will be rank. Or no, it's going to be, yeah, the names of the column will be the rank. The values too will be the taxa, taxon, okay? And so again, running that, we then get that we have our region, our ASV, our rank, and our taxon. So we're in good shape. Now we see these NA values and I think that's the next step that we want to remove, right? So be sure we add that pipe and then we'll do drop NA on taxon and running that. We then see that we've lost those rows that had the NA values for the taxon. So we're chugging along, right? And so we can hopefully see that by having this outline that allows us to kind of go through our analysis much quicker and perhaps more confidently because we've already thought about the strategy of where we want to go, right? So we also want to remove the redundant lines. I don't know if we can so easily see this here. Yeah, so yeah, I probably can't see it so easily here but we see that there's 781,000 rows here. A lot of those are redundant where it's the same taxa and the same ASV and the same region and everything, right? And so we can use distinct to get the unique rows. And we see we're down to 326,000 rows. So we're really only working on those unique rows. Okay, so now we're gonna think about for each region and taxonomic rank we wanna group by the ASVs. So we're gonna do group by and I'll group by the region and the rank and the ASV. And again, so that's going to effectively create sub tables within our overall data frame. All right, so for each ASV count the number of taxa I'm gonna pipe that into a summarize and I will then say n taxa equals the n function and this will tell me for each ASV how many taxa does it appear in? This will take a couple of seconds to run while it's running. Remember to go ahead and subscribe to the channel if you haven't already and click on the bell icon so you know when the next episode is released. And again, it's telling us that as we talked about in a previous episode that it's regrouping the output by region and rank. And so what I'm gonna do is with this summarize I'll say dot groups equals drop last to make it explicit what I want. So that's great. And so now we're looking within each region and each rank and where did the output go? All right, it showed it down here below. And so we see the region, the rank, the ASV and the number of AS, number of taxa that each ASV is found in. Okay, so what I'd like to know is what's the number of ASVs that appear in more than one taxa? I think initially I had this highlighted step first but at this point I think it's gonna be easier because what I'm gonna do is I'm gonna pipe this to a count function where I'm going to count the n taxa, right? And so we run this and so what we see now is the region, the rank, the number of taxa that the ASV is found in and then the number of taxa where the ASV or the number of ASVs, right, the number of ASVs that are only found in one taxa and the number of ASVs that are found in two taxas, right? Taxa. And so again, this number 48 indicates there's 48 ASVs that are found in two genera but there's 23,085 ASVs that are only found in one genera, and so that's great. What we'd like to do now is that we wanna count the total number of ASVs and find the ratio of the number of ASVs is more than one taxa, is in more than one taxa is the number of ASVs. So what I could do I think, yeah, and so, and we're currently grouping by our region and rank and we will then pipe this and I'm gonna do this next step as a summarize and what I'm gonna do is call this overlap and the overlap, if I kinda look down at my columns here, minimize that, is that overlap is gonna be this, so it's gonna be the number that are found in more than one taxa and so that's gonna be some, we can sum n taxa greater than one, so if it's greater than one it's gonna be true, true has a value of one and then I wanna multiply that by n or the n column, not the n function because I want that number, right? So for v19 class I want the value to be one in the numerator, the denominator I'm gonna want to be 23,119 but for family I want it to be four in the numerator I guess it's gonna be 23,119 plus one, right? So that divided by the sum of the n column, right? So we're gonna look at these two values and basically the output should be one divided by 23,120 and we run all this and I'm forgetting a parenthesis at the end here, I run that unexpected symbol, I'm not sure what that was about but rerunning it, we then get all these different, the region, the rank as well as the level of overlap. I see that I'm getting two values for each of those. So I think I know what's happening, I think I've got a problem with parenthesis here that what's happening is that I'm outputting a data frame that's the same number of rows as the sub data frame, right? And so if class had one and two, then that's because this is gonna give me one number, right? So this is gonna give me a zero or one and then the n is gonna have two values. So in this first case for class it was like some really big number and then like one, right? And so I think what I really wanna do is I wanted to do this that because the order of operations scares me, I think I was just missing a set of parenthesis there. So let's go ahead and give this another shot and I think we should be in good shape. And sure enough, yeah, now we have the region, the rank and the overlap for each region and each taxonomic rank, okay? Very good. Again, the trick there was the problem was my problem of parenthesis that overlap was as long as the input data when it really should only had one row of output. So I'm gonna go ahead and save this, or write this to a variable, all called overlap data because I wanna use this to build a plot and as we say down here, right? And I've kind of given myself the specifications. So we'll say overlap data, I'm gonna pipe this to ggplot and then my x is going to be rank, my y is going to be the overlap and my group is going to be by region and color is gonna be by region, right? And then geomline, that should be good and let's give that a run and we see kind of an ugly plot over here. I'm gonna go ahead for now and go ahead and put this back in line. Let's see where to go. There we are. And so we see that these levels are out of whack, right? And so just because we write pseudocode doesn't mean that that pseudocode is like written in the stone and that's exactly what we have to do. It helps us get maybe 90% of the problems ironed out and of course we're gonna have problems figuring out what the code should be as I showed up here on line 55 and also we're gonna identify new problems, right? Like that our taxonomic levels are out of order. So what I'm gonna do is I'm gonna add a mutate line back up here to mutate rank to be a factor and I'm gonna set levels as kingdom, phylum, order, family, genus and species. Run that and then we can come down and run this other bit of code here and we now see that we've got kingdom flat. We've got our levels right in our regions. We could go in and make this look a little bit more attractive but for now I think this looks really good and what we see is as we kind of saw earlier with the genome data that the V4 and V34 and V45 data really are far less specific than the V19 for a species and genus. Once we get to like the order level and higher, there's really no overlap that they're very specific to those taxonomic groups but we do see that if you take any random ASV there's about a 13% chance that that ASV if you have a V4 will be shared by at least one other species. Okay, so I'm gonna go ahead and save this and I'm gonna write some conclusions and so the subregions are less specific at the species level, then full length and maybe what I will do is to get some data, I'll do overlap data and then I'll filter rank species, right? And so this then shows us the output of that and we talked last time about using cable perhaps and something I'm gonna do to add to this mutate and I wanna write these as percentages. So I'll say mutate overlap, 100 times overlap to get things into percentages, run all these chunks at the same time. So subregions less specific than this, at the species level, then full length. Still full length has, it outputted it into this other data frame at about 3.6% or between 8.8 and 12.6% overlap, okay? So something that occurs to me still is that this analysis does not control for uneven sampling of different species, right? So recall that like E. coli has over a thousand or close to a thousand genomes in the database and so that's potentially skewing what we see. So in the next episode, we're gonna come back and think about how we can account for that uneven sampling and perhaps do something like sample five genomes from each species and see how that changes the results. And I'm gonna go ahead for now and pipe this out to cable and remember we need to add library knitter. Wonderful, showed both to us, maybe not. Anyway, this gets us pretty good shape answering the issue. I'm not gonna totally close out the issue though because I would like to come back on this issue and correct for uneven depth of sampling. So I'll go ahead and do get status, that we need to add something to our make file. I will come down and create a rule for all this, so 2020, 10, 22. And I was calling this ASV taxa overlap, change all the names, that should be good. I will then make that ASV taxa overlap, 21, not 22. Gotta get the date right. And then I tried to build the wrong target. Fun with copying and pasting, huh? So this is gonna run for us. I'll go ahead and commit the changes that we have, create plot of spest of overlap of ASVs across taxa within a rank, addresses number 31, and we're good. So again, what I wanted to drive home in today's episode was this idea of making pseudo code. And again, we could take a problem that might seem really challenging at first. And by breaking it down into steps, we can think about what are the steps gonna be? What are the prerequisites that we need to get to that point? What are we gonna do with the output? And again, in my mind, for me at least, it helps me to liberate the thinking of what do I wanna do versus how am I going to do it? Those are different questions. And often when we stare at a blank screen, we're thinking about both of the things at the same time. Pseudo code allows me to separate the how from the what and focuses on what am I going to do? As you get better at writing your own pseudo code and thinking through problems, you might kind of put in some of that information, some of that coding information into your pseudo code. So here instead of say, make data frame tidy, I may have said pivot longer to make a column for taxa and rank, right? But depending on where you are, the other thing this allows you to do is it allows you to identify those points that you're not sure how to do what you want to do, right? And so perhaps that involves a Google search. Perhaps that involves taking that step and breaking it down into finer steps. But it really is a great organizational tool to help you to think about how you're gonna do your data analysis. So I would encourage you on anything you're working on for your projects to come up with some pseudo code and then flesh it out with the actual code and see if that doesn't help you get through some of those more challenging problems in your data analysis. So I hope you keep practicing. Leave some comments down below in the notes, in the comments about what you thought of today's episode. Have you tried using pseudo code? Does it work for you? Does it not? Where do you find that you still struggle even after you've got pseudo code? All right, so keep practicing. Tell your friends about these episodes of Code Club. Be sure that you subscribe to the channel and we'll see you next time for another episode of Code Club where we'll dig into this question of how can we correct for uneven sampling of our species?