 A super common task in analyzing data is to measure summary statistics for different groups of items. We might want to know the average age for Republicans versus Democrats. Perhaps we want to know whether men have higher salaries than women. Maybe we want to determine how salaries vary by level of education. Each of these questions requires us to take a set of data, subset the data by a group, and then calculate a value for each group. So how do we do this in R? One approach might be to extract the data corresponding to Republicans from the data set and then calculate the average age. We can then repeat for the Democrats. Let's say there was a third party we were interested in. Well, we would repeat the process, but for the third party. But what if there's a fourth party? A fifth? Well, that would get tedious very quickly. There's an alternative approach available to us using the de-plier package, which is also part of the tidyverse. The de-plier approach is super powerful. We can use the group by function to group our data by a categorical variable like political party. We can then use the summarize function to calculate summary statistics for each value of the categorical variable. This approach is super adaptable to any changes in the data. I don't need to know all of the political parties to carry out this process, which is part of what makes it so powerful. Furthermore, we can apply this general approach to a wide array of questions. Now, if you've forgotten what we've been trying to do over the past 17 episodes, I don't blame you. A lot of the work in a data analysis pipeline is getting data into a format that allows us to answer our questions. You might recall that bacterial genomes tend to have more than one copy of the 16S RNA gene. So this raises a few questions. First, how many copies of the gene do bacterial genomes typically have? Second, if there's multiple copies of the gene in a genome, are they identical to each other or are there differences? Finally, if we find a sequence in one genome, how often might we find it in another genome? These questions are important to the ongoing discussion in microbial ecology because there's a push to using distinct sequences or Amplicon sequence variants, also called ASVs, as surrogates for bacterial taxa rather than using clusters of related sequences. Let's think about the data as we currently have it in our data frame. We have a column that contains an identifier for each 16S RNA gene sequence that is for the ASV that was found in the RRNDB. We have a second column that contains an identifier for each genome in that database. Finally, we have a column that indicates the number of times each ASV was found in each genome. But how does this relate back to the examples I started with at the beginning of this episode? How is this similar to grouping people by political party? Well, if I grouped the data by genome, I could count the number of 16S genes in each genome. I could also count the number of ASVs in each genome. Alternatively, if I grouped the data by ASV, I could count the number of genomes where each ASV was found. We're really going to leverage this approach in future episodes. Today I'll be talking about using full-length sequences, but I could also use a subregion of the gene and see how these counts change and I might even do that for the V4 region in this episode. But instead of looking at the genome level, I could also look at the species, genus, or any higher taxonomic level as well. But we'll look at how those counts vary by taxonomic level in a future episode. For today's episode, we'll use the group by and summarize functions and a few other helper functions to do an exploratory data analysis of the sensitivity and specificity of ASVs to each genome. Even if you're only watching this video to learn more about R and don't have a clue what a 16S RNA gene is, I'm sure you'll get a lot out of today's episode. Please take the time to follow along on your computer. If you haven't been following along but would like to, welcome. Please check out the blog post that accompanies this video where you'll find instructions on catching up, reference notes, and links to supplemental material. The link to the blog post for today's video is below in the notes. Excellent, so by now we should be old pros at getting into our project directory. Again, we've got the alias RRN that moves us there. And we see that we're on the master branch. It's green. All systems are good. So today's work is going to be largely exploratory. So I'm not going to create an issue, but something I want to remind you is that we have a directory in our project route called exploratory. And if we look in exploratory, we'll see that all that's there is a readme file that's also blank. And so that readme file is there so that we can kind of keep this under version control if we choose to commit anything. I'm going to go ahead and open up the Schloss RRN analysis, our project file. This will then launch our studio for us as we saw in the last episode. Great, this brings us into our R session. I'm going to create a new R script. And I'm going to save this as, let me expand that because I want to save this, not in code, but I'm going to put it into exploratory. And I'm going to save it as 202009-09. So I'm going to post this on the 10th. But for me today, it's the ninth. And I will do 0909.R. And so I'll save this as a R script with today's date. I probably should give it something a little bit more descriptive than just the date. Maybe what I'll put here is something like genome sends spec. So genome sensitivity specificity.R. And go ahead and save that there. I'm going to start this with a library tidyverse and go ahead and load the tidyverse. So we've got that in good shape. What I'm going to work with is that count underscore tibble file that we had from the last episode. So I'll call that FL for full length data. And we'll do read underscore TSV. And that was in data v19 forward slash. Where did I call that? Yeah, RNDB, count underscore tibble. And so if we run that, we see that this loaded very well. I could in my read TSV go ahead and specify that ASV is a call character, genome is a call character, and count is call double. But again, for this type of kind of rough exploratory analysis, I'm not going to worry too much about that. These aren't warning messages, but I like to be explicit with read TSV and telling it what these different things represent. So again, if we look at FL, our data frame, we see that our first column is called ASV. And so this is an identifier for each unique 16S sequence that was found in the RRNDB database, that database of all of the 16S copies in all of the genomes that have been sequenced. We have a column for the genome identifier, the accession number for the genome. And then we have a count for the number of times each ASV was found in each genome. Now, what we're going to do is if we do FL, we can use the pipe character, which we saw in the last episode. And we've seen in other episodes, if you've been a loyal watcher way back to the original episodes of Code Club, you've seen me use this pipe. And really what this is doing is it's directing the flow of data through my pipeline. So what I'm going to start by doing is let's group our data by the genome because what I'd like to find out is how many copies of the 16S RNA gene occur in each genome. And what we can do is we can do group by and we'll group by genome. Now, if I run these two lines, the output that I get is basically the same as I had up above. And so if you notice, there's a subtle difference and that's that there's this line now groups colon genome and it tells us that there's 15,389 genomes in our analysis. But otherwise, nothing has changed, right? Now, what I'd like to do is for each genome, I would then like to do summarize. So we're going to group by and then summarize. And I will then say n underscore rrns, maybe just nrn, so the number of rrn, that's the operon that the 16S gene is in. And we'll say that is equal to the sum of count, right? And so you can see here that I've got the first nine rows in my data frame all come from the same genome and that we've got these different ASVs and so it's got nine ASVs and they're all different, right? So this is kind of like a worst case scenario, if you will, for ASVs. And so if we wanted to add these together, we would use sum, the sum function, which takes a vector of numbers, a series of numbers, and adds them together. And so we'll go ahead and run that. And again, what we find is that initial genome name we had had nine copies. And we see that sum have one, six, four, five, seven, 12, right? And so we see a wide variety of numbers of copies in there. We could also say pipe this into ggplot. So ggplot and we could then say aes and on the x-axis I'm going to put nrrn and I'm going to use geohmhistogram and this will then make a histogram for me that you can see in the bottom right corner here. And that looks a little bit messy. So maybe what I'll do instead would be say binwidth equals one. And maybe we get, we kind of get like a bimodal distribution, don't we, that we kind of have a peak at about four and then we also have a peak at about seven or eight and there's one that's way out here at 21. So we see that most of the genomes actually have more than one copy of the 16S RNA gene in them. And we'll talk more about ggplot in future episodes, but I just wanted to give a quick illustration of what that distribution looks like. Okay, so that was the first question we had, right? Which was how many rn copies are in each genome? So something that I'm seeing in my output here is that summarize is ungrouping the output and so we can override this with the dot groups argument. So if we do question mark summarize, and you can write summarize with a s or with z, I believe the s is the New Zealand or the UK spelling, that if we look at the groups argument, we see that this is an experimental part of the life cycle and that what we can do is drop. So if we say groups equals drop, then we will remove all levels of grouping. And what we can do is up here in our summarize line we could do dot groups equals quote drop. And if we run that, we see that that information message is removed. Now what we might like to do is to know what fraction of genomes only have one copy. Well, we can repeat these first three lines here and we can actually group by the number of copies. So we can group by the number of copies and we can then say summarize and we could then say n genomes equals and we can use the n function and that's going to count the number of genomes that have each number of copy. And so what you'll see here then in this output is that there were 1566 genomes that only had one copy, 1740 that had two and so forth, right? So I think the most common is 2,671 with seven copies and if I was a betting man, I bet that there's a lot of E. coli genomes in the database and E. coli has I think about seven copies in its genome. Now we could add a little bit of interpretation to this by using a function called mutate which will add a column to our data frame and so we could then say fraction, right? And so we'll create a column called fraction which will be n genomes divided by the sum of n genomes, right? And so these helper functions like sum we can use in many different contexts and so what we're doing is we're creating a column called fraction which is going to be the number of genomes in each class, in each number of RN copies and then dividing that by the total number of genomes and again up here for summarize, I want to do dot groups equals drop to get rid of that message. And what we find is that about 10% of the genomes have only one copy and as we saw before about 17% of the genomes have seven copies of the 16-S RNA gene, okay? So again, this was the initial question we posed. We can think about things in a tabular format like we generated here or as a histogram like we had to the right. So the next question I want to take on is what is the number of ASVs per genome? And again, the workflow is going to be very similar. So we'll do FL and we're going to pipe that to a group by and again I'm going to group by genome and I'm going to use my summarize function to then say NASVs, or NASV and that's going to be equal to the number of rows. So we'll use the N. So sum adds up a column whereas N will return the number of rows and N doesn't take any arguments. And again, if we run this we find that that initial genome that we saw before had nine copies and has nine different ASVs in it. And so we might like to know well that's the number of ASVs but what's also the number of copies? So we might want to know N, R, N and we saw this up above as the sum of count. So we're kind of combining both workflows here and again we'll do dot groups equals drop and what we'll see is all of our genomes in the first column, the number of ASVs in each of those and then the number of total copies of the six-ton S gene in their genome. Maybe what I'd like to do is go another step here similar to what we did with counting the number of copies per genome and I can again do group I, N, AS, N, R, N and I want to know what is the median number of ASVs for that number of six-ton S copies, right? So I can do summarize and I can then say meadNASV and that will be medianNASV and this then again gives me all of the number of R, Ns and then the median number of ASVs so we see that the median number of ASVs if you have three copies in your genome is one if you have seven the median number is five actually for full-length sequences. Perhaps we'd also like to get the quartiles so what we could do would be L, Q, N, ASV so for the lower quartile and we can use the quantile function on NASV and we can say prob equals 0.25 this will give us the 25th percent confidence interval and then we can do the upper quartile on NASVs quantile again, N, ASV, prob equals 0.75 I feel like we kind of did this type of work before when we were talking about weather data in a much earlier episode and so again we can see the interquartile range the median value and so we see that if we have seven copies that 50% of the genomes in our database have between three and six ASVs in them if there are seven copies of the six-ton S RNA gene in it then this is a tabular output what I could do is I can copy these three lines here if I can get my highlighting right and I can pipe this again to ggplot AES and I can say x is the number of RN copies y could be the number of ASVs and I'm going to send this to geomesmooth we talked about using geomesmooth previously and then we were plotting the lamb price data and I can use method equals quote lm and so if we plot this what we get is a straight line, as we might expect between the number of copies and the number of ASVs and that it rises such that at about if you have 15 copies of the six-ton S gene then you're going to have about on average 10 ASVs in your genome that rises at a slope of about two-thirds you might say so again as you have more copies of the six-ton S gene in your genome you're more likely to have more ASVs per genome whether or not this is a good thing we'll revisit as we go through the analysis okay so that was what is the number of ASVs per genome and the final question that I was interested in is how many genomes does each ASV appear in? okay so we've been grouping by genome what we're going to do then is to group by ASV right and so we're going to group by the ASV and what I want to know for each ASV then is how many genomes are represented for each ASV and so I can do summarize and I can then say N genomes and that will then be N because that will tell us for each of our ASVs how many genomes did it appear in and again we see that there's some where an ASV shows up in three different genomes one, two, three so if we want to count the number of times ASVs show up across different genomes we could then pipe that to the count function and we can then count N genomes and this will count the number of times an ASV shows up in three genomes or two or one or seven or however many and we run this and we find that there is in general an ASV is found in one genome right again one genome might have multiple ASVs but if it has that ASV that ASV tends to be unique for that genome and so to get a sense of what percentage or what fraction of all ASVs this is represented by we can come back and again do a mutate to get a fraction and then do N divided by sum of N and again what we find is that 32% of the ASVs are unique to a given genome but again what we found kind of as shown by the plot over here on the right is that a genome as it has more copies of the 6S gene will have more ASVs alright so again this is a kind of a quick exploratory analysis that we have done and I guess what I'm noticing up here is that I can refactor this code a little bit so if I were to run this again you'll recall that this gave us output very similar to what we had before so perhaps instead of group by and summarize what I could then instead do would be to do count N, R, N and this will give us very much the same type of output again so the number of RNs and how many genomes that is and so here I'm going to change N genomes to N right again we get the same output but with a slightly different approach that count is a special case where if you're doing group by and then using summarize with the N function you can replace that with the count function on the column that you would normally be grouping by okay again both approaches work the count is a little bit more concise alright so this is the V, sorry the full length data I hinted that maybe we'll go ahead and try this with the V4 data so how about with the V4 data okay so instead of full length I'll call this V4 and I'm going to replace that V19 with a 4 and read that in and what I'm going to do is I'm going to come back to two of the analyses that we did this first one that I just refactored where we're looking at how many copies of well this isn't going to change right so how many RN copies in a genome that doesn't vary whether it's full length or a subregion but this question of of these two steps sorry my mouse got a little bit silly there I'm going to copy this down and instead of FL we'll do V4 and what this again tells us is as we increase the number of RN copies how many ASVs are we going to have and again if we run this we get a tabular view and if we had seven copies of the 16S gene in a genome then we're very likely to only have one or maybe two copies one or two ASVs in that genome so again the V4 region has a lot less information and isn't going to discriminate across copies of the 16S gene in the same way that like a full length gene would if we plot this we would then expect the slope to be considerably flatter and so what we find is that for 10 copies of the 16S gene that we would expect there to be maybe about one in three quarters one to two one to two different ASVs in the genome and so a subtle difference that we might notice here is that these data in the table are the medians and over here is the mean and we could you know perhaps think about changing this so instead of median maybe we do mean and that we would see then that for seven copies we'd expect about one and a half ASVs per genome which is basically what we see over here right okay so that's looking at the number of ASVs per number of copies in the genome let's go ahead and ask about how many genomes each V4 ASV appears in I suspect that because there's less variation in the V4 region relative to the full length gene that we'll be more likely to see the same ASV in multiple genomes so you'll recall that 82% of the full length ASVs were unique to each given genome let's see what it is for the V4 region so what do you think it's going to be what percentage well what we find is that it actually goes down to about 76% so of those ASVs that we observe 76% of them are unique to a specific genome and we could again think about other variable regions within the 16s gene and something we also might think about is that this again is looking at the specificity of an ASV for a given genome we might then say well at the species level or the genus level or what if we kind of expand the definition what we're talking about here are exact sequence variants but as implemented in a lot of software doing Amplicon sequence variants they actually allow for a little bit of slop in their definition of an Amplicon sequence variant so they might need to entertain one or two nucleotide differences so what I want to tell you is that we are going to get to those questions but this general approach that we are going to summarize will be seen later when we say group by a species or group by a genus and count the number of ASVs or look to see how many species does each ASV appear in so does an ASV appear across multiple species or is it unique to a given species and then again as we expand the definition of what an ASV is to allow some more wiggle room and variation what then also happens to the sensitivity and specificity of those Amplicon sequence variants and again as I mentioned there are some taxa like E. coli that have maybe a thousand genomes represented in this database and so we'll also want to correct for the inflated number of genomes and inflated representation that we see in the database okay so don't forget to like and subscribe the video and to go ahead and click on the bell so you know when the next video drops and to tell your friends about what we're doing here in these Code Club episodes also I would love to hear what you're doing to analyze data using the material that I'm covering in these episodes go ahead and put them in the comments down below if you have questions about what we're doing or have ideas for where we might take the project going forward in the comments as well for now I'm going to go ahead and save this R script and quit out of R Studio and I'll say goodbye and see you on the next episode of Code Club