Welcome back to another episode of Code Club, I'm your host Pat Schloss. We're going to be talking more about the group_by and summarize functions in today's episode. Although we've used these functions previously, we haven't talked about a new argument to summarize that was released this past summer in dplyr version 1.0: the .groups argument, which is the topic of today's episode. After playing with it more on some other projects, I find it's actually pretty helpful for making the code in my analyses more explicit and readable. Perhaps you'll recall that when we group our data by some categorical variable using group_by, like the genome that a sequence came from, we can then use summarize to calculate a summary statistic across the observations for that genome. In the last episode, we used this to calculate the number of copies of the 16S rRNA gene in a genome and the number of unique 16S rRNA gene sequences in each genome; we've been calling those amplicon sequence variants (ASVs). Before dplyr version 1.0 was released, the default behavior was to remove the genome-level grouping of the data — that is, to remove the last grouping variable. Now summarize outputs a message telling us what it is actually ungrouping. The change is that summarize now has a .groups argument that lets you tell it how the output should be regrouped. One value is "drop_last", which gets you the pre-1.0 behavior, and it's also the default behavior in 1.0. But it also allows you to use "drop", which removes all grouping, for those cases where you've grouped by two or more variables. Another option is "keep", which tells summarize to keep the grouping that was set with group_by. And finally, a fourth option, "rowwise", creates a group for each row in the data frame. For today's episode, we'll take a deeper look at why we might use the "drop_last", "drop", and "keep" behaviors.
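To make those four options concrete, here's a minimal sketch using a toy data frame (not the genome data from the episode); group_vars() just reports which grouping survives each summarize:

```r
library(dplyr)

toy <- tibble(
  genus   = c("Escherichia", "Escherichia", "Bacillus"),
  species = c("coli", "coli", "subtilis"),
  n_rrn   = c(7, 7, 10)
)

# "drop_last" (the default): peel off only the last grouping variable
toy %>%
  group_by(genus, species) %>%
  summarize(mean_rrn = mean(n_rrn), .groups = "drop_last") %>%
  group_vars()   # "genus"

# "drop": remove all grouping from the output
toy %>%
  group_by(genus, species) %>%
  summarize(mean_rrn = mean(n_rrn), .groups = "drop") %>%
  group_vars()   # character(0)

# "keep": keep the full group_by() specification
toy %>%
  group_by(genus, species) %>%
  summarize(mean_rrn = mean(n_rrn), .groups = "keep") %>%
  group_vars()   # c("genus", "species")
```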
Although the summarize documentation indicates that .groups is an experimental argument, my sense is that it's probably a good thing for making it clearer what's going on in our code, because it forces us to be explicit about what we're doing. We'll see how this plays out in today's episode, where we'll figure out how many copies of the 16S rRNA gene can be expected per genome for each taxonomic group at each taxonomic level. Even if you're only watching this video to learn more about R, and don't have a clue what a 16S rRNA gene is or what taxonomy is, I'm sure you'll get a lot out of today's episode. Please take the time to follow along on your own computer. If you haven't been following along but would like to, welcome! Please be sure to check out the blog post that accompanies this video, where you'll find instructions on catching up, reference notes, and links to supplemental material. The link to the blog post for today's episode is below in the notes. So, I've already filed an issue for today's episode, issue number 28. What we want to do is look at trends in the rrn copy number across taxonomic ranks. I'd like to have the taxonomic rank on the x-axis and the average number of copies in each taxon on the y-axis. Because some of our species have a lot of genomes, while many other species have maybe only one genome sequence, what I'd like to get first, before we look at higher taxonomic ranks, is an average for each species. You'll recall that E. coli has something like, I think, 800 genomes in the database, and that's going to really swamp out the signal from other groups. So if we get one number from each species, we can then look at the averages for the higher taxonomic levels, and that, again, will allow us to control for uneven sampling of the different species.
So again, this is issue 28. I'll come over to my terminal, and you'll see that I already created a template for today's episode as an R Markdown document in our exploratory folder. Let me go ahead and create a branch for the issue — git branch issue-28, git checkout issue-28 — and we're good. I'll go ahead and open up RStudio by opening my .Rproj file. And if I look in my exploratory directory, you'll see that for today's episode I've got this template already created: quantifying the number of rrn operons across taxonomic ranks. I think I should fix my date, which should say 10-05. And again, I have my editor set up to output the chunks to the console; you might prefer to have it embedded directly in the document. For whatever reason — a quirk of how I've worked with this in the past — I really like it going to the console. And then I also have the output set up so that when we finally render the document, it'll look decent, or good, on GitHub. What I've also copied in is the code chunk from the previous episode. What I like to do in these exploratory analyses is build up a bunch of different analyses and then figure out what works, what doesn't, what's going to go in the story, and what's going to get cut. And as we do this, we'll find that there are chunks of code that get copied between different R Markdown documents, or different iterations of our analysis. In the end, when we try to pull everything together, I clean up that code and make it DRY so that we don't have repeated code across our project. Recall that DRY stands for "don't repeat yourself": we only want code represented once, and then we want to call it in different places, maybe by using functions or by creating files that we can then use as input. OK, so again, this was what we had before. I'm going to start by commenting out this part, from the end of line 21 here, before the pipe, through lines 22, 23, 24, 25, and 26.
What these lines did was take our wide data frame and make it tidy. But before I do that, I want to get an average for each species. So I'm going to get the average for each species and then make it tidy. OK, so we'll go ahead and run these different steps. Ah, I have to run my libraries first, and then it will know what the pipe is. And then metadata_asv. If I come down, we get the data frame that we know and love, where we've got a genome ID; kingdom, phylum, class, order, family, genus, species, and strain; the region; the ASV name; and the count — and the count is the number of times each ASV shows up in each genome. So we'll have to do two things first: we will need to get the total number of copies per genome, and we'll need to focus in on the V1-V9 region. And then we'll also come back and get the average for each species. All right, so metadata_asv, and we'll pipe that to filter to remove the V1-V9 region — or to get the V1-V9 region, rather; this will get rid of all the other regions. And if you're paranoid that that didn't work, you could always add a count(region), and we see that everything is from V1-V9, so we're good. Okay, next we need to get the total number of copies per genome. What we could do — or might be tempted to do — would be to group_by genome ID and then summarize, doing, say, n_rrns = sum(count). If we do that, though, one thing you'll notice is that the summarize function creates a new data frame, in this case with two columns: one column will be the group_by variable, and the second will be the summarized column. But if we do this, then we lose all of our taxonomic information. We could always do an inner_join back to metadata_asv, but that just seems kind of silly. What we'll do instead is add our different taxonomic levels to the group_by: kingdom, phylum, class, order, family, genus, species.
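As a rough sketch of those two steps — with a toy stand-in for metadata_asv, since I'm reconstructing the column names from the narration rather than copying them from the repository — the chunk looks something like this:

```r
library(dplyr)

# Toy stand-in for the metadata_asv data frame from the previous episode:
# one row per ASV per genome per region (column names are my guesses).
metadata_asv <- tibble(
  kingdom = "Bacteria", phylum = "Proteobacteria",
  class = "Gammaproteobacteria", order = "Enterobacterales",
  family = "Enterobacteriaceae", genus = "Escherichia", species = "coli",
  genome_id = c("G1", "G1", "G1"),
  region = c("v19", "v19", "v4"),
  asv = c("asv_01", "asv_02", "asv_01"),
  count = c(5, 2, 5)
)

metadata_asv %>%
  filter(region == "v19") %>%                  # keep only the full-length region
  group_by(kingdom, phylum, class, order, family,
           genus, species, genome_id) %>%
  summarize(n_rrns = sum(count))               # total copies per genome; this is
                                               # the step that prints the
                                               # "regrouping output" message
```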
I'm going to leave out the strain because I really just want one representative per species, and this should get us where we want to be. And now you'll see the difference in the output is that, again, we have all of our grouping columns as well as our summary column. And what you'll notice, of course, is that it also got rid of a few other things, right? It got rid of the strain, and it also got rid of the ASV column. So, good. This looks great, and what I now want to do is group by the species, right? I want to get a summary for each species. And it's already grouped by that, I see. So I can go ahead and do summarize, and I'll do mean_rrns = mean(n_rrns). Running that, I now see that I've got species and the mean number of rrns for each of those. Maybe what I'll do is a filter on species, Escherichia coli, to make sure everything looks right. Then we get 7.01 copies, so there's perhaps one or two genomes in the database with maybe eight copies, who knows. But that looks good, right? That's the kind of output we want: one row per species. And at this point, we see that the output is grouped by kingdom, phylum, class, order, family, and genus, which is great. What I would normally do would be to say ungroup, and if we run the ungroup function, then we no longer have any grouping variables on our data. But now what I'd like to do is make the data tidy, so I'll do pivot_longer, and I will use -mean_rrns: I'm going to pivot everything except for that mean_rrns column. The names of the columns will go to rank, and the values in those columns will go to taxon. Let's see — ah, argument four matches values_to, sorry. And so now we get three columns: mean_rrns, rank, and taxon. But I also have some NA values in there. So, as we saw up above, I should probably just copy this down and paste it, right? I'm going to also bring down that mutate with the rank, and I'll get rid of all this other stuff.
Paste that in. And we will use our drop_na as well as our mutate, all right? And if we run that, we got rid of our NA rows, and we now have mean_rrns, rank, and taxon, okay? Great, so this is already pretty quickly turning into quite the pipeline. I now want to group by my rank, right? No, I want to group by my taxon. And, of course, we will notice that the output of this has no grouping variable, right? So we'll go ahead and do group_by taxon — rank and taxon, actually, because if I just did taxon, we would lose that rank information, right? So we'll do rank, taxon, and we'll pipe that to a summarize, and then we will do mean_rrns equals the mean of the mean_rrns column, right? And so now what we see is that we've got a column for the rank, the taxon, and the mean_rrns, and that we've got seven different ranks here, right? So we're in good shape. And we see that the average for bacteria is like 4.33 and for archaea is 1.61. Before I did this, without correcting for the numerous genomes of different species, the bacterial average was actually more like 5.06 or so, I want to say. So that's no doubt due to the overabundance of things like Salmonella and Escherichia coli, which tend to have like seven or so copies. So again, this is a number that we can think of as being corrected for overrepresentation of each species. So this is good, right? This is the data that we can then plot, and I will call this data frame taxon_rrns — maybe I'll do rank_taxon_rrns. But a few things happened in here that I want to point out. As we ran this pipeline — this is how I would have created the pipeline up until about, say, five months ago, right? But in the last few months, over the summer, the developers of dplyr put out a new release of dplyr, and dplyr is where we get these group_by and summarize functions.
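Assembled end to end, the species-averaging half of the pipeline looks roughly like this; the toy tibble stands in for the per-genome counts, and I've trimmed the taxonomy to four ranks to keep the sketch short:

```r
library(dplyr)
library(tidyr)

# Toy per-genome copy-number table (a stand-in for the grouped output of
# the previous chunk; the real table has all seven taxonomic ranks).
n_rrn_per_genome <- tibble(
  kingdom = "Bacteria",
  phylum  = c("Proteobacteria", "Proteobacteria", "Firmicutes"),
  genus   = c("Escherichia", "Escherichia", "Bacillus"),
  species = c("coli", "coli", "subtilis"),
  n_rrns  = c(7, 8, 10)
)

rank_taxon_rrns <- n_rrn_per_genome %>%
  group_by(kingdom, phylum, genus, species) %>%
  summarize(mean_rrns = mean(n_rrns)) %>%   # one value per species
  ungroup() %>%
  pivot_longer(-mean_rrns, names_to = "rank", values_to = "taxon") %>%
  drop_na(taxon) %>%                        # real data has NAs; the toy doesn't
  group_by(rank, taxon) %>%
  summarize(mean_rrns = mean(mean_rrns)) %>%
  ungroup()

rank_taxon_rrns
```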
You'll notice that when we run a couple of lines like this, where we're getting the total number of rrns per genome, the output at the top of this data frame tells us: summarise() regrouping output by 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'. Right, so it's telling us how it is ungrouping, or regrouping, the data. I think of the group_by function as basically making my data like an onion, right, with different layers. So in this case, we've got — how many — four, five, six, seven, eight different layers of our onion. And then we ran that summarize and peeled off the genome ID layer, right? So that layer now is gone, and our output, as we can see, is kingdom through species; genome ID is gone. We then got the mean, and if we add that line to the pipeline, we now see that it tells us it regrouped the output without the species, right? So it removed that species layer of the onion, and we, of course, then see that the output is here. So this is the default behavior; this is what summarize did prior to the release of 1.0. But the message tells us that we can use the .groups argument of summarize, right? And so we can be explicit about how we want to regroup our data. For example, we get this default behavior by giving the value "drop_last" to the .groups argument. And when we run that, we no longer get the message, because we told it explicitly that we wanted to drop the last grouping variable — to drop the genome ID, right? We can do the same thing here on this next summarize as well. And, of course, when we run both of those, we no longer get the message about how it is regrouping, because we told it how we wanted it to regroup. But this actually makes me notice that I forgot I had this ungroup function. The ungroup function following summarize is actually the same as using "drop", right? So if I use "drop", then I get rid of all grouping information.
And so you'll see in the output of the table there's no grouping data there. And I really like to be able to ungroup my data, because when I get into things like pivot_longer or some of these other functions, I tend to see just weird things happen. And really, this gets to the important point for today's episode: I like to be really explicit about how I am grouping my data, and I like to be explicit about what I'm doing in my code just in general, right? So if you knew R and you knew what was going on with the summarize function, you would know that this first summarize is operating on the full list of things I was grouping by. You would then know that the second summarize did everything from kingdom through species, right? But if you weren't so familiar with R, or if there were a lot of other stuff going on in here — perhaps between lines 38, 39, and 40 — it might get a little bit confusing. At first, when I started using this new version of dplyr, it kind of annoyed me that it was giving this output, because, I don't know, I don't like the warning messages or the notifications; I like for the code to just be clean in its output, right? But as I've played with this in other projects, I've come to realize that I actually kind of like these different options. And if you look at the help page for summarize, you will see that there is the .groups argument, and you'll see "Lifecycle: experimental", which tells you that they're not totally sure this is what they want to do. Well, I kind of feel like maybe "drop" might be appropriate as the default. I can imagine there are reasons not to make it the default, but I'm kind of feeling like that's a default I might like to use, because if I put "drop" there, it will force me to be explicit about what I'm grouping by, right? So I would basically be doing this, right?
And that way, if you come to my code and read it, or if I come back to my code in a month or three months and read it, it'll be very clear what's going on, right? And perhaps the example that I have here was already clear, but imagine you had a more complicated pipeline where you've got other steps intermixed and those three lines were split up — it might not be so easy to track what's going on, right? So again, we can run all this and we get the same result. We see that we have one complaint down here; that's because we have the summarize on this step. But what I'm going to do is add .groups = "drop", and that will get rid of it, okay? And then, again, if I look at my data frame, which I called rank_taxon_rrns, I see the values that I had before, which is great. There are a couple other things to be aware of with summarize — not arguments, but values for .groups. So we talked about "drop_last", which again is like peeling off that outer layer of the onion, and "drop", which just takes off all the layers. Then you can also do keep last — or just "keep", rather. And if I look at running these four lines of code with "keep", what I will see is that I still have everything grouped, all those layers, including the genome ID. So basically each row of this data frame becomes its own group, okay? That's another option. I'm not totally sure where I would use that, but it's good to know it's there. And then similar to this is "rowwise", which I don't know that I've ever used before. So now I'm going to start looking for places to use rowwise. Group_by we can think of as grouping on columns, and summarize allows us to do operations on those grouped columns; rowwise basically treats each row as its own group, and that then allows you to do operations on individual rows. Again, I've never used this before, as far as I recall. There is a rowwise function, just like there's a group_by function.
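Since rowwise() is new to me too, here's a toy illustration of what it does (hypothetical numbers, nothing from the episode's data):

```r
library(dplyr)

scores <- tibble(a = c(1, 4), b = c(2, 5), c = c(3, 6))

# rowwise() makes each row its own group, so mean() here is computed
# across the columns within a row rather than down a column.
scores %>%
  rowwise() %>%
  mutate(row_mean = mean(c(a, b, c))) %>%
  ungroup()
# row_mean is 2 for the first row and 5 for the second
```

Without the rowwise() call, the same mutate would compute one grand mean of all six values and recycle it down the column.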
I don't know that I've ever used it, so I'm probably now going to start looking for opportunities to use it and learn more about it. But again, those are the four options we have: "drop_last", which is the default; "drop", which gets rid of all the layers; "keep", which keeps the full group_by listing; and "rowwise", which treats each row in the output as its own group. And again, I'm going to use "drop", because I like the fact that it forces me to state what I'm grouping by. And typically, in a case like this last one — where I'm going to end up having things grouped by rank at the end — I would always follow this with an ungroup function anyway. So putting in the .groups = "drop" is, to me, kind of a nice convenience, and again, it makes my code a bit more clear. So if you think this is all much ado about nothing, know that there is another option to all this, right? You could set the option dplyr.summarize.inform to FALSE — I think that's the name. Yeah, dplyr.summarise.inform, sorry; usually they're very good about supporting both spellings. And if I remove this .groups argument and run all that — maybe summarise does need an s — hmm, for some reason it's still telling me what it's doing, even though I have dplyr.summarise.inform set to FALSE. So I don't know what's going on there. But again, I generally don't like to turn off warning messages, because I think the warnings should mean something, right? And if there is a problem, then I want to know about it. So I'll go ahead and add back in my .groups = "drop". We run that; we're good. And again, if we look at rank_taxon_rrns, we get our three-column data frame: the rank, the taxon, and the mean_rrns. And what I'm now going to do is take rank_taxon_rrns.
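For reference, here's my best reconstruction of the option being reached for there; note that the name uses the British spelling, "summarise", and I can't say for sure why it didn't take in the episode:

```r
library(dplyr)

# Silence the "regrouping output" message globally. The option name is
# spelled "summarise" regardless of which spelling of the function you call.
options(dplyr.summarise.inform = FALSE)

tibble(g = c("a", "a", "b"), x = 1:3) %>%
  group_by(g) %>%
  summarize(total = sum(x))   # no regrouping message with the option set
```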
I'm gonna pipe that to ggplot, with aes(x = rank, y = taxon — no, not y = taxon, but mean_rrns). You'll see that rank is a factor, and we put that code in up here, right, to put things in the order that we had desired. I have strain in there still, but it doesn't care. Then we can do geom_jitter, and — let's give this a run, see how it looks. And we see... yeah, we see a mess. But this is what we're going for. I'm gonna go ahead and put in width = 0.3 to make that cloud a little bit tighter so it doesn't bleed over. Maybe I'll also do alpha = 0.3. What alpha = 0.3 does is make the points transparent, so that where they overlap on top of each other they get darker; it kind of blends in with the background a little bit. I'm a fan of theme_classic, and that cleans it up a little bit. And because I can't help myself, I'm gonna play with the labels a little bit. So, labs: x I'm gonna make NULL, because in cases like this, where I've got categorical variables on the x-axis, I think it's obvious what my variable on the x-axis is, and NULL will remove that line that contains "rank". And y I'll make "Mean number of rrn copies per genome". And then the title — which has gotta be in quotes — I will say is "There is wide variation at the species level in the number of rrn copies", and maybe I'll put a line break in halfway through. And I'll go ahead and put in a subtitle, which will be "Each point represents a single taxon within that rank; numbers based on an average species copy number", and kind of guess about where to put the line breaks in there. So I'll save that; I think that's good. My deep thoughts will be, what? So: bacteria have more copies than archaea, even after correcting for the number of genomes per species; and there was wide variation in the number of rrn operons per taxonomic group. Okay, so that's good. We'll save that, and I'll go ahead and knit it to make sure everything works well here. This outputs roughly what it will look like on GitHub.
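The plotting code narrated above comes out roughly like this; the rank_taxon_rrns tibble here is a toy stand-in, and the label text is my transcription of what gets typed:

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the rank_taxon_rrns data frame built in the pipeline.
rank_taxon_rrns <- tibble(
  rank = factor(c("kingdom", "phylum", "genus", "species"),
                levels = c("kingdom", "phylum", "genus", "species")),
  taxon = c("Bacteria", "Proteobacteria", "Escherichia", "Escherichia coli"),
  mean_rrns = c(4.3, 5.1, 7.2, 7.0)
)

rank_taxon_rrns %>%
  ggplot(aes(x = rank, y = mean_rrns)) +
  geom_jitter(width = 0.3, alpha = 0.3) +   # tighter cloud, transparent points
  theme_classic() +
  labs(x = NULL,
       y = "Mean number of rrn copies per genome",
       title = "There is wide variation at the species level\nin the number of rrn copies",
       subtitle = "Each point represents a single taxon within that rank;\nnumbers based on an average species copy number")
```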
I'm gonna clean up this tab a little bit, see what's going on with this, and re-knit to see if that improves what's going on there. Not really. All right, that's good enough — we accomplished our goal for today's episode. Let me go ahead and commit these changes. So: git add exploratory/2020-10-05, with a star on the end to get all those files; git commit with the message "Construct strip chart of rrn copies per taxonomic group and level. Closes #28", close quote. We're good. git checkout master, git merge issue-28. RStudio is freaking out in the background because I closed stuff — should I close this? Yes. And we're good to push. That's all going, and the issue is closed. And again, we can go to our exploratory directory now and see that we have a good number of reports in here. I don't know what's going on with the tabs; that's kind of annoying. But we see that our plot looks pretty good. Maybe we didn't need all those line breaks — we kind of played with our title here. But again, we can see the average number of rrn copies per taxonomic group when we control for uneven sampling of the species. So, wonderful. What we'll do in the next episode is perhaps draw a line across this to show the average for each rank. One of the really powerful things about ggplot is the ability to think in terms of layers and put a plot on top of a plot on top of a plot, so you can make some really nice effects that way. There's a lot we're gonna play with in the analysis we're working on, using what I'll call amplicon exact sequence variants, where we're taking the data exactly as it is represented in the genome, without accounting for any kind of error in the sequences. Eventually we will come back and entertain some noise, and then we'll talk about, say, amplicon sequence variants, where maybe we allow for a couple of bases of variation to fold sequences together into the same amplicon sequence variant.
So I hope you found this an interesting discussion of the .groups argument. It's always a little bit jarring when there's a new argument to the functions we know and love. I couldn't find a lot of description online for why this decision was made, but hey, if you have any thoughts about the .groups argument, feel free to leave them down below in the comments. I'd love to hear what you're gonna do with the .groups argument and how it's gonna change your workflow and the type of work you're doing. Also, if you can get that dplyr.summarise.inform thing to work, let me know what I did wrong — I don't know. So anyway, keep practicing, and we'll see you next time for another episode of Code Club.