 In a previous episode of Code Club, I showed you how we can use FacetWrap to create different plotting windows according to a categorical variable. In today's episode, we'll use FacetWrap again, but we'll also use FacetGrid to show how we can create separate plotting windows by two categorical variables. Hi, I'm Pat Schloss, and this is Code Club. Each episode, we try to apply reproducible research practices to an interesting biological question. Over the past 40 or so episodes, we've been looking at amplicon or exact sequence variants, a way of describing the diversity of 16S RNA gene sequences that's kind of become the rave in microbial ecology circles. So stay tuned, and we'll tell you more about how we use R to answer questions about these interesting entities. In the last episode, we saw that if we took amplicon sequence variants and defined them at a very narrow threshold, so that the diversity among the sequences that are grouped together in an amplicon sequence variant is very narrow, then we run the risk of splitting a single genome into multiple bins, multiple amplicon sequence variants. To me, this doesn't make a lot of sense because there's no ecological basis for treating different parts of the genome different ways. In some cases, if a genome had seven copies of the 16S gene, they could actually have seven different exact or amplicon sequence variants. But to treat those seven differently doesn't make sense to me. The problem, though, is that as we increase the threshold, sure, we decrease the amount of splitting that happens, but we also increase the amount of lumping that happens. That as we increase the threshold to allow more diversity within an amplicon sequence variant, we run the risk of pulling together different species of bacteria. So there's this trade-off between lumping and splitting. Those of you that are familiar with taxonomy will know that this is a constant battle in the field of taxonomy. I think this is a little bit different because even if you're a hardcore splitter, I don't think you would ever really be a proponent of splitting different parts of a genome into different bacterial species or strains. It just doesn't make sense to me. Also we constantly try to remind people that the 16S gene really doesn't have enough resolution to differentiate at the species level. If you want to differentiate at the species level, you really need a full genome sequence and perhaps use a metric like ANI to differentiate between different bins of species or genomes that belong to those species. So in today's episode, what we're going to do is we're going to use facet wrap and facet grid to see how we can directly compare the level of splitting and lumping for a threshold as well as for four different regions in the 16S gene. Even if you don't know what a 16S RNA gene is, perhaps you're not sure what bacteria even are or why you should care about genomics, stay tuned. You'll learn a lot about how I think through problems, how I break it down from kind of a conceptual logic written in words into our code, how we integrate that with make and version control to make our whole analysis more reproducible. Of course I'd love for you to subscribe to the Rifomonas channel and click that bell icon so you know when the next episode is released. We're really going to be building up our story here over the next few weeks, hopefully leading to some type of manuscript that I'll show you how we will then submit to a journal. I don't want you to miss any of that, so go ahead and click that bell icon. For today's episode, we're going to work on issue 36, which I've already filled in on our issue tracker here on our GitHub repository. We want to know how often is the same EASV found in multiple species, so that's a measure of lumping and what fraction of species have multiple EASVs. That's a measure of splitting and then we want to plot this using different combinations of variables and we'll do that using facet wrap, which we've talked about previously, as well as facet grid, which will be something new for us to talk about today. Also we want to use one genome per species and we want to look at this for each of the regions that we've been tracking in our analysis. Moving over to GitHub, I've already created a branch. You'll see that it is red because I have already created a template file for our lumping and splitting analysis for today, so I will fire up our studio and we will see that we are in our working directory. If I go to files, down at the bottom is my RMD file that I've already created. I basically copied it over from the last episode to this episode, changed the title, changed the date, and we're in good shape. As a refresher of what we did last time in this kind of preamble material, you'll recall that we're largely using the tidyverse and we want to set a seed. This is my birth date, 1976, June 20th. Remember that because I want birthday presents. That's okay. Don't worry about it. Just watch the video, like and subscribe and that'll be enough of a present every day of the year. Anyway, so we're going to read in our metadata. You'll notice in here that we have the code to group by our genome ID and species and then return one genome per species and we're using a slice sample that allows us to randomly pick that one genome. And if we use the same seed, we should get the same genomes. If we wanted, we could run this 100 times and then pull all the results together. I'm not so interested in doing that at this point because I'm still trying to feel out the analysis. And if it's something that I like, then maybe we will do it 100 times so that we can then get an average over all those iterations. But for now we're in good shape. We also are reading in our exact and Amplicon sequence variants file. And then we join those together. Those two data frames together and we are changing the ESV values in that threshold column to be zero because those are really Amplicon sequence variants where the diameter of that bin is zero where there's no difference allowed between the fragments. All right, and then we convert that to a number from character to number, which will allow us to plot threshold on the x-axis without too many problems. So I will go ahead and run this code junk and we will be in good shape. All right. So again, I copied in the information from the issue tracker and I then spent some time going through and outlining what I want to do. So I wrote this all without writing any code, honest, and this really helps me to think about the logic that I want to implement with the code. Right. And as you've seen in previous episodes, when I've done this, my outline isn't always perfect. Sometimes I move things around. Sometimes I add information of things I want to do. But the other advantage of this is it does provide some commentary on why I want to do things and what I'm trying to do. So we will get going with this. And what we're going to do is two types of analysis I've already mentioned. We're going to measure the degree of splitting. And to do that, we're going to group the data by the region within the success gene, the threshold, as well as the genome. And then we're going to count for each of the, for this combination, basically for each genome at each threshold and each region, how many Amplicon sequence variants are there? And if there's more than one, then we want to know what fraction of the genomes have more than one Amplicon sequence variant. Now, the measured degree of lumping is basically the same thing, except instead of including genome ID for pooling our data together, aggregating our data, we're going to include the exact or Amplicon sequence variant information. And then we're going to ask how many genomes have more than one Amplicon sequence variant. So here, how many genomes have more than one Amplicon sequence variant? Here, how many Amplicon sequence variants have more than one genome, okay? And this will then allow us to calculate the two fractions. We'll then pull those two data frames together, and we will then generate a few plots where we are fastening the data by different types of information. So this is great to get us going, and so let's get to it. So metadata, EASV, we will then pipe that to a group by, wherever we see group data by, we want to do group by. So region, threshold, genome ID, we will then pipe that to a summarize function, where we will then do N EASV equals N distinct, and we will then do EASV. And so within each of these groupings, we're going to say how many distinct Amplicon sequence variants are there, and that's going to be our N EASV. And then we will say is split, because this is a measure of splitting the data. So is split, and that will be true if N EASVs is greater than one. We could have done this all in one line, but I kind of like to be explicit in how I lay out my logic. Another sign of being explicit about my logic is to do the dot groups equals drop, which will then remove our grouping variables. And then I can pipe that to group by, and we will then group by region and threshold, and we will then pipe that to a summarize function. For each region, each threshold, we want to know the fraction of splitting, and that will then be the sum of is split divided by the end function, okay? And we will then do dot groups equals drop and be in good shape. As this is running, something to note is that this function splitting uses the sum is split. And that might look weird to some of you, but know that is split is a logical variable made of trues and falses. Numerically, a true is one, and a false is a zero. So if you sum over a logical column, the number you get out is the number of true values, and then that is the sum, right? Another trick that we could have done in place of this would have been to say mean is split, without dividing by the end function. So the mean of 0 and 1 is 0.5, and that will then tell you the fraction of rows, or fraction of genomes in this case, where they are split. Again, more than one way to skin a cat, and this works pretty well. And what we see from the output, at least for the first ten lines from the V19 region, is that as we increase the threshold in the second column, we decrease the level of splitting, which is what we'd expect. And so I'm gonna call this splitting data, and we will load that, and we will then move on to looking at lumping, right? And so we will do metadata, EASV, and we'll then pipe that to a group by function, and again the syntax is gonna be largely the same. Group by region, threshold, EASV, and we will then pipe that to our summarize function. Again, like we saw before, we will then do nLumped, and that is gonna be nDistinct, EASV, I'm sorry, nDistinct genome ID. And we will then say, isLumped, and that's gonna be some, or no, sorry, that's gonna be nLumped, greater than one. Or I shouldn't say lumped, cuz that's not the right thing. So I'll say nGenomes. So how many genomes is each EASV found in? And so that's an nGenomes, greater than one, is a sign that that is being lumped. Okay, we can also then do dot groups equals false, sorry, dot groups equals drop. And we can then pipe that to another group by, where again we'll do region, threshold, summarize, summarize, and we will then do fLumped. And maybe we'll do, yeah, lumped. I don't know why you're splitting up here. So I'll do, where was it, fSplit. And then fLumped is, again, gonna be some isLumped divided by the n function. And we'll do dot groups equals drop. And running that takes a moment or two for it to run. And we see, again, for the v19, that as we increase the threshold, the level of splitting goes up, okay? And let's go ahead and we call this first data frame splitting data. And let's call this lumping data, and that will run nicely. So now what we wanna do is join our lumping and splitting data together. So there's two general ways we could do this. So we could do bind rows, splitting data, and lumping data. And we saw this in a previous episode, maybe two or three episodes ago, we were trying to lump together all of our countable data. And the problem with this is that it pulls together or combines columns that have the same names. But in one of our data frames we have fSplit and the other we have fLumped. And so if one of those columns is missing from the other data frame, it fills it with na values. So bind rows isn't really what we want. We could change the name of the column like we did in that other episode. But I think what would be easier to do would be to inner join and ooh, one of the cool things we get to see here is how to join two data frames by more than one column. We've got two columns that we wanna join on. So before we would say do buy region, right? But we also wanna join using region and the threshold. So let's see how we do that. So we create a vector and the vector are the names of those columns that we want to join by. So we will do region and threshold and add that extra parentheses here. And now what we see is we have a data frame. We don't have those annoying na values. And we have four columns, region, threshold, fSplit and fLumped. We've joined the two data frames using the region and the threshold. Excellent. But I would like to tidy the data a bit so that I have a column indicating the type of approach we use, to measure the fraction of lumping or splitting, as well as a column for the measure of the fraction. Because I would like to make, say two lines, one for lumping and one for the rate of splitting. So we will do pivot longer and the calls will be fSplit and fLumped. And names two will be methods. I'll say method and values two will be fraction. And so then we see that we get region, threshold, method, fraction, or in good shape. And I'm gonna call this lumping splitting data, we'll run all that. And now we're ready to make our plots. So we've got a couple different ways that we can try to do our faceting. And so if we have lumping splitting data, we can then pipe that to ggplot. And our x-axis is gonna be the threshold, our y will be the fraction. And in this case, excuse me, I'm gonna put lumping and splitting as separate lines, and we're gonna facet by the region. So I will then do color equals method. And I will then do a geom line because I want line plot. And then I will do facet, wrap, and I will do tilde region. And what this will give me then is a different panel for each region that I'm looking at. And I don't know why I do this. I always seem to put color in quotes, the method in quotes. Eh, so here we go. This looks much better. And again, what we see in these panels is that as we increase the threshold, we decrease the amount of splitting, but we increase the amount of lumping. You'll see that for several of these regions, basically every region except for v19, we do a pretty poor job of defining a species based on the 16S gene. Even if you allow for splitting genomes, right? And so what we saw with v19 is that like half of the genomes get split. Which is huge considering that there's genomes with not many operons in them. But even here, there's about 4 or 5% of EASVs that are found in multiple species. Again, a sign of lumping. So the 16S really doesn't have enough resolution to get at species level designation, at least as we humans have tried to impose our naming schemes onto bacteria. Again, that's a lot like, this case down here is a lot like Bacillus Sirius, Bacillus Anthracis, and Bacillus Thuringiensis, all getting pooled together or split apart as different species even though the 16S genes are the same. Okay, let's look at this a different way. And again, the syntax is going to be very similar. But let's look at regions as separate lines, but we will facet by our method of lumping or splitting. And our x will be our threshold again, our y will be fraction. And our color, without quotes, will be region. I'll do G online and we will then facet wrap by method. So we'll have two panels, one for lumping, one for splitting. And so again, we see that we have one panel for lumped and one for split. I frankly prefer this approach better than this approach. We've seen this before where we could play with the faceting so that we have a column with four rows. So that it's easier to see the threshold. I like this because it allows me to see within a region, how does the level of lumping and splitting compare. Whereas here, I have to compare back and back across. I don't really care necessarily how lumping compares across the four different regions. Okay, the final approach would be lumping splitting data. And we'll do G plot x equals threshold, y fraction. And I'm gonna leave it at that. And we'll do G online and we will then facet grid and we will do say region tilde method. And this will give us four rows and two columns in our plotting window. The four rows for each of the four regions and the two columns for the different approaches. We can flip the order here. Let me copy this and paste it down below. And I will do method tilde region. We now see that we have the four regions arrayed across the columns and the lumping versus splitting as two rows. And so this is a nice approach too. It keeps the two methods of measuring apart. Really the two y-axis aren't the same thing. It's a fraction of different things. And so this allows you to see for a given threshold, what is the level of lumping? What is the level of splitting? So I like this, I also like this first version too. And so some of it is a matter of personal preference. It's also a matter of what's your question, right? And so some of these other approaches where we faceted by whether we're lumping or splitting or this first approach to the facet grid weren't so attractive to me because they're forcing me to make a comparison with my eyes that I'm not really interested in. Okay, so I'll go ahead and save this and add some conclusions. So as we increase the threshold, splitting drops and lumping increases. Again, we need a decent sized threshold, say greater than 0.01, to reduce the level of splitting and would prefer to limit splitting over lumping because species designations are too squishy. I'll say they're human made rather than bacterial made. Okay, so I'll go ahead and save that and I will come back and open up my make file in Adam and I will add this target to the make file so we can go ahead and build it way down here at the bottom. We'll do exploratory 2020-30-lumpingandsplitting.md. I'll save that, copy the target and we'll go ahead and make it. This will run. I'll go ahead and commit the changes to the repository. I hope you found this description of how I go through an analysis useful. Couple new things that we talked about in here. One was joining two data frames on multiple columns as well as seeing how we can use facet wrap and facet grid to do faceting in different ways and to get our plots to have a different layout in the finally rendered figure. So I hope you can play with this with your own data. Perhaps you're generating different facets. The facet grid is really nice when you have two variables that you want to facet your data by. So again, please be sure to subscribe to the channel and tell your friends about what we're doing here on Code Club. I think there's a lot of great principles that I'm talking about in regards to doing research in a reproducible manner that people get a lot out of regardless of if they give a care about Amplicon sequence variants or bacteria. I don't know who doesn't care about bacteria but I'm sure there's a few people out there. Anyway, keep practicing and we'll see you next time for another episode of Code Club.