 Welcome back for another episode of Code Club. I'm Pat Schloss. In scientific publications, it's often common to have figures that have multiple parts. I think a lot of people will generate each part separately and then pull them all together in something like Adobe Illustrator or PowerPoint. I think this because, well, that's what I used to do. It was horrible. Depending on the type of figure, there are different ways to compose multi-panel figures in R. In today's episode of Code Club, I'll show you one of the approaches that's called faceting. Sometimes we have a third variable, typically a categorical variable, and including it as a different color or shape is messy or distracting from the overall story you're trying to tell. Facets come in handy because you can pull the data apart by that third variable to more clearly compare your primary variables within that third variable. I've also used it in cases where I have multiple measurements across objects, and I would like each measurement to be its own panel. For example, I might have different measures of alpha diversity, and each measure would be a different panel of the same entities that I'm measuring. For an ongoing analysis, we've been looking at the number of Amplicon Sequence Variants, or ASVs, per genome. We have noted that the number of ASVs per genome varies depending on the region of the 16S RNA gene that we're looking at. Today, we'll see how we can make each region its own panel, or facet, in a single figure. In previous episodes, we've seen that the number of ASVs per copy of the 16S RNA gene is about 0.66. This means that if a genome has 10 copies of the gene, we would expect to see about 6 or 7 different versions of the gene within that genome. So how does this scale across genomes within the same species? We've also shown that for full-length sequences, 82% of the ASVs were unique to a genome. For the subregions, about 76% of the ASVs were unique to a genome. This leads me to the hypothesis that as we sample more genomes from a species, we're likely to keep seeing new ASVs. Besides the problem of splitting a genome into multiple ASVs, is it possible that binning sequences into ASVs may cause us to risk splitting species too finely? To investigate this further, I'd like to look at the relationship between the number of ASVs found in a species as a function of the total number of genomes that have been sampled for that species. Besides looking at the full-length sequences, we'll also look at the V4, V3, V4, and V4, V5 subregions of the 16S gene. We'll show the results of these analyses using facets. Even if you're only watching this video to learn more about R and don't know what a 16S RNA gene is, I'm sure you'll get a lot out of today's video on facets. Please take the time to follow along on your own computer. If you haven't been following along but would like to, welcome. Please be sure to check out the blog post that accompanies this video where you will find instructions on catching up, reference notes, and links to supplemental material. The link to the blog post for today's episode is below in the notes. As I mentioned, today I want to investigate how many ASVs describe a species and whether that number will plateau with the number of genomes. So E. coli has over 900 different genomes represented in the database. And so one question we might ask is, well, are there 10 ASVs for E. coli, or are there maybe 2,000 ASVs for E. coli? And of course, E. coli is just one bug that we all know. But how does it vary across other genomes? And is there a relationship between the number of genomes we've sampled and the total number of ASVs for each species? So I've already created an issue, issue 30. So I'm going to go ahead and move to our project root directory. And as you'll see, the red master tells you that I have a file in here I've already created. This is a template for today's episode. I'll go ahead and create an issue branch, issue 30. Get checkout, issue 30. And so we're on that branch now. So we're in good shape. I'll go ahead and fire up our studio. And as you can see, we're in our project root directory. And this is showing me in our exploratory directory. And a nice thing about naming these files, these exploratory data files with this version of the date, which is the ISO standard. This is how you should represent the date is that if we list them by name, then the last one will be the most recent version, right? They're in chronological order as well as alphabetical. All right. So this is the template that I have in previous episodes. We've already talked about reading in metadata and ASV and making this metadata ASV data frame. So I'll go ahead and run this current chunk. That seems to work well. And this then brings us to our section that we will be working on today of looking at, as I've already mentioned a couple of times, whether the number of ASV per species increases with sampling effort and whether that varies by region within the 16S or RNA gene. So I need to know a few things to make these plots. So on the x-axis, I want to put the number of genomes. So each point in my plot I should back up and say is each point is going to represent a different species. The x-axis position is going to represent the total number of genomes I have for that species. The y-axis, so let me go ahead and write this down. That x is going to be the number of genomes for that species. I'm going to put in the comments why I'm going to make the, it's going to be a ratio. So the ratio of the number of ASVs per genome. So we would expect the number of ASVs per genome to fall off as we increase our sampling of genomes. And so the question is like how steep is that fall off? And then each point represents a different species. And then each facet represents a different region. And I can't spell, so why bother? All right, so we need to build up the data frame that we're going to plot with this. And again it's going to read in metadata ASV and to remind us what that looks like. Oh look, it went ahead and put it in here for me. And I can see all the different columns. And I'm going to be looking at the species level. So I'm going to go ahead and tell the plier what I want to select. So I will select, double check I've got what I want. So let's do genome ID, species, and I want region, right? I want the ASV and I want, do I want the count? I don't think I want the count because that tells us how many times each ASV shows up in each genome. And I'm not really interested in that. I'm purely interested in the total number of ASVs, not their frequency. Okay, so we run those two lines. We then get genome ID, the species name, as well as the region and the ASV. Okay, you know this is kind of driving me nuts. I'm going to have the chunk output to console because it's just too wide and it's just annoying. So I'm going to go ahead and remove that output. And again, if we run this, we see the output and all the columns we're interested in. Great. So what we want to do next is we need to get, as I mentioned, the number of genomes for each species and the ratio of ASVs per genome. So I'm going to go ahead and do a group by, so we'll group by the region and the species, because we're going to group within each species is what we want the information by. And then I'm going to pipe that to a summarize. And I'm going to get a couple of variables. So I want to know the number of genomes for that species. So I will say N genomes and that's going to be N distinct. And then that's going to be on the genome ID column. Right. So if I run this, this will then tell me for each region and species, how many genomes I have, which is good. And as a test, you know what? I'm going to pipe this out to filter and say, give me the data for species equals Escherichia coli. Escherichia coli. And so we see 958 genomes. Okay. And we're going to keep adding to the summarize. And maybe I'll put this filter down on its own line. We also want the number of ASVs. And that is going to be N distinct ASV. Right. So that's going to be the total number of ASVs that we find across all the genomes in the species. And so if we have 958 genomes, we see we have 1013 total ASVs. Okay. So I want to get a ratio and that's going to be the number of ASVs per genome. And ASVs, because, you know, it could be 1000 ASVs, but is that over 10,000 genomes or is that over 100 genomes? Right. We want to be able to scale it for the total number of genomes. And something we might also think about is, you know, would we need to scale it by the copy number? And I don't think so because what I'm really interested in understanding is if I get a ASV, you know, how many total ASVs are there? You know, that's one of how many for that species. And so scaling it by the copy number is kind of a different question. So I'm going to leave it with an ASVs divided by N genomes. And I'm going to call this the ASV rate. Right. And I'm going to go ahead and do my dot groups equals drop. We run this, then we'll see the summary data for E. coli. And again, we see our ASV rate as we'd expect the number of ASVs per genome for Escherichia coli is over one. So every time we add a new genome of E. coli, on average, we get one new ASV, which is, I think, a lot, right? Whereas if we look at, say, like the V4 region, it's 0.2, then for every five genomes we add, we add one new ASV when we're looking at the V4 region for Escherichia coli. Okay. So I'm going to remove that. And I'm going to save this as a variable that we will call species ASVs. We'll load that. And while this is running, it takes a little few seconds. Be sure you like and subscribe this video so you know when the next episodes are released. Great. And so we have our species ASVs. And we again see all the information we want. And we can then go ahead and say species ASVs. And we're going to pipe that into ggplot, AES, for our aesthetic. It's going to be a dot plot. So on the x-axis, we're going to put the number of genomes. So n genomes on the y, we're going to put the ASV rate. And then we will do geom point. And this will throw everything together, which isn't exactly what we want. We see it's pretty hideous down here. So maybe what I'll do is I'll do alpha equals 0.2. And maybe I'll do color equals region for now. And what we're seeing is we get this kind of L-shaped curve, very, very tight, which tells me that maybe we want to put this on a log scale on the x and y-axis. And so we haven't talked about that before. But it's similar to what we've seen with the scale x whatever or scale y whatever. But instead of being, say, continuous or discrete, we'll do scale x log 10 plus scale y log 10. And hopefully this will linearize the shape. And so it's more of a cloud. But we see we've got these different shaded points, right? And each shade is for a different region. And so this is where we come to facets, right? And so what we'd like to do is to perhaps make four separate plots, one for each region. So there's two functions I want to introduce. So facet wrap and facet grid. We're going to start with facet grid. And ultimately, that's not what we really want. And I'll show you why. So facet grid creates a two-dimensional array of figures, right? And what we might imagine is having two variables that we want to make grids by, right? So say we wanted to look at region and as like one variable and as another variable, maybe we wanted to put like gram stain or some other variable. We could then have regions across the top. And then the rows could be, say, whether or not they're gram positive or gram negative. That's a silly example. So in our case, we really only have one example, right? And what we can do is we can use a formula notation where we can say period tilde region. And what that will do is that will say there'll be rows and or I think this is the row and the region will be the column. And so what we'll get on the right here is, yeah. So we get one row and we get four columns and each column corresponds to a different region. And so that looks decent. Again, it's really compressed because of the way we have the output here. But if I do region tilde period, then we will get four rows, one row per region, right? And so this works reasonably well. And again, it works typically better when you actually have two variables that you're trying to compare things by. An alternative that gives you perhaps a little bit more flexibility when you only have one variable is facet wrap. And with facet wrap, we can say facet equals and then what we want to facet by. And so it's going to be region in our case. And in this situation, we get a 2D array of plots and 2D, I guess I should say, we've got two rows, two columns. And I can change how many rows and columns I have here by, say, doing n row equals one. And this will give me one row, or we've already seen two, but say I do three rows, because it doesn't like that for some reason. If I did n call equals to n call, not call, trying to get it to give us a three row look or two row look with three columns. Yeah, again, you can change how many columns and rows you get. If you're to do facet grid, then you can only get the different regions either in the same row or in the same column. I personally prefer the two by two look where we have the four regions. You could make a case and that what I would find compelling that we should have them as a row or as a column. Again, that kind of depends on your own personal aesthetics. Something else we might like to change is where the location of the strip is. This is called the strip on the facet, and by default in this case, it's on top. And so we can use strip.position as it's helping us there. And so if I say bottom, it'll put the strip at the bottom of each of those facets. I don't know if that works well for you or not. Again, we could do left and you can do right to get it on the right side, but this is left. And in this case, I think the default of top is really what we like. Something else that we might find is that these plots don't necessarily all use the same space of the plot. And so they all have the same y-axis and same x-axis. So they all have the same x-axis because the number of genomes doesn't vary by the region we're looking at. But the ASV rate certainly does. And so you'll notice up here that like the V-19 doesn't touch this kind of bottom layer of the plot. So what we could do is if we wanted each plot in the facet or each facet to have its own scale, you can use the scales equals and then you could say free y. And this then will free up the y-axis to have its own independent scale. And so you can look at these and see that they all have their different scales, right? Sure, the top value is 10, but you'll notice that where that 10 falls on the y-axis is a little bit different for all four. And the bottom point, the bottom line is not the same, right? So for V4 it's .01 and for the others it's .1, but they fall at a different position. If we wanted to free up x, which isn't going to change anything here, you would use free x. And if you want to make sure that it's fixed so that all the panels and all the facets have the same x and y, you'd say fixed. And so I think that's the look that I'm going to go with just because it allows me to more easily compare across the four facets. I'm going to go ahead and put each of these arguments on its own line, because we're getting a lot of arguments and we're going to run off the edge of the screen. And what I'd like to do is show you how we can change these labels. And so you can use that using a function or an argument called a labeler. And that equals a function called labeler. And you can do all sorts of sophisticated things here to get the desired label you want. Sometimes people do things like put in Greek letters or mathematical notation or whatever you want. And what I want to do is instead of v19, I want to say v1-v9, v3-v4. So to do this, what I'm going to do is I'm going to create a new vector. And I think we've talked about this in previous episodes. And so what I'll call this are region labels. And it's going to be v1, v9, and then v4. So these are going to be the labels that I actually want to show up. And I need to name the labels, name the region labels. And so these are going to be the actual names that I already have that correspond to those region labels. So v1-9, v4, v3-4, v4-5. Of course, I could have renamed my labels, my regions in my original data frame, but you know, it's okay. So we'll go ahead and do region labels in here in my labeler. And if everything worked, then I should get those nice labels in my grid. And I don't. And I realize that I'm missing an argument here or what I want to say is region equals region label. So I need to say what region or what variable that I'm labeling by do I want to relabel. And so this should work now. And sure enough, we see we have v1-v9, v3-v4. And so the key thing that I had missed was this region equals. And so saying that we're going to apply these labels from region labels to the region variable in our data frame. Now we've been using theme classic. And let's see what this looks like for using facets. So that looks interesting. It puts a solid border around the box. Again, by playing with the themes, you can get those different looks. I don't need these different colors. So I'm going to go ahead and remove color region. This will give a little bit more space in the plotting window. That looks pretty good. And I'm going to go ahead and add geome smooth to get a sense of what the shape of this trend looks like. Is it falling off linearly or does it asymptote? Let's see. And so what we see is that v1-9 is falling off, but not very steep, right? So for the most part, as we add more genomes, the full length ASVs continue to increase. They fall off quicker for the other sub regions like v3-4, v4 and v4-5. But if we believe the spline that this is fitting, it does seem to plateau. And so that there's always perhaps going to be more ASVs that you sample as you add more things at some slow rate, which is interesting and again argues that perhaps ASVs split a species just way too finely. So I'm going to put those in here as my conclusions. So v1-9 continues to add significant, I'll say, as more genomes are sampled from ASVs. And then the subregions seem to have plateaued, indicating that perhaps we'll always add more ASVs for species. Perhaps we really are. And what I would like to know is I would like to look at individual genomes in more detail. And so that's what we're going to do in the next episode. We're going to look at this in more detail, looking at individual genomes to say, you know, what are the number of ASVs across these four different regions for different species? And so to do that, we'll save that for the next episode where we'll learn some new tricks. For now, I'm going to go ahead and save this and fix my typo here. That looks good. And I will come back and create an issue for this. So I'll open this and create a rule for this and make down here at the bottom. Again, should probably make this a little bit more dry because we keep repeating a lot of the same ideas. But the ASV species coverage, copy this to get the name right. And then down here to get the RMD file right. And so if I build this, and so it's saying no rule to make the target ASV species coverage. I misspelled species. I probably did that everywhere I copied and pasted it. Copying pasting is great except when you do it wrong. So let's save that build that up. This is going and I'm going to go ahead and commit this issue. So I'm going to add the make file exploratory stuff. And then we'll build faceted plot of ASV rate by number of genomes. And I'll say addresses number 30, but I'm not going to say closes it because we're going to come back next episode and add some more information to our markdown document and output. That gives us a sense of, you know, for those genomes where we have a lot of genome sequences. What what do the number of ASVs look like in a more tabular format. So this is good. This is a good stopping point. We'll come back to this in the next episode. As I mentioned, please keep practicing with this. Think about where you can use facets in your plots. Are there categorical variables that you can use to split data apart so that you can look at it chunked out by these different levels of a categorical variable. Keep practicing with this play with it. It is a different way of thinking about things. But it's another way that you can use are and are reproducible practices to make a multi figured or multi parts plot that doesn't require you to be kind of throwing things together from different sources in something like Adobe Illustrator or PowerPoint, which is just horribly painful, especially if one of those figures gets updated, then you have to go and do it all right. So at least with facets here that if we update a parameter, if we update the data, it will generate this faceted plot for us all over again without us having to, you know, worry about how to mix and match plots from different analyses. So keep practicing. Be sure that you've subscribed to the channel and that you've clicked on the bell for notifications so you know when the next episode is released. And we'll talk to you next time for another episode of Code Club.