Hey folks, in this series of episodes I'm looking at a variety of approaches for quantifying the dissimilarity or distance between different bacterial communities. We've stumbled upon an issue that rears its head now and then, which is rarefaction, or really, how do we deal with communities that have different numbers of observations in them? What we commonly find in the literature is between 10- and 100-fold variation in the number of sequences across communities. That means we might have 2,000 sequences in one community and 200,000 sequences in another. It doesn't seem fair to compare those communities without some type of correction for that uneven sampling effort. In the last episode I showed you what happens to Bray-Curtis distances if we don't apply any correction, and I also showed you what happens if we use a process called rarefaction. Now I want to go back and explain rarefaction again, because today we're going to talk about other approaches for doing the correction, namely relative abundance and normalization. Okay, so for rarefaction, again if we had 2,000 and 200,000 sequences, we would draw 2,000 sequences from both communities, calculate the dissimilarity, put those 2,000 sequences back into their respective pools, pull out another 2,000, calculate the distance, set that distance aside, put the 2,000 back, and repeat that 100 or 1,000 times. With those 100 or 1,000 distances we can then calculate the average distance, as well as, if you want, the standard deviation of those distances. That process of repeatedly drawing 2,000 sequences is called rarefaction. It is different in my mind from subsampling. Subsampling is rarefaction where you only do it once, right? So if I were to grab 2,000 from both communities, calculate a distance, and then move on with that distance, that would be subsampling. The problem with that is that it's sensitive to the whims of the random number generator and which 2,000 sequences you happen to draw. But by doing it 100 or 1,000 times, some large number of times, we average across the luck of the random number generator. Now, people don't like rarefaction for a number of reasons. One critique that frequently comes back to me is that people feel like they're throwing away data. So again, with 2,000 and 200,000, people think that with the 200,000 we're basically throwing away 198,000 sequences. I would say that's not really true, because when we rarefy the data, we're using all of that data to help us calculate a distance to the smaller sample. Another common critique of rarefaction is that when I go from 200,000 down to 2,000, there can be a fair amount of variation in the distance I calculate relative to that other library. Because you're grabbing 2,000 out of a much larger pool, there's going to be some variation, and that variation gets lost when you start comparing distances. One approach to overcoming the perceived limitations of rarefaction is to calculate relative abundances. In my mind, relative abundance doesn't really gain you anything over not rarefying your data, because the limit of detection is effectively the same as it was originally. What I mean is, if we take the total number of sequences in each sample and use that to divide the counts across the populations in that sample, well, we haven't really changed anything, right?
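To make that repeated-subsampling idea concrete, here is a minimal sketch in R. The helper name rarefy_bray is hypothetical, and I'm assuming the two communities come in as named count vectors over the same set of OTUs; in practice vegan's avgdist does this kind of thing for you.

```r
library(vegan)

# Hypothetical helper illustrating the procedure: a and b are named count
# vectors over the same set of OTUs, e.g. a = c(otu1 = 40, otu2 = 1960, ...)
rarefy_bray <- function(a, b, n_min = 2000, iters = 100) {
  dists <- replicate(iters, {
    # draw n_min sequences from each pool, then tally them back up by OTU
    a_sub <- table(factor(sample(rep(names(a), a), n_min), levels = names(a)))
    b_sub <- table(factor(sample(rep(names(b), b), n_min), levels = names(b)))
    # Bray-Curtis distance between the two subsampled communities
    vegdist(rbind(a_sub, b_sub), method = "bray")[1]
  })
  c(mean = mean(dists), sd = sd(dists))  # average across the iterations
}
```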
Coming back to relative abundance: if you had 200,000 sequences, you're still going to see far more populations in that community than if you'd had 2,000, right? So in my mind, that doesn't really change anything. But in today's episode we'll dig into that, and we'll see what the effects of using relative abundance are on calculating the distances. A second approach is normalization, where you take the relative abundance data and multiply it by a constant, say the size of your smallest library. You multiply that constant across all your relative abundances, perhaps turn all those numbers into integers, and then you've got your normalized data set. Returning to those critiques of rarefaction, the first being that you lose a lot of the data going from 200,000 down to 2,000: well, with normalization, I'm not sure we're gaining anything there, right? We'll still end up with 2,000 sequences from that larger sample, we've just gotten there by a different route. The second critique of rarefaction was that we have varying levels of variation for each of our distance values, or whatever value we're measuring by rarefaction. I'm not totally sure that actually goes away; I think normalization maybe hides some of it. Anyway, what I want to do in this episode is show you how we can calculate relative abundances from our data, as well as use normalization, including a special version of normalization that's been published in the literature, to calculate our distances and compare them back to what we had done previously with no rarefaction and with rarefying our data. Let's head over to our studio now and get going on implementing these approaches in the distances.R script we've been working on over the last few episodes. If you want to get a hold of distances.R, as well as all the other scripts in this project, there's a link to the code down below in the description. Again, we have this distances.R script, where we read in a shared file from a mouse-based data set that I had. My approach is to calculate a rand shared file, a rand OTU matrix, where I'm effectively saying that I have 200-some samples that are all a statistical sample of one overall community. So statistically, there should be no difference between the samples in my rand matrix. We then calculated a distance matrix using vegdist with Bray-Curtis, without rarefaction, as well as with rarefaction using the avgdist function, where we're drawing 1826 sequences from each of the samples. And then we made the plot. I'll go ahead and run all this and show you what the figures looked like. What we found before is that without rarefaction, in the plot on top, the distance between all pairs of samples varied wildly, from just above zero to about one, and it increased as the difference in the number of sequences increased. So that 2,000 versus 200,000 difference would likely put a point up in the upper right corner of the plot. Whereas if we rarefy the data, we see no relationship between the distance between samples and the difference in the number of sequences in the original communities. So that's a good result. What I want to do now is add two panels to this figure, for relative abundance as well as normalized data. So I'm going to come back up into my code, where I've got this block where I define the variable comparison.
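Before we modify anything, as a refresher, the two baseline distance calculations in the existing script look something like this. This is a sketch: the object names rand_matrix, norare_dist_matrix, and rare_dist_matrix are my stand-ins for whatever the script actually uses, and iterations = 100 is just avgdist's default spelled out.

```r
library(vegan)

# no correction: Bray-Curtis on the raw counts
norare_dist_matrix <- vegdist(rand_matrix, method = "bray")

# rarefaction: avgdist subsamples every sample down to 1826 sequences,
# recalculates Bray-Curtis each time, and averages across the iterations
rare_dist_matrix <- avgdist(rand_matrix, dmethod = "bray",
                            sample = 1826, iterations = 100)
```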
That comparison is kind of my big data frame where I'm comparing the different approaches. So up here around line 80, I'm going to work with my rand data frame to make a relative abundance data frame. Looking at rand, we recall that we've got group, the rand name, and n. I'd like to convert that n into a relative abundance. So I can do group_by(group) and then mutate(relabund = n / sum(n)), and now I've got a column with relative abundances. I'm also still grouping my data, so I'll do an ungroup. Then I'll do a select for my group, my rand name, and my relabund. Now I can pivot this wider to make our data frame: I'll do a pivot_wider where names_from will be the rand name and values_from will be relabund. I forgot an underscore here in relabund; very good. Something else I know I need from last time is values_fill = 0. That's because when we have the long data frame, there are going to be combinations of groups and OTUs, or groups and rand names, that are missing, and when we go wider those would get filled in with NA values. I don't want NA values; I'd rather they be filled in with zeros. Then I need to convert this with as.data.frame, and I'll call it relabund. Now I need to take relabund and turn it into a matrix so I can feed it into vegdist. I'll set the row names with rownames(relabund) <- relabund$group, and then I want to remove the group column, so relabund <- relabund[, -1]. Then I need to make relabund a matrix: relabund_matrix <- as.matrix(relabund). Now I can calculate my distance, so I'll do relabund_dist <- vegdist(relabund_matrix, method = "bray"). I'm going to look back up at the top here to make sure I use the same arguments as for the non-rarefied distance matrix, and to keep my names consistent, I'll call it relabund_dist_matrix. If we look at relabund_dist_matrix, I see that sure enough I've got my distance matrix. The output is pretty hideous, but if I take relabund_dist_matrix and pipe it to str, I can see that it's of class dist, holding all our pairwise distances. So we're in good shape there. Now I need to get this into shape so I can feed it into my inner_join to add to the comparison. Maybe what I'll do is take my no_rare_dist code from up here, bring it down, and modify it for my relabund_dist_matrix. So I'll take relabund_dist_matrix and make relabund_dist_tibble. If I were doing this for real, so to speak, for research purposes, I'd probably turn this into a function, because I'm basically repeating the same chunk of code multiple times; that's not DRY, as we say. But let's look at relabund_dist_tibble, and we see that we've got the sample, the name, and the value. So that's good, and now we're ready to join that into our comparison. We'll do another inner_join, where we take what's coming through the pipeline and join in relabund_dist_tibble with by = c("sample", "name"); I'm giving it both columns because I want to join on both. And if I look at the output of running these two lines, I now see that I have sample, name, value.x, value.y, and value. And so this is getting a little bit confusing.
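Before we clean that up, here is the relative abundance branch pulled together in one place. This is a sketch of the steps I just typed, assuming the tidyverse and vegan are loaded and that rand has the columns group, rand_name, and n (rand_name is my guess at the column's actual name).

```r
library(tidyverse)
library(vegan)

# long counts -> wide relative abundance matrix -> Bray-Curtis distances
relabund <- rand |>
  group_by(group) |>
  mutate(relabund = n / sum(n)) |>  # counts to relative abundances per sample
  ungroup() |>
  select(group, rand_name, relabund) |>
  pivot_wider(names_from = rand_name, values_from = relabund,
              values_fill = 0) |>   # missing group/OTU combos become 0, not NA
  as.data.frame()

rownames(relabund) <- relabund$group
relabund <- relabund[, -1]          # drop the group column
relabund_matrix <- as.matrix(relabund)

relabund_dist_matrix <- vegdist(relabund_matrix, method = "bray")
```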
So maybe we want to clean this up a little bit, and that's sure enough what's on the next line here, right? I need to pipe this into the select, and then I want to add relabund = value. Now if we look at comparison, we see that we have sample, name, no_rare, rare, relabund, and diff. I'm going to remove this code where I was comparing the distances from the two approaches directly; maybe I'll just comment it out for now rather than remove it. Then I want to pivot_longer the columns no_rare (the non-rarefied), rare (the rarefied), and relabund, which gives us the three panels. Let's make three rows instead of two. What I'd also like to do is free up the y-axis, which we can do with scales = "free_y", so the y-axis can vary across the three panels. What we can see is that, as before, the no_rare panel increases with sampling depth, the rarefied panel is flat, and in the relabund panel, as the difference in the size of the two libraries being compared increases, the distance decreases. And that's not great, right? We don't want our estimate of the dissimilarity of two samples to vary by sampling effort. So now let's come back and think about normalization. An alternative to relative abundance would be to normalize the data. We could take relabund and turn it back into a tibble with as_tibble and rownames = "group", which gets us back to basically what we had before we turned it into a data frame, right? Then we could pivot that longer on everything but the group column. Hopefully we're becoming pros at this pivoting longer and wider. So now we have our relative abundances, and we want to multiply them by the size of the smallest library. Way back up here, I had hard coded what I thought was 1826. Let's see if we can put in the actual value. We had a data frame up here, rand_group_count, which looked like this, right? What we really want is to say smallest_group <- min(rand_group_count$n). And if we look at smallest_group, it's 1828, not 1826. Ah, I botched that up there, right? So that should be smallest_group; it's different by two, who cares. But we can now use smallest_group as the normalization factor for our relative abundances. We can add a scaled column with mutate(scaled = value * smallest_group), and now we get group, name, value, and scaled. What we see is that we have fractional numbers, right? 44.9, 49.9, and some of these are smaller. So what we could do is round, say to zero digits to the right of the decimal point, and that gives us integers. So that's cool. Let's see what this does to our overall group counts. We can group_by(group) and then summarize(n_scaled = sum(scaled)), and then we get our scaled totals. Ideally, all of these would be the same size as the smallest group, which again was 1828. But what we see, at least for the first 10, is that none of these are actually 1828. So this process of multiplying by a constant and then rounding has some issues.
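Here is that naive "multiply and round" normalization check as one chunk. It's a sketch built on the objects from the steps above (relabund as the data frame with group row names, and rand_group_count with its n column); the exact totals you see will depend on your data.

```r
smallest_group <- min(rand_group_count$n)

relabund |>
  as_tibble(rownames = "group") |>
  pivot_longer(-group) |>
  mutate(scaled = round(value * smallest_group, digits = 0)) |>
  group_by(group) |>
  summarize(n_scaled = sum(scaled))
# the per-sample totals drift above and below smallest_group because of rounding
```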
To get over this problem, some smart people published a paper, "Improved normalization of species count data in ecology by scaling with ranked subsampling (SRS): application to microbial communities," and they were kind enough to make a package called SRS. So let's use SRS to generate a normalized version of our data frame. First we need to do library(SRS); it's in all caps. If you don't have it installed, make sure it's installed, and then load it with the library function. We can come back down to where I was doing this work with relative abundance. I'm going to comment that out, so it stays in the code if you want to take a look at it later, and the result here is going to be called normalized. Looking at the help page for SRS, I see that the function is SRS, obviously, and it takes data and Cmin: data is the data frame, and Cmin is the number of counts to which all samples will be normalized, so that will be my smallest_group value. But the data frame needs to have columns as samples and rows as the counts of species or OTUs. So I need to transpose, because if I look at rand_df, I've got my OTUs in my columns, and I need the transpose, right? If I do t(rand_df), I now get my OTUs in my rows and my samples in my columns, so I can feed t(rand_df) to SRS. Something I've noticed in the past about the t function is that the output tends not to be a data frame but a matrix. Let's double check that. Sure enough, it is a matrix. So what I'll do is take rand_df, pipe that to the t function, pipe that to as.data.frame, and then pipe that into SRS with Cmin = smallest_group. And now we get normalized, right? Again, we have our samples across the columns and our OTUs down the rows. We lose the names of the rows, but who cares; I'll create an otu column. To do that, we'll do as_tibble with rownames = "otu", and then we can pivot_longer everything but otu. Looking at normalized, we see that yep, we've got otu, name, and value. Then we can group_by(name) and pipe that to summarize(n = sum(value)). And if we look at that, we see everything is 1828. That is good: the normalization worked, and every sample has the same number of sequences, as the authors indicated. So I'm happy with that, and I'll remove those check lines. Now we have our normalized data frame, where the columns are the different samples and the rows are the different OTUs, and I need to transpose it to feed it into vegdist. So we'll do a t on that, and the transposed normalized will be a matrix, right, which is the input to vegdist. So that's good, and we can do normalized_dist_matrix <- vegdist(t(normalized), method = "bray"). Again, this gives us normalized_dist_matrix; piping that to str, we get all that good stuff. And so now we're ready to copy down the code again. As I said, it's not super DRY, but we're exploring here, right? It's like a rough draft of writing a paper. So I'll make normalized_dist_tibble, where we pipe that dist matrix into as.matrix and the rest of the copied steps. If we look at normalized_dist_tibble, we get sample, name, and value. Good; we're now ready to bring this into our overall comparison. So I'll copy this inner join and swap in normalized instead of relabund.
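Here is the SRS branch in one piece. This is a sketch assuming the SRS package is installed and that rand_df and smallest_group exist as above; I'm leaving SRS's seed arguments at their defaults.

```r
library(SRS)
library(vegan)
library(tidyverse)

# SRS wants samples in columns and OTUs in rows, so transpose rand_df first;
# t() returns a matrix, so convert back to a data frame before calling SRS()
normalized <- rand_df |>
  t() |>
  as.data.frame() |>
  SRS(Cmin = smallest_group)

# sanity check: every sample should now sum to exactly smallest_group
normalized |>
  as_tibble(rownames = "otu") |>   # row names were lost, so otu is just 1, 2, ...
  pivot_longer(-otu) |>
  group_by(name) |>
  summarize(n = sum(value))

# transpose back so samples are in rows, then calculate Bray-Curtis distances
normalized_dist_matrix <- vegdist(t(normalized), method = "bray")
```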
In the copied join I'll swap in normalized_dist_tibble, again joining by sample and name. The column names are going to change a little bit, so I'll run the first three lines. We get value.x and value.y, which are good for the no_rare and rare columns; value.x.x becomes relabund; and then we also need normalized = value.y.y. Let's double check that it all looks good. I need a period between the two y's, and sure enough, we now have the right column names, so we can pipe that into the rest of comparison. Then down here we need to add normalized to our pivot_longer, and we also now need four rows. What we see is that our normalized data looks a heck of a lot like the relative abundance data, in that the variation in the distances decreases as the difference in the number of sequences per sample increases, right? So that 2,000 versus 200,000 comparison would be way out here; that's probably a bit of an extreme example, but those two samples would look really similar to each other, whereas 2,000 versus 2,000 would look more different from each other. And that's not right, because these are samples from the same overall community structure; there really shouldn't be a difference. Now, because we're sampling overall communities, we are not going to get zero distances, because each sample is not a perfect representation of the overall community. We'll come back to this issue of worrying about the impact of our metrics varying by sampling effort later. One thing I'll have you keep in mind: imagine there was some reason that for one treatment group you didn't get a lot of sequences out of those samples, but for another treatment group you did. I've had this case where we were sequencing lungs and comparing them to people's mouths. It's a lot easier to get sequences out of a person's mouth than their lungs; in the lungs there's a lot of contaminating DNA, and so you're going to get fewer reads, or we did get fewer reads, in lungs than mouths. If there was no difference in the community structures between the lungs and the mouths, my concern is that we would still see differences if we used something like relative abundance, normalization, or no rarefaction, because the apparent difference in the communities is going to be driven by how deeply we sampled them, not their biology. So we'll come back to that. In the next episode, I want to explore a little further how the number of reads we have impacts the distance between the samples, as well as the variation in those distances. So we'll do a bit more exploring with this rarefaction data to better understand how it's working. So that you don't miss that episode, please be sure that you subscribe to the channel and have clicked that bell icon. I know this series isn't so much a tutorial as a demonstration of how to use a lot of principles from ggplot2 and dplyr that we're very familiar with to explore an ongoing question in the field. I really hope you feel empowered by this, right, that we can use R to answer questions about how we analyze our data. Perhaps it's a little meta, but it also gives you a chance to see how I explore data with these tools. We'll see you next time for another episode of Code Club.