Hey folks, if you watched my last episode of Code Club, you know that I talked about how we can use vegdist and avgdist from the vegan package to calculate ecological distances between different sets of samples. In my case, I'm looking at the fecal microbiota from a collection of mice that we sampled over the course of a year, about 10 or so years ago. The difference between vegdist and avgdist is that avgdist does a rarefaction step, where it draws the same number of individuals from each sample, calculates, say, the Bray-Curtis distance, then puts those sequences back, re-samples them, calculates the distances again, and repeats that about 100 or 1,000 times. The AVG in avgdist means it calculates the average distance across those replicates, whereas vegdist does a single distance calculation using all of the data you provide it.

In all of my papers, we rarefy our data when we are doing alpha and beta diversity. Many other, more vocal people in the microbiology field say that this is problematic. So what I would like to do is take a step back and use the R skills that we've been developing here on Code Club to explore this a little bit further. I'll start out in today's episode by looking at beta diversity with and without rarefaction. In the next episode, we'll look at a couple of different approaches that people suggest for normalizing the data, or for getting around this problem, or so-called problem, of rarefaction. And then in the third episode, we'll look at alpha diversity metrics. And who knows, we might do one or two more episodes as well.

My contention is that we need to treat all of the samples on the same basis. Now, people that do normalization claim, yes, we're doing the same thing, but I think that normalization actually introduces an artifact that we don't see with rarefaction. People that oppose rarefaction say, well, you're throwing away a lot of data. And I'd say, I don't think you're actually throwing away data; you're using all that extra data to help you calculate an average for a lower number of sequences, right? And so, again, the challenge we face is that a lot of the metrics we use are sensitive to sampling effort.

My general critique of papers that have talked about rarefaction in the past is that they use very idealized communities. In general, they might take five or six communities out of a whole bunch, and those communities might all have a very similar number of reads per sample. They then kind of use those to build out additional communities that all, again, have a very similar number of reads. We know from previous experience that the number of reads per sample in a study can typically vary by at least one and maybe two orders of magnitude. That's quite a bit of variation in the data. The other thing people tend to do is use real data, which is good, right? We want to use real data, but they use real data without knowing what the actual effect size is.

So what I would like to do is use real data, with real distributions of the number of reads across samples, and I would like to know what the community structure really is. To do that, I'm going to build what's called a null model. I'll take my shared file, where, again, I have the number of sequences in each sample, and I'll also know how many times each taxon shows up across all the samples.
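Before we get into the null model, here's a minimal sketch of the vegdist/avgdist distinction on a toy OTU table; everything here (the matrix, the rarefaction depth, the iteration count) is an illustrative stand-in, not my mouse data:

```r
library(vegan)

# Toy OTU table: 4 samples (rows) by 50 OTUs (columns); seed is arbitrary
set.seed(1)
otu_matrix <- matrix(rpois(200, lambda = 10), nrow = 4,
                     dimnames = list(paste0("sample_", 1:4),
                                     paste0("Otu", 1:50)))

# vegdist: a single Bray-Curtis calculation using all of the data as given
single_dist <- vegdist(otu_matrix, method = "bray")

# avgdist: subsample each row down to `sample` sequences, calculate
# Bray-Curtis, and average the distances across `iterations` random draws
avg_dist <- avgdist(otu_matrix, sample = 400, iterations = 100,
                    dmethod = "bray")
```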
Back to the null model: I could then randomize the counts across all OTUs and all samples so that I maintain the row and column sums, right? If this doesn't quite make sense, hold on, it will soon. What that will mean is that if I take two samples and compare them, they should be very similar to each other; in an ideal world, they should be identical, right? Because they're effectively being sampled from the same community distribution. But the benefit of my approach is, again, we get that natural variation in the number of reads per sample, as well as some noise thrown in.

To do that, we're going to start with distances.R from the last episode. Down below in the description, there's a link to a blog post for this episode that will give you the starting point and ending point of my GitHub repository for this series of episodes. Also, across the top here, I'll put a video link that will help you figure out how to do all that, right? Anyway, we've got distances.R. This was the script we used in the last episode, and I'll build off of it in this episode. It reads in the data from the shared file, mice.shared. Again, this is an OTU table where the rows are different samples, the columns are the different taxa, and the values are the number of sequences, the number of times we saw each taxon in each sample. In the previous episode, we converted it to a matrix, which then was the input to avgdist as well as vegdist.

I'm going to go ahead and load the tidyverse and vegan libraries so we have all that great functionality at our disposal as we work through this project. I'll also load the days_wanted variable, the samples I want from each of the mice. Now, I want to look at this pipeline for generating shared, which in this case generated a data frame, and I want to keep it as a data frame, because I want to use the observed shared file as the basis for creating my null model. So let's see: if we come all the way down through this select where we remove the total column, I want to see what the data look like. We see that we have the group, which is the sample name; the name, which is the OTU name; and then the value, which is the count. I think that's exactly where I want to be, so I'm going to go ahead and remove these other lines and get rid of everything else. This then will be read in as the shared object. Again, if I look at shared, I see I've got my group, my name, and my value.

What I'd like to do is take each row of this data frame, namely the first two columns, and repeat that row by the number in the value column, right? So I want to have 115 rows where the value in the group column is F3D0 and the value in the name column is Otu0001, right? I'll get 115 of those, then 26 of the following, and 37 of the following, right? Then what I can do is randomize the order of the name column and recount everything. That will preserve the number of individuals in each group, and it will preserve the number of individuals in each OTU. It'll give me the guts, if you will, to make a new shared file that will represent my null model. So to do this, we'll take shared, and I will use a new function that I don't think I've talked about before, called uncount, and we will do that based on the value column.
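Here's a rough sketch of that setup; the select() below is a stand-in for the fuller pipeline in distances.R, and the Group/Otu column layout is an assumption based on the mothur-style shared file described above:

```r
library(tidyverse)
library(vegan)

# Read the shared file and reshape it to long form: one row per
# (sample, OTU) pair with its count
shared <- read_tsv("mice.shared") %>%
  select(Group, starts_with("Otu")) %>%
  pivot_longer(-Group, names_to = "name", values_to = "value")

# uncount() repeats each row by its weight: a (Group, name) row with
# value = 115 becomes 115 identical rows, one per observed sequence
expanded <- shared %>%
  uncount(value)
```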
Now that we've unrolled the shared data frame, what I'd like to do is randomize the order of the values in the name column. So we'll pipe that to a mutate, and I'll create a rand_name column, which will be the sample function applied to the name column. To quickly show you: if I do sample(1:10), I get those 10 numbers randomized without replacement, right? Each number only shows up one time. That's exactly what I want to do here. With my real data, now that my rand_name column has been randomized, I can pipe this into a count on group and rand_name. You know what, first I'm going to do a select to remove the name column. So now we see that F3D0 (female 3, day 0) and Otu0001 showed up 131 times, whereas, I think, way back up here, it showed up 115 times. I'm going to go ahead and call this object rand.

To prove to myself that my randomization did what I think it did, I'm going to take shared, group by group, and summarize n as the sum of value, summing all the sequences in each of my samples, right? And I can see, for example, 6,229, right? So that's good. I'm going to call this shared_group_count, and I'll do the same thing with rand: I'll grab this, bring it down, and call it rand_group_count, with rand piped in, laid out the same way, near each other. And I'm getting an error. Let's see, what should the rand version sum? Because count gave us an n column, it's n, not value, so I'll fix that. Now if I do an inner join on shared_group_count and rand_group_count by group, I can see that the values in the two data frames are the same. So this checks out.

I could do the same type of thing, but instead of grouping by group, I can group by the OTU. Again, I think that was name, yeah, name in shared, and I'll call this shared_otu_count. For rand, I think it's something different; that's rand_name, right? So I'll group by rand_name, and this will be rand_otu_count, replacing group with the OTU column. And I know it's not the group column that I want to join on this time, it's going to be name, and that's rand_name here, so name equals rand_name. And again, the counts for each of my OTUs are the same, right? So I have effectively created a null model where I'm preserving the number of sequences in each sample as well as the number of sequences in each OTU. Effectively, each sample is a statistical sample, if you will, of an overall community distribution, where I'm preserving the total number of sequences across each OTU and the total number of sequences in each sample.

To get a sense of what that distribution looks like for our samples, I can take rand_group_count, pipe that to ggplot with aes where x will be n, and then do geom_histogram. Now we can see the distribution of the total number of sequences per sample. At the low end it was about 1,828, and the high end was over 30,000, so that's about a 20-fold variation in the total number of sequences we had in each of our samples.
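Pulling those steps together, here's a sketch of the shuffle and its sanity checks, continuing from the expanded data frame above; the seed is my addition, just to make the shuffle reproducible:

```r
set.seed(42)  # arbitrary seed so the permutation is reproducible

# Permute the OTU labels across every sequence, then re-tally the counts
rand <- expanded %>%
  mutate(rand_name = sample(name)) %>%
  select(-name) %>%
  count(Group, rand_name)

# Check 1: sequences per sample (row sums) should be unchanged
shared_group_count <- shared %>% group_by(Group) %>% summarize(n = sum(value))
rand_group_count <- rand %>% group_by(Group) %>% summarize(n = sum(n))
inner_join(shared_group_count, rand_group_count, by = "Group")

# Check 2: sequences per OTU (column sums) should also be unchanged
shared_otu_count <- shared %>% group_by(name) %>% summarize(n = sum(value))
rand_otu_count <- rand %>% group_by(rand_name) %>% summarize(n = sum(n))
inner_join(shared_otu_count, rand_otu_count, by = c("name" = "rand_name"))

# Distribution of total sequences per sample
rand_group_count %>%
  ggplot(aes(x = n)) +
  geom_histogram()
```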
So now we have our rand data frame, right? And we want to turn it into a matrix. We can pivot this wider: we'll do pivot_wider with names_from = rand_name and values_from = n. Now we can see that our samples are in the rows and our columns are the OTUs. As we've seen before, we want to convert this to a matrix so we can feed it into vegdist and avgdist, but a matrix can't have different types of columns, and here we have group as a character and the OTUs as integers. So we need to convert this to a data frame, remove that group column, and then make it a matrix. We can pipe it into as.data.frame, and I'll call this rand_df. Then we can do rownames(rand_df) <- rand_df$Group, and then rand_df <- rand_df[, -1] to remove the first column, which is the group column. Now if I look at rand_df, there's a whole bunch of columns, we know, but that first column is Otu0001 rather than group, and we're in good shape: rand_df is a data frame where the columns are all of the same type. We can then make rand_matrix with as.matrix on rand_df, and do str(rand_matrix) to see that rand_matrix, sure enough, is a matrix: a vector of integers that has two dimensions, which is, again, what a matrix is. We've got our row and column names, so we're good to go to feed this into vegdist as well as avgdist.

The first distance I'm going to calculate I'll call norare_dist, and that will be vegdist on rand_matrix with method = "bray". But I'm getting an error message that there are NA values in my matrix. I think what's happening is back here in pivot_wider: I'm going from rand, which is a long, narrow data frame, to a wide, short data frame, and to make that transition, there were values missing in the long version that need to be populated in the wide version. By default, pivot_wider fills in those missing values with NAs; I would rather those be filled in with zeros. So we'll come back up here and do values_fill = 0, so that instead of NA values it fills with zeros. We'll run these again, and sure enough, it runs without any errors.

So now we have our no-rarefaction distance matrix, right? I'm actually going to call this norare_dist_matrix. And I'm going to do the same thing for rare_dist_matrix, which will be avgdist on rand_matrix with dmethod = "bray", and then sample = 1826, which was the size of the smallest library in this data frame. So now I have my two distance matrices: norare_dist_matrix, calculated without rarefaction, and rare_dist_matrix, calculated with rarefaction, rarefying to 1,826 sequences per sample.
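In code, that stretch looks roughly like this, continuing from the rand object above; the depth of 1,826 follows the smallest library size I mentioned:

```r
# Wide table -> data frame with sample rownames -> integer matrix
rand_df <- rand %>%
  pivot_wider(names_from = rand_name, values_from = n, values_fill = 0) %>%
  as.data.frame()

rownames(rand_df) <- rand_df$Group
rand_df <- rand_df[, -1]          # drop the Group column
rand_matrix <- as.matrix(rand_df)

# Bray-Curtis without and with rarefaction
norare_dist_matrix <- vegdist(rand_matrix, method = "bray")
rare_dist_matrix <- avgdist(rand_matrix, dmethod = "bray", sample = 1826)
```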
Now I want to take these two distance matrices and compare the distances to each other, to see if they're the same or wildly different. I'll start with norare_dist_matrix. I'll turn it into a regular matrix, because it's currently a dist object, so I'll do as.matrix, and then as_tibble with rownames = "sample". Now I have a tibble where the first column is sample, with all of the sample names, and the column names are also the sample names. I want to make this tidy and only look at the lower triangle of the distance matrix, so I can do pivot_longer(-sample), which gets me my sample, name, and value. I will then filter for name less than sample, which gives me the lower triangle of that distance matrix. So now I've got my norare_dist_matrix tidied, and I'm going to call this norare_dist_tibble. I'll do the same thing for rare_dist_tibble, so I'll go ahead and turn these norares into rares. This isn't super DRY code, but it's okay.

So now what I've got are my two tibbles, and I want to join them together, keeping in mind that my two key columns are sample and name. We can do inner_join on norare_dist_tibble and rare_dist_tibble, joining the two data frames on the sample and name columns. I think I've shown you before how to join two data frames on a shared column, and I've shown you how to join two data frames when the equivalent column has different names in the two data frames. This is a little bit different: we're going to join on two columns, right? So we'll do by = c("sample", "name"). This joins the two data frames together, so we get sample, name, and the two values. We can rename these by doing a select: select sample and name, and then norare = value.x and rare = value.y. So now I've got my cleaned-up names, and we're ready to plot something, right?

Let's pipe this into ggplot with aes; we'll look at the two distances on the x and y axes, so x = norare and y = rare, and I'll do a geom_point. Now we have the rarefied distances on the y axis and the non-rarefied distances on the x axis. And whoa, there's a lot of data, right? But what you can hopefully see is that the average for rare is at about 0.115 to 0.12, and it's a pretty narrow band, from, you know, 0.095 up to 0.15, right? So it's not zero, but it's fairly tight, especially compared to the non-rarefied data that goes from, I don't know, say zero up to one. The non-rarefied distances are huge in comparison to the rarefied distances. The other thing I'm noticing, and again there's a lot of overplotting here, is that the rarefied distances all average about the same, regardless of what the non-rarefied data look like, right? So to me, that is a win. I can make this look a little bit prettier with size = 0.25 and alpha = 0.25, and I'm going to throw a geom_smooth on here. The smooth blue line really shows that those rarefied distances aren't changing much at all across the different values of the non-rarefied distances. That's pretty interesting, right?

The next thing I want to look at is whether this is being driven by the difference in the number of sequences in the two samples being compared. So let's take our data frame back here; I'll call this comparison, and then comparison can be fed into this other pipeline. If I look at comparison, I've got the columns sample, name, norare, and rare. I want to add in the number of sequences for the sample column as well as for the name column, so I will inner_join in my rand_group_count: I'll take the data coming through the pipeline and join rand_group_count to it with by = c("sample" = "Group"). Now I have the n corresponding to the sample column. I need to do another inner_join with rand_group_count, this time with by = c("name" = "Group").
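Consolidating that stretch, here's a sketch that tidies each dist object, joins the two sets of distances, attaches the library sizes, and draws the scatter plot:

```r
# Tidy each dist object down to its lower triangle
norare_dist_tibble <- norare_dist_matrix %>%
  as.matrix() %>%
  as_tibble(rownames = "sample") %>%
  pivot_longer(-sample) %>%
  filter(name < sample)

rare_dist_tibble <- rare_dist_matrix %>%
  as.matrix() %>%
  as_tibble(rownames = "sample") %>%
  pivot_longer(-sample) %>%
  filter(name < sample)

# Join on both key columns, clean up the names, then attach the number of
# sequences for each member of the pair (yielding n.x and n.y)
comparison <- inner_join(norare_dist_tibble, rare_dist_tibble,
                         by = c("sample", "name")) %>%
  select(sample, name, norare = value.x, rare = value.y) %>%
  inner_join(rand_group_count, by = c("sample" = "Group")) %>%
  inner_join(rand_group_count, by = c("name" = "Group"))

# Rarefied vs non-rarefied distances for every pair of samples
comparison %>%
  ggplot(aes(x = norare, y = rare)) +
  geom_point(size = 0.25, alpha = 0.25) +
  geom_smooth()
```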
This then should give us an n.x and an n.y, where n.x corresponds to the sample column and n.y to the name column. What I really want is not so much n.x and n.y as the difference between them, so I'll do mutate(n_diff = n.x - n.y), and just in case n.y is bigger than n.x, I'm going to wrap this in the absolute value, abs(). Now if I look at comparison, I get my columns of sample, name, norare, rare, and n_diff. Because I'm not interested in those n columns anymore, I'm going to do a select with -n.x and -n.y. Now I'll come down to my ggplot and do color = n_diff. Here I'm using color as a continuous variable, so it's a bit hard to see, but I have darker points, I think, to the left and lighter points to the right.

I would like to make a faceted plot where the x axis has the difference in the number of sequences between the two samples being compared, the y axis has the distances, and there are two facets: one for rarefied and one for non-rarefied data. So I'll take comparison and pipe it to pivot_longer so I can make my data frame tidy, with cols = c(norare, rare), names_to = "type", and values_to = "dist". Now I can see I have the sample and the name, which I don't really care about, plus the n_diff, the type, and the dist. I can pipe that to ggplot with aes, x being n_diff and y being dist, and then do geom_point. Again, I'll use size = 0.25 and alpha = 0.25; the alpha is the transparency of each point. Then I'll add facet_wrap with a tilde and the type. This gives us the facets side by side, and I think we can already see that it looks very interesting, so let's stack them in a column: we'll do nrow = 2.

This is really fascinating. If we don't rarefy before we calculate our distances, then the distance between the samples is driven largely by the difference in the number of sequences in the two samples. If we do rarefy our data, like we see here, we get a very uniform distance, right around 0.11, I would say, and it doesn't vary with the difference in the number of sequences in the two samples being compared, right? So again, rarefaction is doing its job of controlling for uneven sampling effort. This is really cool, right? To me, this illustrates that with rarefaction, we have removed the dependency of that variable, the Bray-Curtis distance, on sampling effort, whereas if we don't rarefy, well, then we get this disaster, right?

In the next episode, we're going to compare the rarefied results to alternatives that people have proposed, things like looking at relative abundance or normalizing the data. I hope you found it valuable to see how I think through a problem: how to build this null model, how to analyze it, and how to make sure it's doing what I think it's doing relative to that observed community distribution data frame, right? I think the only new function I showed you today, if you've been with me for the long haul of Code Club, is the uncount function, right? So there is one new function out of all the functions we went through today. Be sure that you subscribe and click that bell icon so you know when that episode is released, and we'll see you next time for another episode of Code Club.
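For reference, here's a sketch of that last stretch, picking up from the comparison data frame built above:

```r
# Difference in library size between the two samples in each comparison;
# n.x and n.y come from the two rand_group_count joins above
comparison <- comparison %>%
  mutate(n_diff = abs(n.x - n.y)) %>%
  select(-n.x, -n.y)

# One facet per distance type: per the discussion above, the rarefied
# distances stay flat across n_diff while the non-rarefied ones climb
comparison %>%
  pivot_longer(cols = c(norare, rare), names_to = "type",
               values_to = "dist") %>%
  ggplot(aes(x = n_diff, y = dist)) +
  geom_point(size = 0.25, alpha = 0.25) +
  facet_wrap(~type, nrow = 2)
```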