Hey folks, if you've been watching recent episodes of Code Club, you know that I've been spending way too much time thinking about how to calculate Bray-Curtis distances between different communities. In my case, I'm looking at the microbial communities that we sampled out of mice at our research facility here at the University of Michigan. And so the question is, do we rarefy? Do we not rarefy? Do we normalize? Do we use relative abundances? Well, in the last episode, I compared the pairwise distances between random data. In other words, I took the original mouse data, pooled it all together, and then redistributed those counts randomly so that every sample represented a statistical sampling of that composite community. If that doesn't quite make sense, basically I have the same number of samples with the same number of sequences per sample, but instead of representing individual mice and individual time points from the study, they represent an overall sample of what the full community might look like. It's a simulation, and in this simulation, it's what's called a null model, so we don't expect to see any differences between the different samples. This is useful for exploring questions like how you calculate an ecological distance between samples and what properties you find when you look at different components or factors of those distances.

Well, in the last episode, I looked at what happens when you don't rarefy, when you do rarefy, when you use relative abundances, and when you normalize the data. This was the figure we arrived at by the end of the last episode. You can see that with relative abundance and normalization, the data looked very similar. Each point here represents a different pairwise comparison among about 225 different communities. The rarefied data, on the other hand, are fairly consistent across the full range of differences, and again, the x-axis is the difference in the number of sequences between each pair of samples. The data that aren't rarefied I don't even want to talk about, because as you increase the number of sequences that differ between the two samples you compare, the distance between them climbs really high. But I might want to know, is that drop-off actually real? The normalized data look like a big black mess here. Is the distance actually decreasing, or is it flat? Because we have this big cloud of points for comparisons where the difference in the number of sequences between the samples is less than about 5,000, or even 1,000, it's a big dense mass, and it's hard to know whether the variation there is a lot wider than it is further out, where there's a larger difference in the number of reads per sample.

So what we're going to talk about in today's episode is how we can take that x-axis, n_diff, and discretize it, so that we have a categorical variable on the x-axis where each bin represents a window of differences in the number of sequences between the two samples being compared. We'll do all that in our distances.R script. If you want to get a hold of this script as well as the data I'm working with, down below there's a link in the description to a blog post. Also, up here I'll put a link to a video that will show you how you can use all the information in the blog post to get caught up to where we are today.
So let's go ahead and look at this comparison data frame. As you'll see, the first column is sample and the second is name. These are poorly named columns that basically indicate the two samples we're comparing to calculate distances, Bray-Curtis distances calculated four different ways. In no_rare, there's been no rarefaction done. In rare, we used rarefaction: we grabbed a certain number of sequences from every sample, calculated distances, put the sequences back, repeated that something like 100 times, and then calculated the average. So that's what's going on in that rarefied column, that's a sampling of the same number of sequences from every sample, repeated something like 100 or 1,000 times. In rel_abund, we don't rarefy; we calculate the relative abundance on the non-rarefied data and then calculate Bray-Curtis. In normalized, we normalize every sample to have the same number of sequences and then calculate Bray-Curtis. And then n_diff is the difference in the number of sequences between the samples named in the first two columns. So, in other words, the difference in the number of sequences between F3D0 and F3D1 was 1,531 sequences, okay? So again, that's the interpretation of the data.

The first thing I want to do is actually step back from these plots where we're plotting n_diff on the x-axis, and instead make a density plot of these distances to see how symmetric the data are. We've never really done that. So I'm going to grab these first two lines where we pivot those four distance columns longer so I can treat them as separate variables. What we see now is that we have sample and name, which we really don't care about, the difference in the number of sequences between those two samples, the type of calculation, and then the distance. So then we can feed that to ggplot() with aes(). On the x-axis I'll put dist, then I'll do fill = type, and then I'll add geom_density(). What we have now is a density plot. This is effectively the same thing as a histogram, except that geom_histogram() stacks the values when you have multiple types like we do here, whereas geom_density() puts the curves behind each other so that everything comes down to zero on the y-axis. This is a bit hard to see, so let's zoom in on our density plot. I'll do that with coord_cartesian(), setting ylim from zero to 50 and xlim from zero to, say, 0.25. That zooms in pretty well. You can see the red in the background for the non-rarefied data and my teal spike here for the rarefied data. It's a bit hard to see in this case, but the relative abundance and normalized curves are basically on top of each other. What we could do is say alpha = 0.5 and color = NA, and now we can kind of see that there are two curves here, the green and the purple ones, overlapping with each other. Maybe it's a little easier to see if we add theme_classic(). Yeah, again, they're right on top of each other; it's going to be really hard to differentiate those two curves. Anyway, what we definitely see is that the rarefied data have a lot less variation but a higher average distance, whereas the relative abundance and normalized data have a lower average but a wider distribution. It's also not clear whether these curves are symmetrical.
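In code, that density plot looks roughly like this. It's a minimal sketch based on the walkthrough above; the data frame name (comparison) and the column names (no_rare, rare, rel_abund, normalized, n_diff) are my guesses at what the script uses, so adjust them to match your own objects.

```r
library(tidyverse)

comparison %>%
  # treat the four distance calculations as a single "type" variable
  pivot_longer(cols = c(no_rare, rare, rel_abund, normalized),
               names_to = "type", values_to = "dist") %>%
  ggplot(aes(x = dist, fill = type)) +
  geom_density(alpha = 0.5, color = NA) +                 # semi-transparent, unstacked curves
  coord_cartesian(xlim = c(0, 0.25), ylim = c(0, 50)) +   # zoom in without dropping any data
  theme_classic()
```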
I kind of feel like the shoulder on the right side of the curve is a little thicker for the relative abundance and normalized data than it is on the left side, which would tell me that there's perhaps something going on where we're getting larger distances than we'd expect, right? So let's return to this plot where, again, we have the difference in the number of sequences between the two samples on the x-axis and the distance on the y-axis. Again, it feels like those values are dropping off. One way we could look at this, at least visually, would be with geom_smooth(). By default, geom_smooth() adds a smoothed curve through the data. What we can see is that for the normalized and rel_abund data, the curves do seem to be falling, although further out, where there are perhaps not as many points, they pull back up, whereas the curve for the rarefied data is pretty dead flat, right?

So what I'd like to do is look at this in a bit more detail by breaking the x-axis into bins, so that each bin might only span a difference of, say, 500 sequences between the two samples being compared. For each of those bins I could calculate the mean and the standard deviation, and then look at how the mean and the standard deviation vary over the spectrum of comparisons I have, right? To do that, I'll go into my code here and remind you what we have, looking at lines 140 to 142. We again have the sample, the name, which we don't care about, n_diff, the difference in the number of sequences, the type of distance, and then the distance itself. This n_diff column is what I want to discretize into bins. To do that, I can do mutate(diff_cat = cut_width(n_diff, width = 500)). If we run these three lines, what we see is that we have everything we had before, but now we have this extra diff_cat column. This is a factor, a categorical variable. First of all, each value is a range, right, say between 1,250 and 1,750 sequences, and sure enough, 1,531 is in that range. The parenthesis on the left means the interval is open, so it doesn't include 1,250, but it's closed on the right with that square bracket, meaning it does include 1,750, okay? Something else you could do is pipe this to count(diff_cat) to count the number of cases in each level of diff_cat. What we get is a count of the number of distances that fall into these different categories of differences in the number of sequences. What we see right away is that these categories are centered on zero, plus or minus 250, to give a range of 500. I don't want plus or minus 250; I want the first bin to start at zero and go to 500. I can easily change that by adding boundary = 0 to the arguments of cut_width(). Now when I look at my output, I see that, sure enough, the first bin starts at zero and goes to 500. There's a variety of cut_ functions: cut_interval(), cut_number(), and cut_width(). Again, we're using cut_width(), where we define the width that we want and whether we want the bins centered on a value or bounded at a value. There's also cut_interval(), which makes a certain number of groups of equal range.
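Here's roughly what that binning step looks like, again as a sketch with the same assumed object and column names as above:

```r
comparison %>%
  pivot_longer(cols = c(no_rare, rare, rel_abund, normalized),
               names_to = "type", values_to = "dist") %>%
  # bin the difference in library sizes into 500-sequence-wide categories,
  # with the first bin starting at 0 rather than being centered on 0
  mutate(diff_cat = cut_width(n_diff, width = 500, boundary = 0)) %>%
  count(diff_cat)
```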
cut_number() will allow you to make n groups that all contain the same number of observations. What you want to use is up to you, but for my purposes, I want to specify the width of each of my slices, each of my categories, and move on from there. Great, so now we have our discrete values. What I'd like to do is use those discrete categories to group and summarize our data, calculating the mean, the standard deviation, and the number of comparisons in each category. So I could do group_by(diff_cat), but I also want to group by type, because I want to keep the types of distance separate as well as the categories separate. Then we'll do summarize(), calculating mean as the mean of dist, sd as the standard deviation of dist, and then n with the n() function. The n() function doesn't take any arguments, and it tells you how many observations there are in each group. Sure enough, what we see is that we've got our difference category, zero to 500, for our four different types of distances, plus our mean, our standard deviation, and our n. That's great. The output is still grouped by diff_cat; I'd like to get rid of that so I always know what things are being grouped by, and I can do that with .groups = "drop". Sure enough, we no longer have that grouping.

I can then pipe this into ggplot(). On the x-axis I'll put diff_cat, on the y-axis let's put the mean, and this will be a geom_line(). Because I'm using geom_line(), I need to group by type, and let's also color by type. I'll remove the facet_wrap() and the geom_smooth(). What I find is that I have my mean distance, with those categories on the x-axis, and the mean distance for the non-rarefied data is out of control. I think I'll go ahead and remove that, because I think we've established that not rarefying really isn't an option. To remove it, I can come back here, filter on type != "no_rare", and pipe that into everything else. Let's also add theme_classic(), because otherwise we're going to get these crazy grid lines in the back that are just going to be annoying. Now we have a cleaner line plot. On our x-axis we have those difference categories; the labels are a jumbled mess, and we should come back and clean that up later. I'm not going to worry about it for this episode, but with scale_x_discrete() or scale_x_continuous(), you should be able to figure out how to tidy that up a bit. So again, the green line is the rarefied data, nice and flat at around 0.12. The normalized and rel_abund distances do drop as the difference in the number of sequences between the two samples being compared goes up, but then they go back up. I'm thinking this probably has a lot to do with the n. We could just plot a line for the n, but what I'd rather do is take the summarized output, pivot it longer, and then facet by our three different metrics. So let's remind ourselves what the data look like at this point: we've got diff_cat, type, mean, sd, and n. So I'm going to want to do pivot_longer(), and the columns I want to pivot longer are mean, sd, and n. Now we've got our difference category, the type, the name, and the value. That's good. We could go ahead and put in fancier names for the name and value columns, but I think we're in good shape.
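Putting that together, the summarizing step and the first line plot might look something like this (same assumed names as before; the "no_rare" label in the filter is also an assumption):

```r
comparison %>%
  pivot_longer(cols = c(no_rare, rare, rel_abund, normalized),
               names_to = "type", values_to = "dist") %>%
  filter(type != "no_rare") %>%                          # drop the non-rarefied distances
  mutate(diff_cat = cut_width(n_diff, width = 500, boundary = 0)) %>%
  group_by(type, diff_cat) %>%                           # keep distance types and bins separate
  summarize(mean = mean(dist),
            sd = sd(dist),
            n = n(),
            .groups = "drop") %>%                        # remove the residual grouping
  ggplot(aes(x = diff_cat, y = mean, group = type, color = type)) +
  geom_line() +
  theme_classic()
```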
So then on the x-axis I still want diff_cat. On the y, instead of the mean, I want value. I'm going to group by type, color by type, and do geom_line(), but then I also want to do facet_wrap() on name, the type of metric, and I'll set nrow = 3. What we see is that the y-axis goes up to 2,000 on all of these, because it uses the same y-axis scale for all the values. I don't want that, so we'll do scales = "free_y". And you know what, I'm also going to modify this because I'd rather have the n at the top. It probably doesn't matter a whole lot, but it makes me feel better. So after my pivot_longer(), I'll add a mutate() where I take the name column, make it a factor, and set the levels to n, mean, and sd. I feel like I've got way too many parentheses here. Yeah, pipe that into ggplot(). Now what I see on the top is my n, then my mean, and then my standard deviation. Somewhere along the way I lost my theme_classic(); that's okay. What we can see is that the n really does fall off out here, and where it falls off, the mean value for the normalized and rel_abund data starts going a little haywire. So maybe I'd like to focus on those cases where the n is, say, greater than 100. I can come back to my code, and between my summarize() and pivot_longer() I can add a filter for n > 100. That zooms things in a bit for us.

I think my takeaways, at least for this particular data set (it'd be interesting to try this with other data sets), are these. For this data set, when we rarefy the data, the mean distance is rock solid. It's right around 0.12, and it isn't going to vary just because the two libraries we're comparing for each distance are different sizes, right? Whereas we do see a fall-off, a drop in the distance, for the normalized as well as the relative abundance data when we calculate those distances. But when we look at the standard deviations, the standard deviation for the rarefied data, I mean, it's not as rock solid as the mean is, but it's pretty solid and pretty tight, right? There's not a whole lot of variation between our distances. Again, we are grabbing something like 225 or 250 samples from a null distribution, so we would really hope these would be very consistent. Whereas when we look at the normalized and relative abundance data, they're pretty erratic and the standard deviation is actually a bit higher. So the take-home, I guess, is that the average distance when you rarefy is a bit higher, but the variation is much lower; whereas the variation for normalized and relative abundance data is a lot higher, but the average distance is lower. And both of these metrics, the mean and the standard deviation, for the normalized and relative abundance data are being driven in some part by the total number of sequences in the two libraries. If they differ in the number of sequences, then you get, I guess, a drop in the average, which seems a bit weird, and kind of a drop in the standard deviation as well. So for my money, for my use, I'm happy with having a perhaps slightly higher distance but much less variation in the data. And that's another reason I think rarefaction has something going for it over normalizing or using relative abundance data.
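The faceted version with the n > 100 filter might look roughly like this, continuing with the same assumed names:

```r
comparison %>%
  pivot_longer(cols = c(no_rare, rare, rel_abund, normalized),
               names_to = "type", values_to = "dist") %>%
  filter(type != "no_rare") %>%
  mutate(diff_cat = cut_width(n_diff, width = 500, boundary = 0)) %>%
  group_by(type, diff_cat) %>%
  summarize(mean = mean(dist), sd = sd(dist), n = n(), .groups = "drop") %>%
  filter(n > 100) %>%                                    # keep only well-populated bins
  pivot_longer(cols = c(mean, sd, n),
               names_to = "name", values_to = "value") %>%
  mutate(name = factor(name, levels = c("n", "mean", "sd"))) %>%  # put n in the top panel
  ggplot(aes(x = diff_cat, y = value, group = type, color = type)) +
  geom_line() +
  facet_wrap(~ name, nrow = 3, scales = "free_y") +
  theme_classic()
```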
One other thing I want to show you while we're looking at this particular faceted plot is that I noticed the y-axis doesn't go down to zero, and perhaps you'd want it to for some of these panels. What we can do is add coord_cartesian() and set ylim = c(0, NA), a vector going from zero to NA. What that does is make zero the lower bound on the y-axis for all three panels and let the upper bound be set by the data. By setting that lower bound, we get a little more context for the differences in our mean and standard deviation across these different approaches. Again, I'm pretty happy with the performance of the rarefied data, at least for this data set. It would be interesting to see what it looks like for a different data set. I'll leave that for you to try, maybe with your own data. See if you can replicate what I'm doing here with my mouse data using whatever data you're working on for your project; I'd be interested to see how it compares to what I'm finding here. That'd be a pretty cool project. Anyway, practice with this, try that homework assignment, and then next episode we'll come back and see what happens if, instead of rarefying to 1,326 or whatever it was sequences per sample, we'd rarefied to 2,500 or 5,000 or 10,000. Of course we'd lose samples, because not every sample had that many reads, but what happens to the overall distances? Does this green line go down as the minimum library size goes up, or does it go up? I don't know, we'll have to figure that out in the next episode of Code Club. Make sure that you've subscribed and you've clicked that bell icon so you know when that episode comes. Keep practicing and we'll see you next time for another episode of Code Club.
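For reference, if you've saved that faceted plot to an object (I'll call it p here; that name is mine, not necessarily what's in distances.R), the zero lower bound gets added like this:

```r
# p is assumed to be the faceted ggplot built in the sketch above
p +
  coord_cartesian(ylim = c(0, NA))  # lower bound at 0; upper bound set by each panel's data
```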