Hey folks! When I think about analyzing microbial diversity from the microbiomes that we all study, I generally think of two different approaches. In the first approach, we take our sequence data and assign the sequences to bins, or categories, whether based on taxonomy or on how similar the sequences are to each other. I call these bin-based statistics, and with them we measure things like richness (the number of types) and Shannon diversity. We can also look at beta diversity, comparing who's there and in what abundance. The second approach that I see people frequently using is what I call a phylogenetic approach. Instead of putting your sequences into bins, you build phylogenetic trees from those sequences and then look at the structure of the tree and the branch lengths to figure out which branch length is unique to one sample or another. This gives rise to metrics like phylogenetic diversity, which we talked about in the last episode, as well as a suite of tools called UniFrac, short for "unique fraction". There are two flavors of UniFrac. The unweighted approach is a lot like a membership-based metric of beta diversity, such as Jaccard. You can think of Jaccard as being like the Mastercard symbol: you've got a pool of taxa from one community, another pool of taxa from another community, and you want to see how much those pools overlap, how many taxa are shared between the communities. We don't care about the abundance of those taxa, just which taxa overlap. That's what I think of as membership, and unweighted UniFrac lets us measure the similarity or dissimilarity of our communities based on membership. The second way of measuring beta diversity is what I call structure. This is based on membership, but also on the abundance of the taxa in those bins.
Again, for something like a Bray-Curtis metric, we might incorporate the relative abundance, the frequency at which we see each taxon. In the phylogenetic approach, we'd use a method like weighted UniFrac, which is a lot like unweighted UniFrac, except that we weight the shared and unique branch lengths by the frequency of the individuals each branch leads to. So again, these are two different approaches. As our datasets get larger and larger, I find that the phylogenetic approaches become a lot less tractable, because building the tree is really hard when you've got tens of thousands or hundreds of thousands of sequences. The bin-based approaches are a lot more tractable, because you can use a classifier to put things into bins based on taxonomy, and even methods like assigning sequences to operational taxonomic units (OTUs) are a lot easier. So the question naturally arises: if I do a bin-based approach, does that cost me anything over using a phylogenetic approach? How different are the results? We might also ask how different the results are if we do something based on membership versus something based on structure. Membership and structure are two different concepts, and so we might think they should look a bit different, but do they? That's what I want to explore in today's episode. I've gone ahead and generated the UniFrac-based distance matrices in mothur, outside of R. If you want those data, as well as all the data and code that we've been working with over the past couple of months, there's a link in the description below to a blog post with instructions on how to get caught up.
I also have a little tutorial video that will help you get going as well. So again, what I want to do is compare the weighted and unweighted UniFrac distance matrices to distance matrices for the same data calculated using Jaccard and Bray-Curtis, so we'll have four distance matrices to compare to each other. The main tool I want to use to compare these matrices is something called the Mantel test, which lets us look at the correlation between different distance matrices; it's another great tool coming to us from vegan. So in my new R script, I'm going to start with library(tidyverse) and library(vegan). The first thing I want to do is review how we generate the distance matrices for Bray-Curtis and Jaccard. As we've already seen, we can do read_tsv() on the data I want, mice.shared, and then do a select() to keep the Group column and the columns that correspond to our different OTU bins. Again, this is a bin-based representation of the data. I need this to be a data frame with row names so it can go into vegan to calculate the distances, so we'll do column_to_rownames() using the Group column. Then we'll feed that into avgdist(), with sample set to 1804: we'll draw 1804 sequences from each sample, calculate a Bray-Curtis distance matrix, repeat that 100 times, and then calculate the average. The dmethod we'll use is "bray". This gives me a warning message that a dozen or so samples were removed because they didn't have 1804 sequences. Again, if you want to learn all about this sampling-depth issue, look back through the previous episodes in this playlist.
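As a rough sketch, the pipeline just described might look like this in R. The file name and the rarefaction depth of 1804 come from the episode; the `starts_with("Otu")` selector and the `shared` variable name are my assumptions for keeping only the OTU count columns:

```r
library(tidyverse)
library(vegan)

# Read the shared file; keep the sample name (Group) and the OTU count columns
shared <- read_tsv("data/mice.shared") %>%
  select(Group, starts_with("Otu")) %>%
  column_to_rownames("Group")   # vegan wants sample names as row names

# Subsample to 1804 sequences, compute Bray-Curtis, repeat 100 times, average
bray <- avgdist(shared, sample = 1804, iterations = 100, dmethod = "bray")
```

avgdist() warns about (and drops) any samples with fewer than 1804 sequences, which is the warning mentioned above.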
Now we need to convert this into a tibble so that we can use all our great dplyr and ggplot tools with it. It's currently stored as a dist object, and to get it into a tibble we need two steps. First, as.matrix() converts the dist object into a matrix. Then, to get from a matrix to a tibble, we do as_tibble() with rownames set to "a". Next we do a pivot_longer(), taking the column names of our matrix and making those another column called "b", so that we have our row names in a, our column names in b, and our distances in a column I'll call bray. So we pivot_longer() everything except a, with names_to = "b" and values_to = "bray". Let's also save this data frame to a variable called bray_tbl. We see that we've got columns a and b for the rows and columns of the matrix, as well as bray for the actual distance. I'm going to copy this and replace bray with jaccard in both places, so the dmethod will be "jaccard" and values_to will be "jaccard". Now we have the distances for our structure-based as well as our membership-based approaches. Next we want to get the distances in from the distance matrices for the weighted and unweighted UniFrac that I calculated outside of R, in mothur. To do that, recall that a ways back we used a lot of base R functions to write a function that reads in a PHYLIP-formatted distance matrix, which is exactly what we have here. So back up at the top of the script, I'm going to run source() on code/read_matrix.R.
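The dist-to-tibble conversion and the Jaccard variant described above can be sketched as follows, assuming `shared` and `bray` hold the objects from the avgdist() step described earlier (the variable names are mine):

```r
# dist -> matrix -> long tibble: one row per ordered pair of samples
bray_tbl <- bray %>%
  as.matrix() %>%
  as_tibble(rownames = "a") %>%
  pivot_longer(-a, names_to = "b", values_to = "bray")

# Same pipeline with the Jaccard distance for the membership-based view
jaccard_tbl <- avgdist(shared, sample = 1804, iterations = 100,
                       dmethod = "jaccard") %>%
  as.matrix() %>%
  as_tibble(rownames = "a") %>%
  pivot_longer(-a, names_to = "b", values_to = "jaccard")
```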
If you go ahead and run that and look over in the Environment tab, we see that we've now got a read_matrix() function, and we can use it to read in our matrices. So I'll do read_matrix() on the unweighted file first: data/mice.unweighted.ave.dist. The "ave" is in the file name because this was also calculated with rarefaction: mothur calculated 100 unweighted UniFrac distance matrices, subsampling everything to 1804 sequences, and then averaged them, and that's what we've got here. I forget what this actually gets read in as, so let's pipe it to glimpse(). We see that this is actually a matrix, 348 by 348; if it had been a dist object, it would have said "dist" here. So we can go directly to a tibble by doing as_tibble() with rownames = "a", like we did before. And again we can do a pivot_longer() on everything but a, with names_to = "b" and values_to = "unweighted". Now we have a, b, and unweighted, but something I'm noticing that's a little odd is that we have a lot of extra spaces to the right of each of the sample names. I wonder if this is a product of how the file was output from mothur. What we'd like to do is strip those spaces out of a and b. How do we do that? Well, if we want to modify a column, remember that the function we need is mutate(). So we can pipe this into mutate() and set a equal to str_replace_all(), since we want to replace all of the spaces, not just the first one. We give it a as the string, a space as the pattern, and an empty string as the replacement: take a, find all the spaces, and replace them with nothing.
And now we see that that got rid of all the spaces. This is one way to do it, and it's the way that's easiest for me to remember, because I use str_replace() and str_replace_all() all the time. But something we might use instead is str_trim(), which basically does this same job. So we could do str_trim() on a, and again we see that it removes the trailing spaces. We can do the same thing with b: copy the line, swap a for b, and now all those extra spaces are gone. I'm going to save that version as a comment, because I'd like to show you another way. I use mutate() like this a lot, but there are other mutate functions that perhaps we're not as familiar with, so let's practice those just to get a chance to use them. Maybe we don't need to write two lines; maybe we can get away with one. One option is mutate_at(): we give mutate_at() the columns we want to mutate, a and b, and tell it to run str_trim() on them. Again, we see that that worked. But say you had a whole bunch of columns you wanted to trim the whitespace on; listing them all could be a bit painful. So another approach would be mutate_if() with is.character: if a column is of type character, change it by running str_trim() on it. Again, we get the same output. So these are three different ways to do the same thing, trimming the extra whitespace from the ends of our sample names. I'll go ahead and save this as unweighted so that we can compare it to bray and jaccard.
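Here's a sketch of reading the mothur-generated matrix and the whitespace-trimming variants discussed above. The exact file name is reconstructed from the episode's description, and read_matrix() is the helper sourced from code/read_matrix.R:

```r
source("code/read_matrix.R")  # defines read_matrix(), written in an earlier episode

unweighted <- read_matrix("data/mice.unweighted.ave.dist") %>%
  as_tibble(rownames = "a") %>%
  pivot_longer(-a, names_to = "b", values_to = "unweighted") %>%
  # option 1: replace every space with nothing
  mutate(a = str_replace_all(a, " ", ""),
         b = str_replace_all(b, " ", ""))

# option 2: str_trim() strips leading/trailing whitespace
#   mutate(a = str_trim(a), b = str_trim(b))
# option 3a: run str_trim() on just the named columns
#   mutate_at(c("a", "b"), str_trim)
# option 3b: run str_trim() on every character column
#   mutate_if(is.character, str_trim)
```

All four versions produce the same cleaned sample names; which you choose is a matter of taste and how many columns need trimming.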
I'm going to copy this down as well, and we'll remove the "un" so we can have a weighted version too: change the file name, change the values_to, and everything should be good there. So now we have weighted. Great. And so now we have bray, jaccard, unweighted, and weighted, and we want to combine them all together. We've seen this a lot by now: we can do inner_join() with bray_tbl and jaccard_tbl. I'm remembering that I actually named the UniFrac tibbles weighted and unweighted without the _tbl suffix, so let's go ahead and put that on for now, giving us unweighted_tbl and weighted_tbl. We join by a and b, which combines the data frames by matching up the sample names in the a and b columns, and now we've got a, b, bray, and jaccard. We'll repeat this two more times, joining what's coming through the pipeline with unweighted_tbl, again by a and b, and then once more to get weighted_tbl. So now we have our samples in a and b and our four different approaches to calculating beta diversity. Let's save this to a variable we'll call combined. Now let's make some plots to compare the four metrics. We take combined and pipe it toward ggplot, but our data are currently square: we have F3D0 compared to F3D1, but we also have F3D1 compared to F3D0. So let's do a filter() for a < b, so that we're only looking at one side of the diagonal and we're also dropping the self-comparisons like F3D0 versus F3D0. We can then pipe that to ggplot() with aes(). Let's put bray on the x-axis and weighted on the y-axis for now; these are the two structure-based metrics. Then we'll add geom_point() and geom_smooth().
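Putting the joins and the first plot together, the steps above might look something like this (tibble names follow the renaming described in the episode):

```r
# Join the four long tibbles on the sample-pair columns a and b
combined <- inner_join(bray_tbl, jaccard_tbl, by = c("a", "b")) %>%
  inner_join(unweighted_tbl, by = c("a", "b")) %>%
  inner_join(weighted_tbl, by = c("a", "b"))

# Keep one triangle of the square matrix, drop self-comparisons, and plot
combined %>%
  filter(a < b) %>%
  ggplot(aes(x = bray, y = weighted)) +
  geom_point() +
  geom_smooth(se = FALSE)
```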
I'm going to put in se = FALSE to get rid of the cloud, if you will, around the fitted line. And we see a pretty linear fit to the data, with our Bray-Curtis distances on the x-axis and the weighted UniFrac distances on the y-axis: as the distance between samples by Bray-Curtis, the bin-based approach, increases, the weighted UniFrac distances also go up. What's interesting to me is that at a Bray-Curtis distance of one, where no OTUs are shared between the samples, we still get a weighted UniFrac value of less than one. That's because the weighted approach is a phylogenetic approach that takes into account the relatedness between the different sequences, so weighted UniFrac makes things look a little more similar than Bray-Curtis does, which is interesting. Let's copy this and now look at our membership-based statistics: jaccard on the x-axis and unweighted on the y. That looks a bit different. We mostly see a positive trend: as Jaccard increases, unweighted UniFrac increases. And like we saw with Bray-Curtis versus weighted UniFrac, the Jaccard distances, the bin-based distances, can be larger than the phylogenetic distances. It isn't perfect; we see some funkiness going on, and there's also something weird at very low Jaccard values, where we still see moderate unweighted UniFrac values. But again, there is a positive correlation between these two ecological metrics. Something else we could think about doing is comparing jaccard to bray. These are both bin-based statistics, Jaccard being membership and Bray-Curtis being structure, and what we see here is a really smooth fit between the two. It's not a perfect linear relationship.
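The other two comparisons just described follow the same pattern, only swapping the aesthetics (assuming the `combined` data frame built earlier):

```r
# Jaccard (bin membership) vs unweighted UniFrac (phylogenetic membership)
combined %>%
  filter(a < b) %>%
  ggplot(aes(x = jaccard, y = unweighted)) +
  geom_point() +
  geom_smooth(se = FALSE)

# Jaccard (membership) vs Bray-Curtis (structure), both bin-based
combined %>%
  filter(a < b) %>%
  ggplot(aes(x = jaccard, y = bray)) +
  geom_point() +
  geom_smooth(se = FALSE)
```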
But at the same time, it's a pretty consistent trend, which I find really interesting: Jaccard has no abundance information in it, and yet we see such a tight correlation with the Bray-Curtis distances. That's pretty cool. So let's repeat this and do the same type of thing for UniFrac: on the x-axis I'll put unweighted and on the y-axis weighted. Here we see the scatter plot comparing the unweighted distances to the weighted distances. It's not as clean a correlation as we saw with Jaccard and Bray-Curtis, but it still looks pretty linear and pretty strong in a positive direction. Okay, so that's a visual way of looking at the data. What I'd like to show you now is how we can get the mathematical correlation using a test called the Mantel test. The Mantel test is a test you can use to compare different distance matrices. Here I'm comparing different beta diversity metrics, but you could also take, say, a beta diversity matrix and some type of environmental matrix. Say you had a matrix describing the chemical similarity of your communities; you could compare that chemically based matrix to the ecologically based matrix and see whether there's any correlation between the two matrices. It's effectively like doing a correlation test, like a Spearman or Pearson correlation, but it does some special things for assessing whether the correlation is statistically significant. There's a function called mantel() in the vegan package that we'll be using, and its input is two distance matrices. So we need to extract our distance matrices from our combined data frame so we can feed them into the Mantel test. Let's make those, starting with Bray-Curtis. We take combined and select() a, b, and bray, so we get those three columns, and then of course we can do pivot_wider() with names_from = "b" and values_from = "bray".
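Rebuilding a dist object from the long tibble, as just described, can be sketched like this (again assuming the `combined` data frame from earlier):

```r
# long tibble -> wide data frame -> matrix with row names -> dist object
bray_dist <- combined %>%
  select(a, b, bray) %>%
  pivot_wider(names_from = "b", values_from = "bray") %>%
  column_to_rownames("a") %>%
  as.dist()
```

The same pipeline, swapping the selected column, yields jaccard_dist, unweighted_dist, and weighted_dist.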
Again, we get the wide data frame, and we can do column_to_rownames() with a; that gets us a matrix, but we need a dist object, so we pipe this to as.dist(), which gives us a lower-triangular distance matrix. I'll call this bray_dist. We can now repeat this for our three other distance matrices. We'll do jaccard, extracting the jaccard column and using values_from = "jaccard". Then we'll do it again for unweighted_dist (let me copy that name so I don't have to type it over and over), and then for weighted we'll do the same thing, except we remove the "un". Great. Let me just double-check that I ran all of these. Great. So we now have our four distance matrices loaded into R, and we can do the Mantel test: mantel() with bray_dist and weighted_dist. The mantel() function expects the input to be two dist objects, not two matrices. I've been burned by this in the past, and I don't want you to be burned by it too: if you give it a regular matrix rather than a dist object, it will calculate distances from your matrix; if you give it a dist object, it won't do that. All right, here's the output. What we're interested in is the Mantel statistic r, and we see a value of 0.8734, which is clearly significant. By default it uses a Pearson correlation, so if you wanted to use Spearman, as we've seen in other episodes, you could add method = "spearman", and that gives us a fairly similar correlation value. If I want to extract that value, one thing I could do is take .Last.value and pipe it into glimpse(). That shows us the structure of the list object generated by the mantel() function, and the value we want is statistic.
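With the four dist objects in hand, the Mantel tests described above look like this:

```r
# Pearson is the default; both arguments must be dist objects, not matrices
mantel(bray_dist, weighted_dist)

# Spearman variant, pulling out just the correlation statistic
mantel(bray_dist, weighted_dist, method = "spearman")$statistic
```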
So I can append $statistic, and we get back that 0.824458. Very good. Let's copy this down a few times: I want weighted_dist versus unweighted_dist, then bray versus jaccard, and then jaccard versus unweighted. Those are the four main comparisons we want. Let's run all of these and see what we get for our different correlation statistics. All right, it ran through them. The mantel() function does take a little longer to run than something like cor.test(), because it's doing a permutation-based test to calculate the p-value for each of these correlation values. Frankly, I think these correlation values are so large that it's obvious they're significantly different from zero, so you could probably run cor.test() on them and get the same result, but I wanted to show you how you can use the Mantel test. And here we are. If we compare bray, the bin-based method, to weighted UniFrac, the phylogenetic approach, we see a correlation of about 0.82; that's pretty good. Weighted versus unweighted UniFrac gives a correlation of about 0.53. If we look at bray versus jaccard, comparing the structure- and membership-based approaches using bin-based statistics, we get a correlation of basically one, which is what we would have expected when we saw that very smooth curve. And if we compare jaccard to unweighted, the two membership-based metrics by the bin and phylogenetic approaches, we get a correlation of about 0.41, less than what we saw between weighted and unweighted UniFrac. As I pointed out in the last episode, I have yet to see a case where comparing a bin-based approach to a phylogenetic approach gave me different results, and the results from these Mantel tests certainly bear that out.
Of course, what I'm looking at here is one dataset from a mouse study. It might be different for your soil study or marine system or human system or whatever. What I would encourage you to do is try this yourself if you can and see how it works out. And if you can try the phylogenetic approach, that means you can build a phylogenetic tree, so you may as well report those phylogenetic results alongside the bin-based methods. But if you can't, because your dataset is too large to build a tree, then have some confidence that you probably wouldn't have seen the bin-based method give you a different result from the phylogenetic method. Where I might see a difference is between the membership- and structure-based analyses. The mouse samples I'm working with here are pretty similar, because it's a time course of a bunch of mice that were all living together; they're kind of clonal. I could imagine that if you were looking at soils collected from all over the United States, or all over the world, structure and membership might not have such a one-to-one correlation like we saw here. As always, think deeply about what you are measuring and the types of questions you're trying to answer, and make sure that your metrics are doing that for you. Don't run a whole bunch of diversity metrics through adonis() looking for a significant p-value: pick the metric first and then do the test. If you calculate all the distances and then run all the tests, you're really p-hacking, and that's just not an appropriate way to study microbial communities. Well, I hope you found this interesting. Keep practicing with this. As always, please try this with your own data, let me know if you get a result that's different from mine, and we'll see you next time for another episode of Code Club.