Hey folks, if you spend any amount of time in the microbial ecology literature, you will notice that there are two approaches to looking at alpha diversity and beta diversity. Over the past several episodes, we've been looking at one approach, which involves taking our sequence data and putting them into bins. I've been using operational taxonomic units to define a bin, but we could also classify things to a genus or family or phylum and use those as bins, and then use that data, the relative abundance of those different bins in each sample, to look at things like richness, diversity, beta diversity, and so forth.

Well, there's another approach, and that is what's called a phylogenetic approach. In a phylogenetic approach, you take all of your sequence data and you make a phylogenetic tree out of those sequences, and then you look at the clustering within that tree. You look at the branch length of the tree for each sample and, again, compare different samples. So what I want to do in this episode and the next episode is look at these phylogenetic approaches for describing alpha and beta diversity and see how they compare back to our bin-based statistics.

In this episode, we're going to look at alpha diversity, and the metric that is used for a phylogenetic approach is what's called phylogenetic diversity. So what you do is you build a tree for each sample, and then with that sample's tree, you calculate the total branch length across the tree, and that's your phylogenetic diversity. Proponents of phylogenetic diversity approaches prefer it because it looks at the relationship of the sequences, whereas if you put things into a bin, you lose all information, all context, about how those different bins relate to each other. The challenge with phylogenetic approaches, however, is that modern data sets are large, right?
And so it becomes increasingly difficult, really just impossible, to generate a meaningful phylogenetic tree that you can use as input to these different metrics. The data set we're using here from my mouse study is small enough that I was able to build a neighbor-joining tree and run it through these different phylogenetic approaches using the mothur software package. So I'm using this data set as an example, but know that if you try this with your own data, it might not be tractable, because your data set might just be too big and it might just suck up all the RAM on your computer. This episode, we're going to look at phylogenetic diversity at the alpha diversity scale. We're going to use output from mothur and compare it to what we're getting out of vegan when we rarefy our data.

Over here in RStudio, I now have a blank phylogenetic diversity R script. As always, if you want to get the code and the data, down below there's a link in the description to help you get going. I've also got a video that'll give you instructions for how you can use that information in the blog post to get caught up. As always, we'll also do library(tidyverse) to load all the goodness from the tidyverse package, and I'm going to start by reading in the phylogenetic diversity data so we can see what that looks like. What I will do is read_tsv, because it's a TSV file, and it's in my data directory. Let me go ahead and open that data directory over here in RStudio for you to see; what we want is this mice.phylodiv.rarefaction file. Yes, phylogenetic diversity is also sensitive to sampling effort, so it needs to be rarefied. We're not going to go back into that, but just trust me on this. So we'll then do mice.phylodiv.rarefaction. Reading that in, we get a data frame with 659 rows and 361 columns. The columns are our different samples. The rows are the number of sequences that were sampled. You can see that it basically outputs every 100 sequences.
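If you'd like to see the shape of that file without the real data, here's a minimal sketch. It assumes the layout described above: a numSampled column followed by one column per sample. The file name, sample names, and numbers are all invented for illustration, and the toy file is written to a temporary directory so the example runs on its own.

```r
library(tidyverse)

# Toy stand-in for the rarefaction file: every value here is made up,
# and sample_a / sample_b are hypothetical sample names.
toy_path <- file.path(tempdir(), "toy.phylodiv.rarefaction")

tibble(
  numSampled = c(1, 100, 200, 300),
  sample_a   = c(0.2, 3.1, 4.0, 4.6),
  sample_b   = c(0.3, 2.8, 3.5, 3.9)
) %>%
  write_tsv(toy_path)

# Read the tab-separated file back in, as we do with the real one.
rarefaction <- read_tsv(toy_path)

# Rows are sampling depths; columns are numSampled plus one per sample.
dim(rarefaction)
```

With the real file, the path would be the one in the data directory instead of the temporary toy file.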
There are some values in between those, and that's for samples that perhaps didn't have more than 15 sequences in them, so we clearly want to get rid of those. So what we need to do is figure out what sequencing depth we want to use for our analysis. I'll go ahead and do a pull on numSampled so we can see the full range of values there. I'm going to look back through this vector of sequencing depths, and we're looking for the oddball values. What I actually want is this 1804, because the next smallest data set is actually this 1414. So the 1804 is what I want. You'll notice that this number is a little bit different than what we've been using in previous episodes. That's okay, there are just slight differences. I had to rerun mothur to generate this file. I've also put an updated version of the shared file in this data directory.

So we're going to use 1804 for this analysis, and so I will then do filter numSampled == 1804. This then gives me that row, so I'll go ahead and do pivot_longer on everything but the numSampled column, and we'll do names_to the group column and values_to phylo_div. And so now we see we've got numSampled, our group, and our phylo_div. I don't want that numSampled column, so I'll go ahead and do a select minus numSampled, and there we go. We have our table with our group as well as our phylogenetic diversity that's been rarefied to 1804 sequences per sample. I will go ahead and call this data frame phylo_div.

Now what I need is a bin-based metric that I can compare back to this phylogenetic diversity. We've seen things like this before, so we'll kind of quickly go through it. We'll do read_tsv on data/mice.shared, and I'll go ahead and do a select on group and anything that starts with Otu. Again, if you look at my column names here, that'll be the group as well as all those OTU columns, and I will then do a pivot_longer on everything but group.
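Here's a rough sketch of those two pipelines on toy tables, so you can run it without my files. The numbers and sample names are invented. I'm also assuming the shared file follows the usual mothur layout with label, Group, and numOtus columns ahead of the OTU counts, so adjust the column names if yours differ.

```r
library(tidyverse)

# Toy stand-in for the rarefaction table (invented values).
rarefaction <- tibble(
  numSampled = c(1700, 1800, 1804, 1900),
  sample_a   = c(10.1, 10.4, 10.4, 10.6),
  sample_b   = c(8.7, 8.9, 8.9, 9.0)
)

# Keep the row at the chosen depth, then go long: one row per sample.
phylo_div <- rarefaction %>%
  filter(numSampled == 1804) %>%
  pivot_longer(-numSampled, names_to = "group", values_to = "phylo_div") %>%
  select(-numSampled)

# Toy stand-in for the shared file (invented counts; the column layout
# is an assumption based on typical mothur output).
shared <- tibble(
  label   = "0.03",
  Group   = c("sample_a", "sample_b"),
  numOtus = 3,
  Otu0001 = c(900L, 1200L),
  Otu0002 = c(700L, 500L),
  Otu0003 = c(300L, 200L)
)

# Keep the sample name and OTU counts, then go long: one row per
# sample and OTU, with default columns called name and value.
shared_long <- shared %>%
  select(Group, starts_with("Otu")) %>%
  pivot_longer(-Group)

phylo_div
shared_long
```

Keeping the OTU table long at this point makes it easy to total up the sequences in each sample with a grouped sum.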
All right, so I've got that table. Now I want to remove any sample that has fewer than 1804 sequences, so I'll go ahead and do a group_by on group, and then I'll do a mutate with n = sum(value). This gives me that column with my n, and I can of course do an ungroup to remove that grouping. Then I can do a filter for n greater than or equal to 1804, and so now I've removed those smaller data sets. Now what I want to do is expand this back out wide to make a data frame, so I'll do a select minus n, and I will then do pivot_wider with names_from being the name column and values_from being the value column. Just to double check: yeah, we have name and value there, so that'll be good. So now we're back to our wide data frame that only has those samples that we are interested in carrying forward. Then I'll do column_to_rownames with the group column, so we now have a data frame that we can use as input to the vegan package functions. I'm going to come to the top of this pipeline and call it otu_data, and I will also come back up to the top of my script and add library(vegan).

I'll now rarefy that otu_data data frame by taking otu_data and piping it into rarefy, and I want my sample value to be 1804, so I get 1804 sequences from each of my samples. I now get out a vector with the expected number of OTUs that I would see sampling 1804 sequences from each of my samples. I do get a decimal number because, again, you can think of this as like an average of, say, a thousand re-samplings of the community. Under the hood, vegan isn't doing re-samplings; it has a formula that it's using, which we talked about in a previous episode. I can then turn this into a tibble by doing as_tibble, and I can say rownames = "group". This gives me a data frame with a column for my group and the value. I'd really rather that value be richness, so let's do select on group and richness = value, and so again
now we have richness as our column, and I'm going to go ahead and call this richness.

I would like to get a third metric of alpha diversity. We have the phylogenetic diversity from the phylogenetic approach, and we have richness, which again is the number of different taxa that we see. I would also like to get the rarefied Shannon diversity. I know we've talked about this in previous episodes, but I think it'll be a good comparison, and of course it'll be a great review for thinking about how we would rarefy a diversity estimate like Shannon using our OTU data. So to get rarefied Shannon diversity data, I'm going to start with my otu_data data frame, and I'm going to pipe that to rrarefy with sample = 1804. So rrarefy will do one subsampling of the community; it outputs a subsampled table of counts. I can then pipe this into the diversity function, where the default calculator is the Shannon diversity estimate, and so now we get Shannon diversity values for each of our different samples.

I now want to repeat this, say, a thousand times so I can get the average of these subsampled Shannon diversity values. To do that, I'm going to turn this three-line block into a function, repeat that a thousand times, and then output the average of all that. So I will call this shannon_iteration, and that will be a function, and it's not going to take any arguments. I'm going to wrap the body with my curly braces, and I'll just go ahead and indent this so it looks pretty, and I will load shannon_iteration. So again, if I run shannon_iteration, I will get one iteration, and if I keep running this, I'll get different Shannon diversity values every time I run it. So instead of rerunning that function a thousand or a hundred times, I can use a function called replicate. I'll do replicate, and then the n is the number of replicates I want to do. I'm going to do a hundred to keep it relatively simple, and then I'll do shannon_iteration. Oops. So what I get is a list a hundred
units long that is repeating the body of my function each time, right? What I actually want to give this is the expression, and so that's going to be shannon_iteration with its own parentheses, so that it's running shannon_iteration each of those hundred times. This outputs a matrix, and you can see the row names are my different samples, and I then get a hundred columns, one for each of the hundred iterations.

So what I now want to do is take this matrix and turn it into a tibble, and so I will do as_tibble with rownames = "group". I get a warning message after adding that as_tibble statement to my pipeline, and that is because the columns in the output of replicate don't have names; they're numbers, right? What I could add here is the argument .name_repair = "unique", and that will make sure that all of my columns have unique names. So I will run that, and we'll see what that does. For now, I'm also going to add a pivot_longer on everything but the group column, and let's go ahead and see what that looks like, because ultimately what we're going to want to do next is take each group. Let's go ahead and do it now: we'll do a group_by on group, and then we'll do a summarize with shannon equal to the mean of the value column. That outputted some information about what it did with those column names that I don't really care about. What I care about is shannon, and so now I get that tibble that has the group name and the shannon.

Now I've got three data frames: one for phylogenetic diversity, one for the richness, and one for Shannon, and I want to bring those all together. So we'll do inner_join on phylo_div and richness, and we'll do it by the group column, and then we'll do another inner_join with all that stuff plus the shannon data frame, with by equaling group. Sure enough, what we get is a data frame that's got the group column, the phylogenetic diversity, the richness, and the Shannon, and I
want to know how the bin-based metrics compare back to phylogenetic diversity. So I will go ahead and call this pipeline combined, and what we can begin to do is think about making plots where we plot, say, the bin-based metric on the x-axis and phylogenetic diversity on the y. So I'll do combined piped to ggplot with aes, and let's start with x being richness and y being phylo_div, and then let's do geom_point. I'm going to go ahead and add in geom_smooth so we'll get a line through the data to see what it looks like. So again, we see richness across the x-axis and phylogenetic diversity on the y-axis, and it appears like a fairly strong positive correlation between richness and phylogenetic diversity. Great.

So let's try this also with Shannon on the x-axis. This appears to have some positive trajectory to it. There are these smaller Shannon values that are kind of throwing off the appearance just a bit. Maybe what we could do instead would be to filter those out. We could do this a couple of ways; I'm going to go ahead and use the filter function. I'll do filter for shannon greater than two and add that to my pipeline. So, you know, it's not a very strong positive correlation like we saw before with the richness, like this one here, but there is still a bit of a positive trajectory. It doesn't appear that Shannon is super strongly correlated with phylogenetic diversity. We'll come back and quantify that here in a moment.

What we'd maybe also like to do is, let's go ahead and copy that. I'm going to remove that filter line for now, and on the x I'm going to put shannon and on the y I'm going to put richness. What we see here is a pretty strong trajectory, actually, between Shannon and richness. So let me go ahead and put that filter back in so that we're not getting things so skewed by those low-diversity samples. This shows a very strong, I think, positive correlation between richness and Shannon, and ultimately that's because when we rarefy
our data, the richness that we're measuring is actually more of a metric of diversity rather than just straight-up pure richness. So anyway, it's interesting to see that positive correlation there.

What I'd like to do now is take this combined data frame and look at the correlation between the different columns. To look at a correlation, what we can use is the cor.test function, and we can give it two variables, so we'll give it combined$phylo_div and combined$richness. This then very quickly outputs the results of running a correlation test. The default is Pearson's correlation. I think I'd rather use Spearman, and so what I'll do is method = "spearman". Spearman is a non-parametric test. This then gives us a rho of about 0.465; I think before, with Pearson, we were getting 0.535. I feel like the Spearman is a little bit more honest. We get this warning message that we can't get an exact p-value when you have ties in the values. The p-value we get is minuscule, so rho is clearly different from zero. That's what it's testing: is this rho value significantly different from zero? Yes. And if you don't want that warning message, you could always go ahead and do exact = FALSE; that then gets rid of the warning message.

So let's go ahead and repeat this for the different alpha diversity metrics. I'll do phylo_div and shannon, and I'll also do richness and shannon. If we run these three tests, what we'll find is that we have a very strong correlation, as we expect from the plot, between Shannon and richness, with a rho of 0.88. For phylogenetic diversity and Shannon, the rho was about 0.31. It's still positive, but not nearly as strong as what we saw here between phylogenetic diversity and richness. And so perhaps the relationship between the phylogenetic approach and these bin-based metrics of alpha diversity isn't as strong as you might hope for to say, well, I'm not going to worry about phylogenetic diversity, I'm only
going to worry about the bin-based methods. Still, like I said, I have yet to see a case with my data, or really looking at data from the literature, where I see a really compelling difference between results using phylogenetic approaches and bin-based approaches for alpha diversity.

I hope that you get something out of this even if you don't care about bin-based methods or phylogenetic methods. This, again, I think is a really good review of a lot of the things we've been doing with vegan in recent weeks, combining things from base R with dplyr and ggplot2 and the overall tidyverse. You can hopefully see how I approach a problem, again this question of how phylogenetic and bin-based methods relate to each other. In the next episode, I'm going to use the UniFrac metrics, the unweighted and the weighted metrics, and compare them to bin-based metrics like Bray-Curtis and Jaccard to see how well they correlate with each other. So that you don't miss that episode or any of the fun stuff I have on tap, please be sure that you subscribe to this channel and click the bell icon so you get a notification. More than anything, please give this video a thumbs up and tell your friends about what we're doing here. Keep practicing, and we'll see you next time.
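For anyone following along in text, here's a rough end-to-end sketch of the richness, Shannon, and correlation steps from this episode. It is not the mouse analysis itself. The counts and phylogenetic diversity values are simulated, the sample and OTU names are hypothetical, and I'm using a toy depth of 100 in place of 1804 so it runs quickly.

```r
library(tidyverse)
library(vegan)

set.seed(1)   # arbitrary seed so the toy example is reproducible
depth <- 100  # toy stand-in for the 1804 used in the episode

# Simulated counts: 8 hypothetical samples by 20 hypothetical OTUs.
otu_lambda <- rep(c(40, 10, 2, 0.5), each = 5)
counts <- matrix(rpois(8 * 20, lambda = rep(otu_lambda, each = 8)),
                 nrow = 8,
                 dimnames = list(paste0("sample_", 1:8),
                                 sprintf("Otu%03d", 1:20)))
counts[8, ] <- 1L  # make one sample too shallow, to show the filter

# Drop samples below the depth, then go back to a wide data frame
# with samples as row names, which is what vegan expects.
otu_data <- as_tibble(counts, rownames = "Group") %>%
  pivot_longer(-Group) %>%
  group_by(Group) %>%
  mutate(n = sum(value)) %>%
  ungroup() %>%
  filter(n >= depth) %>%
  select(-n) %>%
  pivot_wider(names_from = "name", values_from = "value") %>%
  column_to_rownames("Group")

# Expected number of OTUs at the chosen depth.
richness <- rarefy(otu_data, sample = depth) %>%
  as_tibble(rownames = "group") %>%
  select(group, richness = value)

# One random subsampling followed by Shannon diversity.
shannon_iteration <- function() {
  otu_data %>%
    rrarefy(sample = depth) %>%
    diversity()
}

# Average Shannon over 100 subsamplings.
shannon <- replicate(100, shannon_iteration()) %>%
  as_tibble(rownames = "group", .name_repair = "unique") %>%
  pivot_longer(-group) %>%
  group_by(group) %>%
  summarize(shannon = mean(value))

# Invented phylogenetic diversity values, only so the join has input.
phylo_div <- tibble(group = rownames(otu_data),
                    phylo_div = runif(nrow(otu_data), min = 5, max = 15))

combined <- inner_join(phylo_div, richness, by = "group") %>%
  inner_join(shannon, by = "group")

# Spearman correlation; exact = FALSE avoids the warning about ties.
rho_test <- cor.test(combined$phylo_div, combined$richness,
                     method = "spearman", exact = FALSE)
rho_test$estimate
```

Because the phylogenetic diversity values here are random, the rho you get from this sketch says nothing about the real relationship. It only shows how the pieces fit together.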