 Hey folks, over the past couple of months, I've been looking at a variety of issues related to the analysis of beta diversity. In today's episode, we're going to start talking about the other concept called alpha diversity. In alpha diversity, we look at samples independently of all of the other samples. We're interested in things like how many different tags are there, what does their distribution or evenness look like? And when you combine richness and evenness, you get diversity. Whereas with beta diversity, you're always comparing one sample to another looking at the types of organisms that are shared between pairs of samples. Now, when we look at various metrics for alpha diversity, we do that without any regard for what type of bacteria or whatever organism is present in the community. What that means is you can take a stool sample and a soil sample, and you could measure their richness, their diversity, their evenness, and find out that they have the same alpha diversity, but they might have completely different types of bacteria in those communities. So again, you can have the same diversity, but very different taxa. In this episode, we will be practicing our function writing skills by writing our own code to calculate these alpha diversity indices. In addition, we will also use vegans built in code for calculating some of these very same indices. And we'll also spend a little bit of time introducing the topic that we've been discussing over the past few episodes, which is how sensitive are these metrics to the level of sampling depth. If you would like to get the code and the data that I am starting with down below in the description, there's a link to a blog post that will get you all the coordinates and everything you need to go to GitHub and get things set up. I've also got a video up here that will show you the step by step process of getting caught up. As you can see, I've got this alpha dot r script that I've created using some of the code that we've been using in recent episodes. I load the tidy verse. I set a random number generator seed. I've been forgetting to do that in all the past episodes and none of you told me, Pat, you need to set your favorite random number generator seed. Anyway, we got it now. The data I'm using comes from a study that we published many years ago. Again, if you want that paper down below in the description, there's a link. It's data that we're using to kind of explore these issues with beta and alpha diversity and rarefaction or not. We read in this shared table. And so as we read it in, the rows are our different samples and the columns are the different taxa. We go ahead and make this tidy so that we have a column for the sample, a column for the taxa and a column indicating the number of times we saw that taxa in that group. We also call those taxa otus. So you might find me flipping back and forth. I then have code to generate a rand data frame. What rand is, is that this was created by taking all of the data from shared and I pool all the counts, but keeping each taxa separate so that we have effectively a meta community that we then sample from to regenerate the distributions for the 225 samples in the study. Effectively, these are 225 random samplings of the same overall community, but where we are retaining the original number of sequences that were in each of the samples. So for now, I'm going to go ahead and read all this in. And again, if I look at shared, I find that is a three column data frame with the group, the taxa name, otu name, and the number of times we saw each individual in that group and in that name. Very good. So what I want to do is for each sample, I want to calculate a variety of alpha diversity metrics. So I would like to calculate the richness, the total number of taxa that we saw in each sample. I want to calculate the Shannon diversity index, the Shannon weaver index, it's sometimes called, I want to calculate the Simpson index, the inverse Simpson index. And of course, then we want to know the total number of individuals in each of our samples. So the idea that I'm working with is I've got shared as this three column data frame, we'll pipe that into group by group. Right. And then we can do a summarize, I'll say sobs, which will be my richness on the data that's in the value column. I'll also do Shannon, which will be Shannon on value. Also do Simpson, which will be the Simpson on value. And then I'll do inverse Simpson, which will be one over Simpson. And then I also want the n, which will be the sum of the value column. Now, if I run this, it complains, because there is no richness function, right? So in a bit, I'll show you how we can get richness out of vegan. But for now, I want us to practice our skills writing our own functions. So we are going to write three functions to calculate the richness, the Shannon diversity index, and the Simpson diversity index. So if you don't remember, that's fine. We start with the name of the function. So I'll say richness. And then the assignment arrow, because we're now going to take the function keyword. And we're going to have one parameter, which is going to be x, which is the counts, right? So we're going to pass in that value column into this richness function. And then the special sauce as I like to call it is what happens in between these curly braces to calculate the richness. And so the richness is going to be determined by taking x and all those values of x greater than zero. And by, if I take things greater than zero, I'm going to get true or false values if x is greater than zero, right? So if I have a vector, say things like zero, zero, one, two, three, zero, and then say, I want those things greater than zero, I'm going to get back a series of falses and truths. Now, if I take the sum function and insert inside of it, this logic statement, I will then get back number three, which tells me the number of times something was true, right? So if I do that here, if I do sum on x greater than zero, this will return the number of values of x that are greater than zero, right? And what we could do is we could say something like r equals sum x greater than zero, return r, right? And so this will work perfectly fine. It's a bit verbose. So I'm going to go ahead and comment this for now so that we don't have the verbose version, but the more simplified version would simply be to say sum x greater than zero. So we have that function loaded now. And I'm going to go ahead and comment out some of these extra lines in here for now, so we can kind of confirm to ourselves that this code works. And so now what I see is I've got the group column and SOBs, the richness. Now what we want to look at is Shannon. So I'm going to go ahead and uncomment that and put the closing parentheses here. So what I'd like to encourage you to do is take this formula that I'm showing here on the screen and see if you can write your own Shannon function that will output the Shannon diversity index. I'll give you a second to do that and then come back and I'll show you how I'll do it. All right, so what would I do? Well, I will take Shannon function x, right? And so now I want to plug in my special sauce. And so what I need to do is I need to get the relabund of each of my values in x. So I'm going to take x divided by the sum of x, right? And so what this will give me is values between zero and one indicating the relative abundance. Now the problem is going to come that when we calculate the Shannon value, which will be negative of the sum of our bond times the log of our bond, that if we have a zero value for that our bond, it's going to cause all sorts of problems. So what I need to do is to limit this to the x values that are greater than zero. And we kind of saw how to do this up here already. But I can use square brace notation to say x, and I want x values greater than zero. And so now I have my Shannon diversity function here. So hopefully you got something close to that. So I'm getting this error, could not find function Shannon. When I get that error message, it typically means one of two things. So either I forgot to load the function or load the package that the function comes from, or I've misspelled the function name. So I'm going to double check by coming back up here and reloading the function, and then running this again. And it says still couldn't find that function. I know it's loaded. So let me take a closer look. And sure enough, I've got three ends in there for Shannon. So I'll go ahead and reload that rerun the code. And now I get my Shannon diversity values in here. And we're in good shape. So the next thing I want to do is go ahead and remove those comments so that we are presenting the Simpson and inverse Simpson diversity indices. And I put that close parentheses in there and get rid of that or comment out that comma. And of course, this doesn't work because we don't have the Simpson function loaded quite yet. So again, I'm going to encourage you to write your own Simpson diversity index, you should know that there are a variety of different versions of the Simpson diversity index out there. We are using a version that I think is called the bias corrected version, where you do the minus one, rather than the straight up squaring of the relative abundance. The values are really similar, probably doesn't really matter. And we'll find later that vegan actually uses a slightly different version still. So I'm going to go ahead and create my function Simpson. And I'll use the function keyword, of course, and I'm going to pass that x as my values from that value function that counts. And in my special sauce, I'm going to start by calculating the total number of individuals in x. And so I'll say n is the sum on x. And then to calculate the value, it's going to be x times x x minus one divided by n times n minus one. And so I need to then sum up all those values together to get the Simpson value. So I'll do some on that. And let's go ahead and load this and then run it through. Everything works good. And we see, yeah, so that we then have one row per group, we have the four different diversity metrics. And that looks good. I now want to bring in a column to indicate the total number of sequences, the number of individuals we sampled from each of our communities. So I'll go ahead and move that parentheses and that pound sign and get that n in there. I think that looks good. And so yeah, now we have the number of individuals in each of our samples. So let's go ahead and plot these data. What I'd like to do is plot the four diversity metrics as a function of the number of individuals. And as we've seen before, I'd like to have four facets for each of the metrics. And so to do this, we can of course do a pivot longer. And the columns that we're going to pivot longer will be SOBS, Shannon, INV, Simpson, and Simpson. That all looks right. And then we will do names to to be metric. And so we've got that pivoted longer quite nicely. And then I will pipe this to ggplot aes. So on the x axis, we want the n on the y, we want the value, we're going to do a geome point. And I'm also going to do a geome smooth. And then we're going to do a facet wrap where we'll facet on the metric. And we'll do n row equals four. And we'll do scales equals free y. And so what we get, of course, is our four metrics plotted with their value on the y axis, and on the x axis, the total number of sequences for each of those samples. And again, each of these points represents a different sample. And so what we see is that for SOBS, I think it's pretty obvious that as you sample more, the total number of things the richness in that community is also going up. One thing to keep in mind is that this is the observed data. There might be some treatment effect in here, whether it's the sex of the mice, the age of the mice, anything that we did experimentally to the mice. And so that we're seeing the SOBS effect here tells me that it's probably a pretty strong effect of sampling effort on the observed richness. And perhaps a lot of the noise that we're seeing, perhaps being caused by those different effects of age or sex or whatever is kind of contributing to the noise we see in those other values. And so something that I would like to do is why don't we replace shared with RAND. So again, RAND is that random sampling of the pooled data. And so again, we would expect that 225 or whatever randomized samples to all be a statistical sampling of the same community. And so those should all give us, more or less, the same value for these different indices. And so if we look at RAND, we again see the same three columns. So I'll just plug in RAND here. And so now we see much more clearly the effect of sampling effort on the observed richness. Simpson seems pretty flat. Maybe it's increasing a little, but I don't think so. Shannon seems to have some effect where kind of in the first 5000 sequences being sampled, there is an increase in Shannon, but that does also seem to kind of plateau. Inverse Simpson seems fairly flat as well as you'd expect if Simpson were flat. One of the reasons that the Simpson and Inverse Simpson might not be so impacted by sampling effort is because you're basically squaring the relative abundance. And so what that does is really emphasize the calculation of the metric to be driven by the most abundant organisms in the community. And so as you sample more, you're really giving more definition to the rare members of the community. Something like Shannon, because it's instead of being squared, so relative abundance squared, it's the relative abundance times the log of the relative abundance, that that gives a more balanced accounting of the different populations in the community. And so you see a bit more of a sampling effect on the calculation of Shannon. Again, this is one data set from mice. If we looked at other data sets from humans or soils or wherever, we might see a larger or smaller effect. So we're going to explore this issue of rarefaction and randomization in future episodes. One thing I want to come back though is to show you how we could do the same type of thing using vegan. And so what I will do is I'm going to go ahead and copy this down. And we will come back up to the top and I'll do a library vegan. So we can then load those helpful functions from the vegan R package. So I'm going to step through each of these functions, and then see how we can change them to use functions from the vegan package. So the first would be richness. And so this would then be spec number. So spec number returns the richness of the community, as I toggle back and forth between the my version and the current version. We see no difference in the richness. Next would be to change out that Shannon value. And so for that, we can use diversity as the function. And we give it the value and then we give it index. And then we can say Shannon. And so now for Shannon as I toggle back and forth, again, I don't see a difference between their implementation and my implementation of the Shannon diversity index. Next we want to look at Simpson. That'll be the same function as we saw with Shannon, except instead of index equals Shannon, we'll use index equals Simpson. So I'll do diversity. And then index equals Simpson. So as I toggle back and forth between the vegan version and my version of the Simpson value, you'll notice the values on the y axis are quite a bit different. One of the things that they do in vegan is that Simpson is one minus the parameter we calculated. And actually, they're not using the bias corrected, they're using the square of the relative abundance. So if we wanted to prove to ourselves that that was what was going on, in fact, I'll go ahead and comment out this my sum line. So I could do one minus the sum of x divided by n and all that squared, reload that function, I could then bring in my version, I could say like my Simpson equals Simpson on this and remove the index stuff. And then I'm going to want my Simpson. So now I see that my version sure enough recapitulates what they are using in vegan. Again, when I look in the literature for how to calculate and measure alpha diversity, I think more often than not, I just see the sum of the square of the relative abundance or perhaps this bias corrected version, I don't usually see the one minus version. I think what's most important is that you know what you're using and what you're looking at, and make that then transparent to your audience. So they know, if they have a higher diversity, you then have a lower Simpson, or what's going on doing one minus Simpson kind of does the same type of thing of one divided by Simpson to get the inverse Simpson does my goal for this episode again, was to help you to learn how to write your own r functions by implementing these three different ecological alpha diversity metrics. Yep, we can do them in vegan and vegan is a great package makes it easy to do that. They've really got some good software engineering under the hood for all sorts of edge cases that we might not account for in our pretty simple implementations of those functions. Again, I just wanted to give you a chance to practice writing your own functions and then seeing that you can implement those in a group by summarized pathway that then of course reads into generating these plots. The second thing I wanted to share with you was that for sure the observed number of OTS or taxa the richness is definitely influenced by sampling effort. The Shannon diversity to a lesser extent but still has some influence of sampling effort on the Shannon diversity metric. And then the Simpson and inverse Simpson have a little bit of a sampling effect but not nearly as great as the Shannon. And that's most likely because of this difference between the relative abundance squared versus the relative abundance time to the log of the relative abundance. Hope that all makes sense. I encourage you to play around with writing your own functions. See where you can, you know, perhaps use these functions or other functions like it that you're writing for your research. It really goes a long ways to helping to make your code dry so that you're not repeating the same chunks of code over and over again. Keep practicing and we'll see you next time for another episode of Code Club.