Hey folks, in the last couple months of episodes, we've noticed that if we take the same community and sample it to different depths, we get different values for multiple ecological metrics we're measuring. Well, in today's episode, I want to take a step back and focus on richness, again the number of different types in a community, and how ecologists have traditionally controlled for uneven sampling effort. As we've talked about a lot, that's called rarefaction. I've talked empirically about how we can rarefy our data when discussing beta diversity metrics a few episodes back, but I want to hone in again on richness, and I'm going to do something very scary. I know many of you became biologists because you hate math, but today's episode is all about math. I'm going to show you the math behind rarefaction and how we can use R to calculate the parameters and values that go into estimating the expected number of taxa we would have seen if we had had a smaller number of observations for any of the samples we're looking at. Don't worry, we'll get through this. And keep in mind that this isn't actually how most of us do rarefaction; typically we use the empirical approach. But I thought this would be a good opportunity to show you the math behind rarefaction, as well as a variety of the functions available to us within R for dealing with things that have to do with probability: factorials, combinations, things like that. All right, let's head over to RStudio and get going with today's content. We have a starter file, rarefaction.R, that loads in the tidyverse. I'm not going to use a random number generator today, but I'm going to leave that set.seed in there. And then I've got some code here that cleans up a data set we've been working with.
This is a collection of data that came from a study my lab did and published, using fecal samples collected from mice. We're looking at the different types of bacteria in about a dozen different mice over the first 150 or so days of life. I'll go ahead and run this so that we now have shared loaded into our session, along with all the goodies from the tidyverse. Great. You'll see that we have three columns: group, which is the sample; name, the name of the taxon (again, I'm working with OTUs from bacteria); and value, the number of times we saw that OTU in that sample. I'm going to quickly review some of our dplyr tools for calculating the number of observations we have for each of our samples. Again, the point of rarefaction is to control for this uneven sampling effort. So I can take shared and pipe it into group_by(group). Then I'll pipe that into summarize, and I'll say n = sum(value). And so now I get my counts. I'm also going to go ahead and add in something I'll call sobs, which will be the observed number of OTUs in each of the samples. We saw how to do this in the last episode — I even created a function — but we'll do it straight in here. So I'll say sum(value > 0). Again, value > 0 converts to a TRUE or FALSE value; TRUE values count as one, FALSE as zero. So if you're adding up all those ones, you're getting the richness, right? And so now we've got our groups, our n's, and our sobs. We can see, for example, that F3D0 — female three, day zero, the day she was weaned from her mother — had 6,229 sequences, and we saw 167 OTUs across those sequences. And if you look at the n and the sobs, we showed in the previous episode that as you get a larger n, you get a larger sobs. Again, this is the reason why we need to control for our sampling effort.
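The dplyr pipeline being narrated here looks roughly like this — a sketch with a tiny made-up stand-in for the `shared` table, since only its three columns (group, name, value) are described:

```r
library(tidyverse)

# toy stand-in for the `shared` data frame described above:
# one row per sample/OTU pair, with the count in `value`
shared <- tibble(
  group = c("F3D0", "F3D0", "F3D1", "F3D1"),
  name  = c("Otu001", "Otu002", "Otu001", "Otu002"),
  value = c(10, 0, 5, 3)
)

counts <- shared %>%
  group_by(group) %>%
  summarize(n = sum(value),         # total sequences per sample
            sobs = sum(value > 0))  # observed richness per sample
```

With the real data, the same pipeline yields one row per sample with its library size and observed number of OTUs.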
And if I throw another summarize on here and do min = min(n) and max = max(n), we see that we have — what is that — about a 16-fold variation from the smallest library up to the largest library. And so in this case, I might want to rarefy everything down to 1,828 sequences from each of my samples. So that's the task for today's episode. I'm going to start out with a single sample. So what I'll do is go ahead and grab shared, and again, going back to our dplyr tools, I'll do a filter on group for F3D0. And so now I have all of the data from female three, day zero — 1,588 rows — so it's a much simpler data set to work with. Again, we have the value column, which is the number of times we saw each OTU in female three, day zero, and we're going to use this frequency information to calculate the rarefied, estimated number of taxa that we would see if we had a smaller number of sequences. And so what I want to do is rarefy to 1,828, because, as we saw up here, that was the size of the smallest library that I had. We could rarefy to any value, but for demonstration purposes, I'm going to go with 1,828. So here I have the formula for calculating the estimated number of taxa that we would see for a smaller sample size. I'm going to spend a little bit of time walking you through this formula so you have a sense of what is going on when we say rarefaction. First, Sr is the estimated number of species or taxa that we would have for the reduced, or rarefied, data set. St is the total number of OTUs we observe in the data set with the full collection of sequences. Say we had 6,000 sequences; St would be the number of OTUs we see across those 6,000 sequences, and if we rarefied down to 1,828, Sr would be the estimated number at 1,828. And so then we have capital N, which I am going to take to be the total number of sequences — again, in that silly example, 6,000.
Whereas Nr is the reduced, or rarefied, number of sequences, which would be 1,828. And Ni, then, is the number of times we see each of the St different OTUs. Now, what's going on in these parentheses is not a fraction — you'll notice there isn't a horizontal bar there. The way to interpret it is: say I've got 6,000 sequences, and I want to draw 1,828 from that 6,000. What this term calculates is the number of different combinations of 1,828 sequences that I can draw from the 6,000. I'll go through an example in R to show you this, but that's what we're looking at. So what this ratio of the two combination values is actually giving us is the probability of not seeing an OTU when we sample 1,828 sequences out of the bigger pool of 6,000. That's the probability of not seeing the OTU; we subtract that from 1 to get the probability of seeing that OTU. So if an OTU is super abundant, we're going to be really likely to find it — the probability of missing it goes to zero — and we'll be left with a value of 1. And if we have a really rare OTU, the probability of missing it will approach 1, so we'll have 1 minus 1, which is 0, and it won't contribute to the summation. If we sum up all these values, that outputs the total number of OTUs, or taxa, that we'd expect to see in the rarefied data set. But again, this term on the right is the probability of not seeing an OTU or taxon in the rarefied data set; we subtract that from 1 to get the probability of actually seeing it, and then we sum up all those probabilities, which gives us the estimated richness of that community at the rarefied depth. So what is this funny fraction thing without the fraction bar? Well, like I mentioned, in mathematical terms this is expressed as "n choose k": we have n items — 6,000 items — and we want to choose 1,828 items out of that. The denominator is N choose Nr.
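Written out with the symbols just described, the formula on the screen is:

```latex
S_r \;=\; \sum_{i=1}^{S_t} \left[ 1 - \frac{\binom{N - N_i}{N_r}}{\binom{N}{N_r}} \right]
```

where $S_t$ is the observed richness, $N$ the total number of sequences, $N_i$ the count of OTU $i$, and $N_r$ the depth we're rarefying down to.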
And the mathematical formula for n choose k is n! divided by k! times (n − k)!. If you forget what a factorial is: for 3 factorial, that's 3 times 2 times 1, so you'd get 6. And so for whatever values of n and k, you can use this calculation to get the total number of combinations of k items that you can draw from n. And again, rarefaction is sampling without replacement, and this "choose" is also without replacement. Let's come back to R and see if we can implement some of these ideas. The first thing I want to show you is the factorial function. We could do factorial(3) and get 6, as I mentioned. We could also do 30 and get a very large number. These numbers get huge, right? If we go up to 100, we get about 9 times 10 to the 157 — big, big numbers. And let's see what happens if we go up to 200. At factorial(200), we get a number so large that our computer can't hold it. And so this gets to another point: how big a number can R represent? Well, there's a helpful list variable within R that helps you with this. If you type .Machine, you get back a list containing a variety of values that indicate how R stores data. We'll see that the largest double — floating-point number — is 1.797 times 10 to the 308, and the smallest positive double is 2.225 times 10 to the minus 308. So this has a lot of good information in it if you're wondering what that upper bound is. Something we might do is run factorial(1:200), and we see that at about 170 we lose the ability to calculate the factorial value. And if we do .Machine$double.xmax, we see that the largest double is about 1.8 times 10 to the 308, which is just above what we get for the factorial of 170.
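In base R, the overflow point being described looks like this:

```r
factorial(3)          # 6
factorial(100)        # ~9.33e157, still representable
factorial(170)        # ~7.26e306, the last finite factorial
factorial(171)        # Inf: overflows the largest double
.Machine$double.xmax  # ~1.797693e308, the largest double R can hold
```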
Okay, so this will cause problems, because as you know, we'll have, say, 6,000 sequences. So how do we calculate a factorial of 6,000? Well, we'll see a solution to that. The other thing we're thinking about is these combinations. One way to look at the different combinations is with the combn function. Here I can say combn(5, 3), and this gives me back a matrix of the 10 different three-value combinations of the numbers between 1 and 5. So we're taking three numbers out of the vector of numbers from 1 to 5, and we see the 10 unique ways of doing that. Something slightly different would be expand.grid, where we might say x = 1:5, y = 1:5, z = 1:5. This returns 125 rows, but what you'll notice is that this is effectively sampling with replacement — we can find the same number in multiple columns. Whereas what we saw with combn is that we don't have any columns where the same number is repeated across the three values we're choosing. So combn is sampling without replacement. Again, if you think about drawing balls or candies out of a bag: with combn, you take the candy out, put it in your mouth, and don't put it back in. Whereas with expand.grid, you don't eat it — you put it back in the bag, okay? And for rarefaction, we're sampling without replacement — sampling with eating, which in this case is the best kind of sampling, I'd say. Now, maybe we don't care about the actual combinations, the sets of different values, but we want to know the number of them, as we do for calculating rarefaction. There we can use the choose function. As I showed you with the formula, it's typically expressed "n choose k", so if I do choose(5, 3) for my n and k values, I get 10. And again, as we saw, there were 10 different columns that emerged from running combn(5, 3). That's what we're getting there.
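The three functions being contrasted, side by side:

```r
combn(5, 3)   # 3 x 10 matrix: each column is one way to pick 3 of 1:5,
              # no repeats within a column (without replacement)
choose(5, 3)  # 10: just the *number* of those combinations

# expand.grid enumerates every (x, y, z) triple, repeats allowed,
# so it behaves like sampling with replacement
nrow(expand.grid(x = 1:5, y = 1:5, z = 1:5))  # 125
```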
Now, if we were to do something like choose(6000, 1828), this is going to lay an egg, right? It gives me an infinite value, which isn't going to work. So how do we do this? Well, thankfully, R has logarithmic versions of these functions built in. So what you could do is say lchoose(6000, 1828), and this gives me a value: about 3,684. This is a trick commonly used in statistics, math, and numerical analysis: if you have really big numbers, so big that your computer can't hold them, you convert everything to logarithms. Working in logarithmic space allows us to brush off some of those great math skills from high school, where we learned about the properties of logarithms. The logarithm of A divided by B is the same as the log of A minus the log of B. To demonstrate that: the log of 3 divided by 5 is about negative 0.51, and that's the same as the log of 3 minus the log of 5 — we get the same value. Now, to get out of logarithmic space, I use that value as the argument to the exponential function. So I can say exp(log(3) - log(5)). What do you think the value should be? Well, it's going to be 3/5, or 0.6 — and sure enough, we get 0.6. And so if we come back to our formula for rarefaction, we see that we can take that ratio of choose functions, work with its logarithm, and then use the result as the argument to the exponential function. Naively, though, that still won't help us, because we'd be taking the log of an overflowed number divided by an overflowed number.
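The overflow, the log-space workaround, and the logarithm identity in one snippet:

```r
choose(6000, 1828)    # Inf: the combination count overflows a double
lchoose(6000, 1828)   # ~3684: the same quantity, held in log space

log(3 / 5)            # -0.5108...
log(3) - log(5)       # same value: log(a/b) = log(a) - log(b)
exp(log(3) - log(5))  # 0.6, back out of log space
```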
But if, again, we use those properties of logarithms and rewrite the log of the fraction as the log of the numerator minus the log of the denominator, then we can come back to this lchoose function: we subtract one lchoose value from the other to get the fraction in logarithmic space, and then, as we saw, we exponentiate. So let's give that a shot now. Again, I want to go to 1,828, and I had my shared piped into a filter on group for F3D0, right? Very good. The first thing I want to do is go ahead and remove those zero values, so I'll do another filter and say value > 0. Okay. And so now I've got 167 rows, indicating the 167 taxa here. I'll call this output example, and I'll clean it up by doing a pull on value, so that example is a vector. The pull function is a lot like select. Whereas select keeps things in a data frame format — you could select a single column, but that returns a data frame with a single column — pull extracts a single column as a vector, as we see here with those 167 different values. Now let's think about rarefying our data. In that ratio of the two choose functions, the numerator's n is going to be the total number of sequences minus Ni — these counts that we have here in example — and the k is the number that we're rarefying down to. So what we could do is call choose, where n is sum(example) minus each value of example — that's the n part of the numerator's n choose k — and the k part, the number we're choosing down to, is 1,828. We get a whole bunch of infinite values, and that's because we have that numerical overflow. So instead of choose, we're going to use lchoose, and now we get those choose values in logarithmic space.
If we wanted the actual values out of logarithmic space, we'd use these as the argument to the exponential function, but again, we're going to get numerical overflow — if I call exp on that, everything turns into an infinite value. All right, so that's the numerator. What about the denominator? Well, again, we're going to use lchoose, with n being sum(example) and k being what we want to rarefy down to, 1,828. And there we get about 3,765 — again, in logarithmic space. Now, we want the difference of these two values, because if you take the log of a fraction, it's the log of the numerator minus the log of the denominator. So I can go ahead and subtract one from the other, and then I can exponentiate. Let's look at that difference: we get values that, as you can see, are much more manageable — relatively small numbers compared to what we saw before when we were running that factorial function. We can now use this as the argument to the exponential function. So we say exp on that, and now we get numbers that are well constrained between zero and one. We then subtract these from one, as we saw from our formula, and we again get numbers between zero and one — some of which are actually one. What this is telling us is that those OTUs have a probability of one of being detected when we rarefy from 6,229 sequences down to 1,828. Finally, we need to sum all these together to get the estimated richness at that rarefied level. So we'll do sum on that, and we get about 126. If we were to do the length of example, we would get 167, and the sum of example is 6,229. So going from 6,229 sequences down to 1,828, we go from an observed richness of 167 down to an estimated richness of 126.
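Putting those steps together for one sample — here's a compact sketch of the same calculation, using a small made-up count vector and depth so the snippet stands alone (in the episode, example is the vector of 167 non-zero counts pulled for F3D0, and the depth is 1,828):

```r
example <- c(500, 250, 100, 50, 10, 5, 2, 1, 1, 1)  # toy OTU counts
depth <- 100                                        # toy rarefaction depth

# log-space probability of *not* seeing each OTU in a draw of `depth`
log_p_miss <- lchoose(sum(example) - example, depth) -
  lchoose(sum(example), depth)

p_detect <- 1 - exp(log_p_miss)  # probability of seeing each OTU
sum(p_detect)                    # expected richness at `depth` sequences
```

Abundant OTUs get a detection probability near one, rare singletons much less, and the sum is the rarefied richness estimate.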
Again, this is the numerical basis for calculating the rarefied richness, or Sr, as we were calling it before: the estimated richness of our data set rarefied down to that smaller value. So of course, we can turn this into a function. Let's do that quickly here. We'll call the function rarefy, with x as the vector of frequency counts coming through, and a parameter called sample. We'll throw our calculation in here, as this will be our special sauce. I don't need that Sr assignment, but I do need to change my values: instead of example, I'll put x, and instead of 1828, I'll put sample. And then I'll also remove the zeros from x inside the function, saying x is x where x is greater than zero. All right. And so now I can come back up to this pipeline — let's just remind ourselves of what we have here. Great, we've got that. I can then summarize and say s_r = rarefy(value, 1828), and we get 126 point something. So good, that worked. So again, instead of filter, we can say group_by(group), so we get the rarefied values for all of our samples. I'll go ahead and add s_o for the observed richness, which is sum(value > 0), and then also our n, which is sum(value). And so now we get our three columns, and we can see the difference between the observed and the rarefied number of taxa for each of our samples. What I'd like to do is make a scatter plot where n is on the x-axis and the points are colored according to whether or not they were rarefied. So I'll collapse those two s columns into a single column with pivot_longer, using cols = c(s_r, s_o), values_to = "s", and names_to = "type". Okay, that's pivoted longer.
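A minimal sketch of the function being described (the name rarefy and the argument names are just illustrative):

```r
# x: vector of OTU counts for one sample; sample: depth to rarefy to
rarefy <- function(x, sample) {
  x <- x[x > 0]  # drop unobserved OTUs
  sum(1 - exp(lchoose(sum(x) - x, sample) - lchoose(sum(x), sample)))
}

rarefy(c(5, 3, 2), 10)  # rarefying to the full depth gives the observed richness, 3
rarefy(c(5, 3, 2), 1)   # with a single draw, the expected richness is 1
```

Two easy sanity checks: at the full depth every OTU is certain to be seen, so the estimate equals the observed richness; at a depth of one, exactly one OTU can be seen, so the probabilities sum to one.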
And then we can do ggplot with aes(x = n, y = s, color = type), add a geom_point, and let's also go ahead and do a geom_smooth. What we can see is that without rarefying, we obviously have that increasing trend in the number of taxa with increasing sampling depth; with rarefaction — the teal color — we've removed that. Again, there's a lot of noise in the data, in part because each point is a different stool sample from a different mouse. In previous episodes, I'd been doing a kind of random sampling from a metacommunity where I pooled everything together; this is the actual data. So hopefully you can see that rarefaction does a really good job of controlling for the uneven sampling effort — certainly a lot better than you'd get without any type of correction. Cool. I hope you've enjoyed this deep dive into the math of rarefaction. As I said, this approach tends to be computationally a bit slower than doing it empirically, so in the next couple of episodes, I'll show you how we can do this empirically, and maybe do a little bit more thinking about what is actually going on in rarefaction. We often do this with richness because it's a pretty tractable parameter to think about, as we have in this episode. Again, I know many people get into biology because they hate math. I'm a biological engineer by training — I love math, I love biology, and I love the ability to combine them. So that's who I am. But know that if you're using R to make plots or do any type of analysis, under the hood you're doing the math. And I'm a firm believer that you owe it to yourself to go through the math at least once and try to understand what's going on in the formulas — what the different terms represent and what we're really measuring. Because otherwise you're using the math as a black box, and you really don't have an intuition for the results you're getting out. So be not afraid of math.
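The whole pipeline, from counts to plot, can be sketched end to end like this — with a tiny made-up shared table standing in for the real data, and the rarefy() helper from earlier in the episode included so the snippet runs on its own:

```r
library(tidyverse)

# helper from earlier in the episode
rarefy <- function(x, sample) {
  x <- x[x > 0]
  sum(1 - exp(lchoose(sum(x) - x, sample) - lchoose(sum(x), sample)))
}

# toy stand-in for the real `shared` data frame
shared <- tibble(
  group = rep(c("A", "B"), each = 3),
  name  = rep(c("Otu1", "Otu2", "Otu3"), 2),
  value = c(50, 30, 0, 500, 300, 200)
)

long <- shared %>%
  group_by(group) %>%
  summarize(s_r = rarefy(value, 80),  # 80 = smallest toy library size
            s_o = sum(value > 0),
            n   = sum(value),
            .groups = "drop") %>%
  pivot_longer(cols = c(s_r, s_o), names_to = "type", values_to = "s")

p <- ggplot(long, aes(x = n, y = s, color = type)) +
  geom_point() +
  geom_smooth()
```

With the real data, s_r flattens out across sampling depths while s_o keeps climbing with n, which is the pattern described above.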
Keep practicing with this and we'll see you next time for another episode of Code Club.