Hey folks! Setting aside the controversy over whether or not we should be rarefying our data, in the recent episodes I have been talking about how rarefaction works: the math behind it and what's going on conceptually. Two episodes ago I showed you how we can use the choose function and factorials to get the pure math, the exact computation of the expected number of taxa we would see at a smaller sampling depth than what we may have gotten experimentally. That way we can compare samples' richnesses on an equal footing, at an equal level of sampling effort.

In the last episode I showed you the conceptual idea of a collector's curve, where you might go into a bag of candies, or out into a field, and sample individuals, keeping track of how many new things you see as you go. That generates a collector's curve: on the x-axis the number of individuals sampled, on the y-axis the number of new types seen. We talked about how that profile of new taxa versus sampling effort is a random process, so if you repeated it you'd get a differently shaped curve. What rarefaction curves really are is that collector's-curve process repeated, say, a hundred or a thousand times, with the average taken across all of those collector's curves.

In general, though, I'm not super interested in a rarefaction curve. I don't need to know the expected number of taxa for every possible number of individuals I could have sampled; it's just not that relevant to me. Typically I would pick a threshold, say 2,000 or 10,000 individuals, and then I'd want to know how many taxa I would see at that sampling depth across all of my samples. So that's what I'm going to talk about today: empirically, how can we pick a threshold sampling depth and then get the expected number of taxa for that depth? We'll do this empirically and compare it back to the exact solution using the rarefaction function we created a few episodes back.

Here I am in RStudio with my rarefaction.R script open and ready to go. You can get your own copy so you can follow along; it's really the best way to learn this stuff. You'll find a link to the code and the data down below in the description, and the blog post has instructions for getting caught up. As we've seen, I load the tidyverse library, set a random number generator seed, and then read in data from a mouse study that my lab published a number of years ago (again, you'll find more information about that study in the description). This generates a table called shared that has a column for the sample ID, which I call group; a column for the name of the taxa, in my case OTUs; and a column for the values, the number of times each OTU is seen in each sample. Then I've got my rarefaction function, which does the exact computation of the expected number of taxa for a given sampling depth and a given frequency distribution. I'm going to load all of this. There's other code in this script that went into generating the rarefaction curves and collector's curves that I'm not going to worry about today. And so if we look at shared, we see this three-column data frame.
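If you're coding along without grabbing the repository, a minimal sketch of that setup might look like the following. The stand-in shared tibble here is made up (only the Otu1 and Otu2 counts for F3D0 echo numbers you'll see below), and the body of rarefaction() is my reconstruction of the choose-function math from two episodes ago rather than a copy of the script:

```r
library(tidyverse)

set.seed(19760620) # any fixed seed works; it just makes the subsampling repeatable

# made-up stand-in for the mouse data: one row per sample ("group") and
# OTU ("name") pair, with the number of times that OTU was seen ("value")
shared <- tribble(
  ~group, ~name,  ~value,
  "F3D0", "Otu1", 115,
  "F3D0", "Otu2",  26,
  "F3D0", "Otu3",   9,
  "F3D1", "Otu1",  80,
  "F3D1", "Otu2",  40,
  "F3D1", "Otu3",   0
)

# exact expected number of taxa when drawing sample_size individuals without
# replacement from the counts in x; lchoose() keeps choose() from overflowing
rarefaction <- function(x, sample_size) {
  x <- x[x > 0]
  n <- sum(x)
  sum(1 - exp(lchoose(n - x, sample_size) - lchoose(n, sample_size)))
}
```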
When I'm taking on a problem like this, I try to make things as simple as possible before I expand them out. So in this case I'll filter my data down to the sample named F3D0, and now I have the 1,588 rows that correspond to that sample. I'll also go ahead and limit it to values greater than zero; there are some zeros in here, and we see there are 167 different taxa with observed counts.

What I'd like to do now is deconvolute this data frame. We saw in the previous episode that I can take that value column, like the 115 for Otu1, and uncount the data so that it replicates Otu1 115 times, Otu2 26 times, and so on. So we run uncount on value, and sure enough I've got Otu1 a whole bunch of times. Now I want to subsample this data frame to pull out 1,828 randomly selected rows, which I can do with the sample_n function, saying size = 1828, and then count the number of distinct OTU names. In the last episode I wanted something like a cumulative n_distinct function and didn't have one; here a plain summarize will do, so I'll say s, for the number of species or taxa, equals n_distinct on the name column, and I get back a value for s of 130.

Now I'll try to scale out to all the samples. I'm going to remove the filter (line 73 in my script). I also don't need the value-greater-than-zero filter, because when I do the uncount, an OTU with a count of zero gets deconvoluted zero times, so it doesn't show up. In its place I'll add a group_by on group and pipe that into my uncount. Let's give that a run and see what we get. Sure enough, we get 227 rows, one for each of our samples, with the number of observed taxa. You'll notice that the number I had for F3D0 above is fluctuating; I think it was 126 before, and now it's 119. That's because the sample_n function is using a random number generator, so I'll get a different value every time I run it. And that's why I need to repeat this, say, a hundred or a thousand times, so I can get a sense of the central tendency: the average number of taxa I would see if I drew 1,828 sequences from each of my samples.

So I'm going to turn this into a function that I can replicate any number of times. I'll call it subsample. It will be a function of data, the data frame coming in like shared here, and sample_size, the number of sequences I want to subsample down to. I'll create my curly braces to define the body of the function, bump the pipeline over, and replace shared with data and 1828 with sample_size. If I load that, I should be able to call subsample on shared and 1828 and get output much like what I had before. Then we need to replicate this a bunch of times, which we can do with the map_dfr function. I'll do map_dfr over the values I want to iterate across, 1 to 10 to start; a nice small number so we're not waiting a long time only to find error messages. We put a tilde in front of the subsample call, ask for an id column that I'll call iters, and close out the parentheses.
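Pulled together, the function and its replication look roughly like this. It assumes the toy shared tibble from the sketch above, so I'm subsampling to 100 here rather than the 1,828 I use on the real data:

```r
# one empirical subsampling: deconvolute the counts into individual
# observations, draw sample_size of them from each group, and count
# the distinct OTUs that remain
subsample <- function(data, sample_size) {
  data %>%
    group_by(group) %>%
    uncount(value) %>%               # zero-count OTUs drop out here on their own
    sample_n(size = sample_size) %>%
    summarize(s = n_distinct(name))
}

# repeat the subsampling 10 times; .id tags each replicate so we can
# average over them later (use 1828 with the full mouse data)
subsamplings <- map_dfr(1:10, ~ subsample(shared, 100), .id = "iters")
```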
Running that, we now get 2,270 rows: 227 samples times 10 replicates. I'll call this subsamples, or maybe subsamplings. I can then take subsamplings and feed it into a group_by on group and a summarize with s being the mean of s (I believe that column was called s; yep, it is). We run that and get our rarefied value for each of the different groups; I'll call this empirical.

I also want an exact version, so I'll take shared, feed it into a group_by on group, and summarize with s equal to rarefaction of the value column, the counts for each OTU in each sample, at a sampling depth of 1828. Now I get my exact values, so I'll call this exact, and we see the values here look fairly similar to what we got for empirical.

Now I can bring those together. Previously I might have done this with a join, but let me show you a different way: bind_rows. I can bind two data frames together row-wise, one on top and one underneath, as long as they have the same column names, and exact has group and s while empirical also has group and s. So I'll call bind_rows with empirical = empirical and exact = exact. The reason I'm naming the data frames this way is that I can then ask for an id column, which I'll call approach; as you'll see, we now have a column for the approach that says empirical or, further down, exact.

One thing we could do is plot the points as a slope plot, my favorite plot; you've seen me make a lot of slope plots in the past. We'll do ggplot with aes, x equal to approach (so exact and empirical), y equal to s, and group equal to group so we connect the points by their group, and then add geom_line. As we can see, the lines look pretty flat. We're only using 10 iterations, so there's a bit of noise in the data, but I'd like to build out these plots and the analysis before we scale up to a larger number of rarefaction iterations.

That's one way to look at it; I don't know that it's super useful. Another approach is to put the exact values on the x-axis and the empirical values on the y-axis, represent each group with a point, and expect the points to fall on a straight line. So let's take this pipeline again. This might be a case where inner_join would be useful, because I want exact and empirical in separate columns, but we'll use pivot_wider instead; I'd encourage you to experiment with inner_join in place of bind_rows and pivot_wider, since you'll get to the same place. So I'll do pivot_wider with names_from = approach and values_from = s, and sure enough we now have an empirical and an exact column. Then I can do ggplot with aes, x equal to exact, y equal to empirical, plus geom_point, and yep, the points fall on a pretty good line. Let's add a geom_abline, which lets you indicate a slope and intercept; we'll say intercept = 0, slope = 1, and make the color gray. I'm also going to add theme_light, making sure to spell "light" correctly.
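Collected in one place, those combining and plotting steps might look roughly like this; the empirical and exact tibbles below are tiny made-up stand-ins for the two summaries, each needing only group and s columns:

```r
library(tidyverse)

# made-up stand-ins for the two richness summaries computed above
empirical <- tribble(~group, ~s, "F3D0", 118.6, "F3D1", 76.9)
exact     <- tribble(~group, ~s, "F3D0", 119.1, "F3D1", 77.2)

# stack them; the argument names become the values of the new approach column
both <- bind_rows(empirical = empirical, exact = exact, .id = "approach")

# slope plot: one line per sample connecting its empirical and exact values
both %>%
  ggplot(aes(x = approach, y = s, group = group)) +
  geom_line() +
  theme_light()

# scatter plot: spread approach into two columns, then plot one against the
# other with a gray 1:1 reference line behind the points
both %>%
  pivot_wider(names_from = approach, values_from = s) %>%
  ggplot(aes(x = exact, y = empirical)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "gray") +
  theme_light()
```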
Let's go ahead and add theme_light to that previous slope plot as well. And this, again, is what our scatter plot looks like with the ideal case, where the two values are identical, drawn in gray behind the points.

A final way I'd like to look at the data is the percent error of the empirical values relative to the exact values. I'll take those two lines again (97 and 98 in my script) and do a mutate to create a column called error: 100 times exact minus empirical, divided by exact. That gets us our percent error. Then I'll feed this into a density plot: ggplot with aes, x equal to error, plus geom_density, and I'll go ahead and add theme_light. There we go: we see a central tendency right at zero, maybe a little to the right of zero. We can get the exact value by borrowing the first three lines of this latest pipeline and doing a summarize with the mean of error and the sd of error. Our average error is right at zero, 0.00577 percent, so it's pretty small, but we do see a fairly wide spread in the data, with a standard deviation of just over one.

Now let's repeat this, going back up and using 100 iterations instead of 10, and see what happens to the shapes of these plots as well as to the mean and standard deviation. The mean is still pretty close to zero, and the standard deviation has gone from just over one to under 0.4; before we had a mean of 0.00577 and a standard deviation of 1.19, and now the variation around the mean is quite a bit smaller. If we go over and look at our figures, this geom_density plot previously went out to about plus or minus two, and now we're at about plus or minus one. Similarly, in the plot with exact on the x-axis and empirical on the y-axis, the points are a lot tighter. And here again is our slope plot; the lines visually look a lot more horizontal, although there is certainly still some vertical variation, if you will.

Let's now do a thousand iterations. Repeating the code with a thousand iterations, we still have a very small mean, right at zero, and the standard deviation went from about 0.33 down to 0.121. As we increase the number of iterations, the standard deviation is going to keep dropping. Looking at our density plot, we again see a bell-shaped curve centered just to the left of zero, as we would have expected from that summary output, and we notice that the curve that went out to plus or minus two with 10 iterations, and then to plus or minus one, is now down to about plus or minus 0.4 or so. The plot with the exact value on the x-axis and the empirical value on the y-axis looks pretty dead on, as does the slope plot, which looks pretty flat across all the different groups. Usually people are pretty content with a hundred or a thousand iterations, and as we can see, there's not a huge difference in the values you'll get for the same data.
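If you want to recreate that percent-error check, a compact sketch might look like this, with a made-up wide table standing in for the pivot_wider() output from before:

```r
library(tidyverse)

# made-up stand-in for the wide table: one row per sample, with the
# empirical and exact richness estimates side by side
wide <- tribble(
  ~group, ~empirical, ~exact,
  "F3D0", 118.6,      119.1,
  "F3D1",  76.9,       77.2,
  "F3D2", 101.3,      100.8
)

errors <- wide %>%
  mutate(error = 100 * (exact - empirical) / exact) # percent error

# shape of the error distribution; it should tighten around zero as the
# number of iterations grows
errors %>%
  ggplot(aes(x = error)) +
  geom_density() +
  theme_light()

# central tendency and spread of the percent error
errors %>%
  summarize(mean = mean(error), sd = sd(error))
```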
I hope you've found this progression of episodes useful. Two episodes ago we talked about the exact solution for calculating the expected number of taxa we would see at any sampling depth for a given frequency distribution. In the last episode we saw how we could make collector's curves and then convert them into rarefaction curves; again, a rarefaction curve is the average of a whole bunch of collector's curves. And in this episode we honed in on a single sampling depth that we want to look at across all of our different samples. Of course, there are many different ways to do the same things I've done here. As always, I'd encourage you to try this with your own data. But I'd also encourage you, after watching this and running through the code with me, to start with a blank script and repeat the analysis I did: see if you can write your own code to rarefy the data and then compare it back to the exact solution, and see whether you get a value that's really close to the exact value predicted by the mathematical theory. If you can do that, then you've definitely mastered a whole bunch of concepts beyond what rarefaction is, and that's what's really important here: the ability to think through a problem and pull together the different tools we have at our disposal in R. So practice with that exercise, and we'll see you next time for another episode of Code Club.