Hey folks! If you've been watching recent episodes of Code Club, you know that I've been talking way too much about rarefaction. Rarefaction is a pretty controversial topic these days in the microbial ecology literature, and I suspect in the broader ecology literature as well. One alternative was proposed by Dr. Amy Willis at the University of Washington in Seattle: instead of rarefying our data, she suggests using the estimated richness, because that should be independent of the number of sequences, the number of observations, in each of our samples. In her Frontiers in Microbiology commentary piece, she has a toy example showing how this can be done and the benefits of estimating richness relative to rarefying the data. In an ideal case, yes, the estimate would be independent of sample size and would truly reflect the actual richness of our community. But how does this actually work with real data?

In today's episode, I'm going to generate another null model of my overall communities, sub-sampling the pooled community down to the same number of sequences that were in each original sample. Each sample then represents a statistical sampling of a pooled community, so all of those samples should have the same richness, independent of their sample size. We saw previously that when we rarefy the data, we do get uniform richness across all those samples. But what happens when we use this estimation approach?

I have my estimate_richness.R script open here, ready to go. If you want your own copy of it, along with the data that I'm using, there's a link to a blog post down below in the description, as well as a video up here that I'll put a link to, showing how you can get up to date. I'll go ahead and load all this code. As we've seen, we have our rand data frame, which has the columns group, name, and value. There are about 227 different samples represented across 44,000 rows, and for each group, that is, for each sample, we have the different OTUs or taxa along with their observed frequency counts.

The package we'll use to estimate richness is called breakaway, which was developed by Dr. Willis's lab, so I'll run library(breakaway) after installing it. It did install breakaway, but it gave me some red text up here that I want to double-check. It says "warning: dependency 'phyloseq' not available." phyloseq is a package developed by Paul McMurdie and Susan Holmes for working with microbiome data. I've never really used it before, but we need to install it because it's a dependency for breakaway. phyloseq is part of Bioconductor, and we saw how to install from Bioconductor a few episodes back, so let's see if we can remember how to do that. I'll run library(BiocManager) and then BiocManager::install() to install phyloseq. That might take a few moments to go through. There is a warning message about the gert package; I'm going to assume that doesn't matter, and we'll come back to it if there's a problem down the road. I don't think there will be. Then I'll reload breakaway with library(breakaway).

One thing about breakaway is that it needs the data in a slightly different input format than what we've been working with. If I take rand and filter on group to get F3D0, the first sample, I've got my group, my name, and my value columns.
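If you're coding along, here's a minimal sketch of that setup. The file name "rand.tsv" is a hypothetical placeholder for however you load the rand data frame from the earlier episodes; the install lines reflect that breakaway is on CRAN while phyloseq lives on Bioconductor.

```r
# Minimal setup sketch; "rand.tsv" is a hypothetical placeholder for
# however you load the rand data frame from the earlier episodes.
library(tidyverse)

# breakaway is on CRAN; phyloseq, one of its dependencies, is on
# Bioconductor, so it gets installed through BiocManager.
# install.packages("BiocManager")
# BiocManager::install("phyloseq")
# install.packages("breakaway")
library(breakaway)

rand <- read_tsv("rand.tsv")  # columns: group, name, value

rand %>%
  filter(group == "F3D0")  # peek at the first sample
```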
What breakaway wants is a column of the values, along with the number of times each of those values occurs. So what I can do is count(group, value). Of course, I only have one group here, but if we scale this up to all of the other samples, we'll get the count of each value within each group. And now I see the group column, the value, and n, the number of times each value occurs. This is basically saying that there are 48 OTUs or taxa that showed up only one time in F3D0, 16 that showed up two times, and so on. It's a bit of a pivot from what we've been working with. We can then do a select to remove the group column and feed the result into breakaway().

We get a richness estimate of 317 with a standard error of 68.57 and a confidence interval from 192 to 5555. That's a really wide confidence interval. It also tells us that the cutoff it used was 15, meaning it focused on the taxa counts from the OTUs that showed up one time through those that showed up 15 times to get a better fit to the data. If you want to learn more, I encourage you to go check out the papers Dr. Willis has written to develop this methodology.

We'd like to get some information out of this breakaway object: the estimate, the interval, the modeling approach, and so on. We can take ba and pipe it to str(), and we see that it has estimate, error, interval, name, and so forth. If I do ba$estimate, I get the estimate; if I do ba$interval, I get the interval. So I can call on those values to extract them from the output of breakaway().
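Here's a minimal sketch of that single-sample run, assuming rand as above and that breakaway() accepts the two-column frequency-of-frequencies table directly, as it does in this episode.

```r
# Frequency-of-frequencies table for one sample: "value" is how many times
# an OTU was seen, "n" is how many OTUs were seen that many times.
f3d0 <- rand %>%
  filter(group == "F3D0") %>%
  count(group, value) %>%
  select(-group)

ba <- breakaway(f3d0)

ba$estimate  # point estimate of richness (317 in this episode)
ba$interval  # lower and upper confidence limits
ba$model     # which model breakaway settled on
```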
All right, I'm going to come back to this code chunk. For now, I'm going to remove the assignment operator, remove the filter, and remove the select that dropped the group column. So what we have at this point is the frequency counts for each of the different groups. What I want to do is make a data frame for each group, and to do that we can use nest(), saying data = -group. This means I get a column called data that contains, for each group, all of the columns except the group column; the minus means "except for." So now I've got a two-column table: the first column is the group ID, and the second column is data, where each row holds a separate table with the frequencies as well as the frequency of those frequencies.

Cool. Now we can feed this into a mutate() using the map functions. Let's start simple and count the number of sequences, as well as the number of OTUs, in each of our samples. If I were interested in these values for their own sake, this is not the approach I would take, but it's leading us on a journey, so to speak, toward bringing in the output from breakaway. In the mutate() I'll start with n_seqs, which uses map_dbl(), the map function that outputs a vector of doubles. The input will be data, and then I'll give it a function where .x stands in for each group's data table: n_seqs is the sum of .x$value multiplied by .x$n. This outputs the number of sequences. Again, the value column we saw way back up here is the frequency, and n is the frequency of those frequencies, so the total number of sequences is the sum of the two columns multiplied by each other. If we'd also like the observed richness, sobs, I'll do map_dbl() on data again and take the sum of .x$n. Now I have this data frame with my group, the data, the number of sequences, and sobs. Hopefully this practice with nest(), mutate(), and the map() functions rings some bells, so to speak.

Now we can move on to something a bit more complicated. What we saw before with ba is that it's a more complicated object, so I can't throw breakaway() directly into here. Actually, you know what, I can, so let's do that: ba = map(data, ~ breakaway(.x)), leaving it as plain map(). This outputs the data frame we had before, but with a new column, ba, which is a list of alpha diversity objects, and that's not really helpful to us. When we've done this before with statistical tests, we've been able to wrap the function in tidy(), which outputs a row of a data frame. That's what I want to do here: I want get_breakaway() to be a function that returns the information I'm interested in.

So get_breakaway() takes in a data frame x, and inside it we do ba <- breakaway(x), running breakaway on x. We then want to output a table whose first value is est, which is ba$estimate. We'll have lci for the lower confidence limit, which is the first element of ba$interval, and uci for the upper confidence limit, the second element. We also want the model, the method that was used, so model = ba$model. That should be good. Let me put a couple of line breaks in here to make it easier to see what's going on without scrolling off the side of the window.

Now we can run this with get_breakaway(), and we see that the ba column is no longer a list of those funky objects but a list of tables that are one row by four columns. Let's do a select to remove the data column and then unnest the ba column, so we get a good-looking data frame and we're good to go. Running breakaway on 200-some samples does take a bit of time, so I'm going to save this to a variable that I'll call b_analysis.

Now I can take my b_analysis and make some plots. We'll do ggplot() with n_seqs on the x and est on the y, and let's do color = model; I'm not sure whether or not it used the same modeling approach for all of the points. Then we'll add geom_point(). We see there's a pretty wide range in the estimates, some as high as 10,000, so I'm going to zoom in a bit with coord_cartesian() and set the y limits from zero to, let's say, 2000.
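Pulling those pieces together, here's a sketch of the nested workflow up to this point. The column names est, lci, uci, and model match what I set up in get_breakaway(); the b_analysis name and the $estimate, $interval, and $model fields are as described above.

```r
# get_breakaway() wraps breakaway() so each result comes back as a
# one-row tibble that can be unnested into the main data frame.
get_breakaway <- function(x) {
  ba <- breakaway(x)
  tibble(est = ba$estimate,
         lci = ba$interval[1],
         uci = ba$interval[2],
         model = ba$model)
}

b_analysis <- rand %>%
  count(group, value) %>%
  nest(data = -group) %>%
  mutate(n_seqs = map_dbl(data, ~ sum(.x$value * .x$n)),  # total sequences
         sobs = map_dbl(data, ~ sum(.x$n)),               # observed richness
         ba = map(data, ~ get_breakaway(.x))) %>%
  select(-data) %>%
  unnest(ba)

b_analysis %>%
  ggplot(aes(x = n_seqs, y = est, color = model)) +
  geom_point() +
  coord_cartesian(ylim = c(0, 2000))  # zoom in on the bulk of the estimates
```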
So now that we've zoomed in, it does seem like the estimate goes up as we increase the number of sequences, and the negative binomial values seem to be a bit larger than the Kemp-based values. Let's see whether that trend in the data is real: I'll add geom_smooth() to draw a fitted line through the points, and for now I'll also remove the model coloring. Yes, it does seem that the estimate goes up as the number of sequences goes up.

Let's see if we can look at this more clearly by comparing the rarefied data, the observed data, and another estimator called Chao1. First, I'm going to add a rare column to get rarefied richness, using map_dbl() with data and the rarefy() function. The input to rarefy() is a bit different from what we've been using with breakaway: with breakaway, again, we're passing the frequencies of the frequencies, while rarefy() wants the frequencies themselves, so we need to unroll those frequencies of frequencies. To do that, we can use rep(.x$value, .x$n). That's the repeat function: it takes each value in the value column and repeats it n times. We then need to give rarefy() a sample size, which, if I recall, is 1828; that's the size of the smallest library in the data set. And now I see my b_analysis has the rarefied values along with sobs and the estimated values.

Next, I'll come down to my b_analysis, select the group, n_seqs, sobs, rare, and est columns, and then use pivot_longer() so that each of the methods ends up in one column and the estimate for each method in another column, because what I'd like to do is make a scatter plot of the three different methods, colored by method. So pivot_longer() on everything except group and n_seqs. Now we've got the group, n_seqs, the name of the method, and the value. We can pipe this into ggplot() with n_seqs on the x, value on the y, and name as the color; we'll do geom_point() and leave on that geom_smooth(). Now we see that the rarefied line is dead flat, like we've seen in previous episodes. The blue line is the observed richness, which of course increases with sampling effort, and the estimated value also goes up with sampling effort.
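Here's a sketch of that comparison. I'm assuming rarefy() here comes from the vegan package (the episode doesn't name the package), and 1828 is the smallest library size mentioned above.

```r
library(vegan)  # assuming rarefy() comes from vegan

b_analysis <- rand %>%
  count(group, value) %>%
  nest(data = -group) %>%
  mutate(n_seqs = map_dbl(data, ~ sum(.x$value * .x$n)),
         sobs = map_dbl(data, ~ sum(.x$n)),
         # rep() unrolls the frequency-of-frequencies back into per-OTU
         # counts; 1828 is the smallest library size in the data set
         rare = map_dbl(data, ~ rarefy(rep(.x$value, .x$n), sample = 1828)),
         ba = map(data, ~ get_breakaway(.x))) %>%
  select(-data) %>%
  unnest(ba)

b_analysis %>%
  select(group, n_seqs, sobs, rare, est) %>%
  pivot_longer(cols = -c(group, n_seqs)) %>%  # one row per method per sample
  ggplot(aes(x = n_seqs, y = value, color = name)) +
  geom_point() +
  geom_smooth()
```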
So let's add another estimator, and that is going to be Chao1. I'll make get_chao, a function of x. The formula for Chao1 is pretty simple to remember: it's the number of observed OTUs plus the square of the number of OTUs that show up once, divided by two times the number that show up twice. Since we've got this frequency-of-frequencies data frame, sobs is going to be the sum of x$n.

To test what's inside this function, I'm going to make a practice data set x: I'll take rand, pipe it to count(group, value), and, to keep things simple, filter to get group F3D0 and select to remove the group column. So now I have x as the frequency-of-frequencies table for that sample. If I look at sobs, I get 189. I now want the singletons, which should come back as 48. I could do x[1, "n"] to get the 48, but it might be a little safer not to rely on row one, because if, say, that table had a row for a count of zero, the singletons wouldn't be in the first row. So instead let's filter where value equals one, which gets us back the 48, and then pull(n) to get the actual vector value; this is a little bit of a mix of tidyverse and base R. I'll save that as sing. Then let's do the same with doub for the doubletons, filtering where value equals two. Let me make sure sing is loaded, and I'm not sure I ever loaded sobs, so I'll run those again. Then we can do sobs + sing^2 / (2 * doub), and that gives us 261 as the estimate. Great.

So now we've got get_chao loaded, and we can add it into the pipeline: chao = map_dbl(data, ~ get_chao(.x)), followed by that comma. Then, down in the second pipe where we're making the plot, we'll add chao to the select along with all the others. And now we see that red is Chao1 and green is the breakaway estimate. So regardless of the estimator, in this case breakaway or Chao1, we still see richness increasing with sampling effort, which is worth keeping in mind regardless of whether you use rarefaction in your own work.

I hope you found today's episode interesting: again, thinking through a problem and trying to create conditions where we can test the hypothesis that estimating richness will get us around the sensitivity to sampling effort. Certainly in our hands, with these data, there definitely is an influence of sampling effort on the richness estimates. The other thing that's a little strange and worrisome to me is that the estimates are very sensitive and span a really wide range. I didn't show you the confidence intervals here, but they're gigantic, so wide that I almost feel they're not really informative for our purposes of trying to do microbial ecology.

Anyway, I hope you enjoyed the review of using nest(), mutate() with map(), and then unnest() to work through how we can bring more complicated results into our existing data frame. I encourage you to play with this using your own data, as always. If you've got any comments or feedback, by all means leave a note down below in the comments; I'd really appreciate that interaction. We'll keep practicing with this, and we'll see you next time for another episode of Code Club.
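P.S. For anyone coding along, here's a compact sketch of the get_chao() helper described above, under the same assumptions as the earlier sketches (tidyverse loaded, frequency-of-frequencies tables with value and n columns).

```r
# Chao1 from a frequency-of-frequencies table:
#   S_chao1 = S_obs + f1^2 / (2 * f2)
# where f1 and f2 are the numbers of OTUs seen once and twice. This sketch
# assumes every sample actually has singletons and doubletons.
get_chao <- function(x) {
  sobs <- sum(x$n)
  sing <- x %>% filter(value == 1) %>% pull(n)  # f1, singletons
  doub <- x %>% filter(value == 2) %>% pull(n)  # f2, doubletons
  sobs + sing^2 / (2 * doub)
}

# For F3D0: 189 + 48^2 / (2 * 16) gives 261, matching the episode.

# In the pipeline, it rides along with the other estimates:
#   mutate(..., chao = map_dbl(data, ~ get_chao(.x)))
# and chao gets added to the select() before pivot_longer() for the plot.
```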