Hey folks, in recent episodes of Code Club, we have been using tools from the tidyverse to do our own rarefaction analysis of the number of different taxa in a variety of microbial communities. We've been doing a lot of the heavy lifting ourselves, and I think that's been useful for learning more about rarefaction and working with random processes, and also just for getting more reps with dplyr and other functions and packages from the tidyverse. Well, in today's episode, I'm going to show you how you can streamline your analysis by using tools built into a great package for ecological analysis called vegan. We will use vegan to rarefy our data to a specific number of sequences, we'll look at the probability of seeing each of our taxa at that level of sampling effort, and then we'll generate rarefaction curves for our 227 different samples. We'll also see how to get the data out of a vegan object and into an object we can use to make a nice plot with ggplot2. Here's the R script I'll be working with today, vegan_rarefy.R. If you want a copy of this script, as well as all the data, so that you can follow along (and that's something I really encourage, because you'll learn it so much better if you're playing with the data and the code yourself), there's a link down below in the description to a blog post for today's episode. You can use that information, along with the video I have up here in the corner, to figure out how to get caught up with where we are today. Again, following along is the best way to learn this material. So, we read in the tidyverse tools, and then there is this big chunk of code for reading in data from a file in data/ called mice.shared. mice.shared was generated using a tool called mothur.
mothur is pretty specific to analyzing 16S ribosomal RNA gene sequences, and it will also do rarefaction for you. So if you're getting data out of mothur, you may as well just stay in there and do the rarefaction there. But I realize that not everyone is working with microbial ecology data, or using mothur even if you are. That's why I want to spend a little bit of time showing how we can work from a data frame or tibble like shared, which has a column for the sample ID, a column for the different taxa, and a column for the number of times we saw each taxon in each of our samples. I will go ahead and load everything here. I also have a function in this script for calculating the number of OTUs, or taxa, that you would expect to see for a given level of sampling effort. Looking at shared, we again see those three columns that I mentioned. As I mentioned, we're going to do a lot with vegan today, so I need to go ahead and load the vegan package. If you don't have it installed, be sure to go over to the Packages tab in the lower-right pane of RStudio and install vegan; then library(vegan) gives us access to all the great vegan functions. Now, one of the things that we see when we do this is a message that the following object is masked by .GlobalEnv, because vegan has a function called rarefy and I also have a function called rarefy. So what I'm going to do is rename my function to my_rarefy. Again, we have our data as a tibble with three columns, but to have data properly formatted for vegan, we actually need a data frame where taxa are across the columns, our samples are in the rows, and our sample names are the row names. So we need to do a little bit of work to get this tidy data frame into good working shape for going into vegan.
So I'll take my shared data and pipe it into pivot_wider(), with names_from = name and values_from = value. I don't think we actually need those two arguments when the columns are named name and value, since I think those are the default names it automatically looks for, but I like to leave them in to be explicit and to show you that if you have different column names, you can still do this pivot_wider(). Something else I want to add in here is values_fill = 0. What's happening, if we look back up at our shared file, is that the samples in the first column, group, will go in the rows, and the OTUs, or taxa, will go in the columns. You can imagine that as you pivot wider, there might be sample-taxon combinations that aren't in the tidy version, and by default pivot_wider() puts in an NA for those. What I'm saying with values_fill = 0 is: instead of an NA, please put in a zero. And so we get this wide data frame, which I'm going to convert with as.data.frame(). The reason I need to convert it to a data frame is that I need to make my samples, my groups, the row names, and tibbles do not allow row names. I'll call this shared_df, and then I can set rownames(shared_df) to shared_df$group. Then I can take shared_df and use square-bracket notation, shared_df[, -1], to remove the first column. Now if I run str() on shared_df, I find that I've got all my different columns for my different taxa, the different OTUs: 227 rows and 1,588 columns. So we're in good shape. That is now stored as a data frame, and we're ready to take it into vegan to do things like rarefaction.
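Putting those reshaping steps together, here's a sketch of the code just described (assuming shared is the tidy tibble with group, name, and value columns read in earlier):

```r
library(tidyverse)

# Pivot the tidy counts wide: samples in rows, taxa in columns,
# filling in zeros for sample/taxon combinations we never observed
shared_df <- shared %>%
  pivot_wider(names_from = name, values_from = value, values_fill = 0) %>%
  as.data.frame()   # tibbles can't carry row names, so drop to a data frame

rownames(shared_df) <- shared_df$group  # sample names become row names
shared_df <- shared_df[, -1]            # remove the group column itself
```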
And so what I will do is call rarefy() on shared_df, and I'm going to rarefy to the size of the smallest sample. Well, how do we get that? Let's step back here and go back up to shared. We'll do a group_by(group), right? And then we'll do a summarize(n_seqs = sum(value)). So now what we get is a column for our group and one for n_seqs. I'm going to do another summarize on this, min = min(n_seqs), and that gives us 1828. I'll go ahead and do a pull(min), and we will assign the output to min_n_seqs. Now if you look at min_n_seqs, we get 1828; that's great. Then I can plug that in here, and I get back 1403.62, which doesn't seem right for a number of reasons. First of all, I'm only getting one value when I should be getting 227, one for each of the samples, and I think that's actually the value across all of my samples pooled together. What's happening is that, even though I renamed my function to my_rarefy, R is still using my version of rarefy instead of vegan's, right? One way we can check that is to call vegan::rarefy. What the vegan:: prefix does is tell R: hey, I know there are multiple versions of rarefy out there, use the version from vegan, please. That gives us quite different output: now we get a different richness value for each of our samples. So I could keep writing vegan::rarefy, but if I want to keep things straight without worrying about that vegan:: approach, what I can do is restart R. Then, because my function is renamed my_rarefy in the script, rerunning everything should keep things straight. Alternatively, I'm going to go ahead and clear out my environment so that it is now empty, and then rerun everything, and it should work.
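The steps for finding the smallest sample and rarefying to it can be sketched like this (assuming the shared and shared_df objects from above):

```r
# Total sequence count per sample, then take the minimum across samples
min_n_seqs <- shared %>%
  group_by(group) %>%
  summarize(n_seqs = sum(value)) %>%
  summarize(min = min(n_seqs)) %>%
  pull(min)

# Be explicit that we want vegan's rarefy, not a function of our own
# that happens to share the name
vegan::rarefy(shared_df, min_n_seqs)
```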
Again, I have vegan::rarefy here, but if I remove that vegan:: prefix, I now get the correct version of rarefy, and if I look back in my Environment tab, I see my_rarefy under Functions and no rarefy anymore. So we're in good shape. Again, that's a little thing you run into from time to time when you have clashing functions coming from different packages, or when something in a package collides with your own version; just be aware that you can always use the vegan::rarefy form. All right. So we have now seen how to get out a vector of values rarefied to a specific number of observations. It's a named vector with 227 values, each with its own name. I can also pipe this into as_tibble() with rownames = "group", and now I get a tibble with a group column and a value column. I'll go ahead and pipe that into select(group, vegan = value). One thing I wonder is whether vegan is using the exact rarefaction calculation or the bootstrapped, randomized process. I'll call this tibble vegans. Then I'm going to do the same type of thing with my_rarefy. So again, we'll take shared and pipe that into group_by(group), right? And then we'll do a summarize(mine = my_rarefy(value, min_n_seqs)), giving it the value column and min_n_seqs. This then gives us my output, which I'll call mine. Then I will do an inner join between these two: inner_join(vegans, mine, by = "group"). So now I've got those two columns side by side, and I can do a mutate(diff = vegan - mine) and then summarize the mean and standard deviation of diff. What we get is that there is no difference. So it does appear that vegan is using the exact calculation for rarefying your data.
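Here's a sketch of that comparison, mirroring the steps just described (it assumes my_rarefy is the hand-rolled function from earlier episodes, taking a vector of counts and a sampling depth):

```r
# vegan's rarefied richness, tidied into a tibble
vegans <- vegan::rarefy(shared_df, min_n_seqs) %>%
  as_tibble(rownames = "group") %>%
  select(group, vegan = value)

# The same calculation using our own function
mine <- shared %>%
  group_by(group) %>%
  summarize(mine = my_rarefy(value, min_n_seqs))

# A mean and sd of zero for the differences indicates vegan
# is using the exact rarefaction calculation
inner_join(vegans, mine, by = "group") %>%
  mutate(diff = vegan - mine) %>%
  summarize(mean_diff = mean(diff), sd_diff = sd(diff))
```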
So why would you want to rarefy the data using a random approach? Well, this exact approach only works for something like richness. As we've seen in previous episodes, diversity metrics such as the Shannon-Weaver index and the Simpson index can be sensitive to sampling effort, and there isn't an exact solution for determining those metrics at a specific number of sequences or observations. So in that case, it is useful to go ahead and actually do the empirical rarefaction approach. Very good. So now we have the expected number of taxa we would see for a given sampling depth across all of our samples, and we've also seen up here on lines 42 through 44 how to get that out of the named vector and into a table so we can go on to do other things, like plot the data or anything else we'd want within the tidyverse. The next command I want to share with you is rrarefy(). rrarefy() gives us a single sub-sampling of the data to do whatever we'd want with, right? Basically, this is like one iteration of the rarefaction step. So we can call rrarefy() and give it shared_df and sample = min_n_seqs, and this gives back just a mess. We can then pipe it into as_tibble() with rownames = "group", which gives us a tibble, and then make it long with pivot_longer(-group), and we're right back to where we started. Again, rrarefy() gets you one sub-sampling of all your data. You can think of rarefaction as basically doing hundreds or thousands of these sub-sampling steps and then pooling all the data together.
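The rrarefy() steps above look like this:

```r
library(vegan)

# One random sub-sampling of every sample down to min_n_seqs sequences,
# converted back into a tidy tibble; rrarefy returns a matrix, so
# as_tibble can recover the sample names from its row names
rrarefy(shared_df, sample = min_n_seqs) %>%
  as_tibble(rownames = "group") %>%
  pivot_longer(-group)
```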
So if you wanted, you could take this data, calculate, say, Shannon diversity, then repeat the rrarefy() and Shannon diversity steps 100 or 1,000 times, and average the results to get an empirical rarefaction of the Shannon-Weaver diversity index. Another great tool from vegan is drarefy(). This tells you the probability of seeing each taxon for a given level of sampling effort. So we call it with shared_df and sample = min_n_seqs, and again we get a really gigantic data frame that's quite wide. I'm going to grab the couple of lines of code from what I had up above with rrarefy() to output the group, the name of the taxon, and the probability of seeing each taxon in each sample, say in F3D0, if we draw min_n_seqs sequences, right? And so what you'll see is that the output in the tibble is a bit truncated: most of these values display as 1, and others get maybe three total significant digits, like 1.00. Three significant digits is the default number of significant digits a tibble will show you. If we want to increase that, we can alter the options for how values are displayed in a tibble. To do this, I'm going to say old <- options(pillar.sigfig = 7). If I look at old, I see that the previous value was NULL, which tells pillar, the package that tibbles use to figure out formatting, to use the default. So the old setting was the default, and now we are using seven significant digits. Now when I rerun drarefy(), I get seven significant digits, and you can see that the probability for the first OTU is pretty close to dead on 1: we're essentially guaranteed to see it when we draw min_n_seqs sequences, right?
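The drarefy() call and the display tweak described above can be sketched as:

```r
# Probability of detecting each taxon in each sample at min_n_seqs reads
drarefy(shared_df, sample = min_n_seqs) %>%
  as_tibble(rownames = "group") %>%
  pivot_longer(-group)

# Tibbles default to 3 significant digits; show 7 instead, keeping the
# previous setting so we can restore it later with options(old)
old <- options(pillar.sigfig = 7)
```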
And if I look at the tail of this output, we see that one of these OTUs has about a 23% chance of being seen in sample M69, whereas these other taxa we're essentially never going to see if we sample to min_n_seqs, which again was 1828. Cool. If I want to go back to the default, what I can do is run options(old), and if I rerun this, we'll see that we go right back to what we had before, three significant digits. Okay. So again, this was a useful tool for seeing how we can calculate the probability of seeing each taxon in each sample for a given level of sampling effort, and how we can use the options() function with the pillar.sigfig argument to set the number of significant digits we want. We could take this up to 10 or whatever we'd like. The last function I want to spend time talking with you about today is rarecurve(). rarecurve() will take in our shared_df data frame and make rarefaction curves. So we call rarecurve(), give it shared_df, and say step = 50. What step = 50 does is say: only output the data every 50 observations. So I don't need to see 1, 2, 3, 4, 5 all the way up to 10 or 20 thousand; instead, it'll go 1, 51, 101, 151, and so on. We might even bump the step up to 100; let's do that. What we get out of rarecurve() is those 227 rarefaction curves, with a text label for each sample drawn on top of the curves. This figure is generated in base R, and I'm not interested in using base R. I'd like to know how I can get the data for the curves out of base R and into ggplot2. So what I'll do is pipe this into str(). The str() function reveals that the output of rarecurve() is that figure, but it's also a list.
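Saving rarecurve()'s returned list of curves for the tidying steps that follow:

```r
# Draws the 227 base R rarefaction curves and also returns the
# underlying per-sample curve data as a list, sampled every 100 reads
rarecurve_data <- rarecurve(shared_df, step = 100)
```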
And I know it's a list because I've got those dollar signs, but it's not keeping the name of each sample next to the dollar sign, and that's a bit annoying. The other thing that's interesting is that each element in this list is a named vector. We can see values here, say 142, 159.6, and so forth: the expected numbers of OTUs, or taxa, that we would see when sampling 1, 101, 201, 301 sequences and so forth, right? So what we want is to get this out of this kind of funky list and into a table so we can then go into ggplot2. I'm going to go ahead and save this as rarecurve_data, and then I'm going to use map_dfr() on rarecurve_data with bind_rows(). Basically, this takes each element of the list and binds them together, row by row, to make a data frame. What we'll see is column names like N1, N101, and so forth; these are our sampling depths going across the columns, and our rows are our different samples. We're going to need to bring back in those sample names. But if I come out here and select() one of the deeper columns, say N10001, I'll see that I've got a bunch of NA values, right? These are samples that didn't have 10,001 sequences. What's happening with bind_rows() is it plugs in an NA wherever a value is missing for that column. Okay, so we have an idea of what the data look like. I'm going to go ahead and pipe this into a bind_cols() call, giving it the row names of shared_df first so that the stuff coming through the pipeline sits to the right of those row names. This then gives me the groups, although I didn't give the column a name, so I'll write it as group = rownames(shared_df). And so now, sure enough, we get a column for our group, followed by the columns for the different sampling depths.
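The list-flattening steps just described, as a sketch:

```r
# Each list element is a named vector (N1, N101, ...); bind_rows turns
# each into a one-row data frame and map_dfr stacks them, padding
# shorter curves with NA, then bind_cols restores the sample names
rarecurve_data %>%
  map_dfr(bind_rows) %>%
  bind_cols(group = rownames(shared_df), .)
```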
So now, because I want to work towards putting this into ggplot2 to make the line plot, I need to go ahead and pivot_longer() to make it tidy, pivoting everything but the group column. Very good. The next thing I want to do is take that name column and turn it into the number of sequences; it's currently of type character because it has the N and then the number. So I'll do a mutate(n_seqs = as.numeric(str_replace(name, "N", ""))): I take the N, replace it with nothing, and wrap the whole thing in as.numeric() to convert the text into a number. Sure enough, now we have n_seqs as a double, and I can go ahead and do a select(-name). The name column isn't causing any problem, but let's get rid of it. Now we can go ahead and do ggplot(aes(x = n_seqs, y = value, group = group)) and add a geom_line(). Looking at our figure, we get some really funky output here; it almost looks like I used geom_point() rather than geom_line(). When I come back to my console, I see it removed 76,118 rows because they contained missing values. If you remember, when we looked at the right side of the output of our map_dfr(), there were a bunch of NA values. Well, that's what's going on here, so we need to get rid of those NA values. After pivot_longer(), I'm going to add a drop_na() and feed that into the rest of the pipeline. There we go: we have 227 rarefaction curves. That alone isn't super helpful, so something we might add to this would be a vertical line to indicate the depth of sampling we've been using, min_n_seqs. So before geom_line(), I'm going to add geom_vline(xintercept = min_n_seqs, color = "gray"), and I'm going to finish with theme_classic().
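Pulling the whole pipeline together, from the rarecurve() list to the ggplot2 figure:

```r
rarecurve_data %>%
  map_dfr(bind_rows) %>%
  bind_cols(group = rownames(shared_df), .) %>%
  pivot_longer(-group) %>%
  drop_na() %>%   # remove depths a shallow sample never reached
  mutate(n_seqs = as.numeric(str_replace(name, "N", ""))) %>%
  select(-name) %>%
  ggplot(aes(x = n_seqs, y = value, group = group)) +
  geom_vline(xintercept = min_n_seqs, color = "gray") +  # rarefaction depth
  geom_line() +
  theme_classic()
```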
And so what I get is the vertical line indicating where 1828, the value of min_n_seqs, falls, which shows us where we're sitting in terms of sampling effort. Again, with 227 samples in this rarefaction plot, it gets kind of nasty, right? You really can't see much of anything going on in here, and I don't think there's much value in staring at the rarefaction curves themselves. I would far rather look at the output of rarefy(), convert it into a tibble, and go on my way with statistical analyses comparing, say, different time points in these animals' lives, and whether something happened between time points to change the number of taxa in their communities. So I hope you found this discussion of how we can use vegan useful. In previous episodes, I've talked about how we can basically make our own versions of these functions. It is nice to be able to use vegan and have it do it all for us; oftentimes packages like vegan are far better engineered than what we can do on our own, and that removes a lot of the guesswork. Sometimes the math in these functions is more than we want to carry on our own, so it is nice to be able to rely on something like vegan. At the same time, I do think there is value in going back through those previous episodes so you can understand the mechanics of what's going on in these different functions: if you had to code it yourself, how would you do it? As we've seen, these functions aren't doing anything super complicated, so we're gaining a bit by being able to use vegan, but it's also not something we couldn't do on our own. Hopefully this episode also underscores the value of getting really flexible and comfortable using tools from the tidyverse, because the output from these vegan functions is a little bit gnarly, right?
And so the ability to figure out how to get that output into a tibble, so that we can do things like pivot_longer(), pivot_wider(), group_by() and summarize(), and ggplot2, is all really powerful and really useful, and it's something we need to keep practicing so that we can get better and more comfortable. So, I encourage you to keep practicing with this, and we'll see you next time for another episode of Code Club.