Hey, folks. I'm going to try one last approach to transforming our data so that we can avoid rarefying before calculating beta diversity indices. That's what we're going to do in today's episode. We're going to do it using the variance stabilizing transformation, a technique we can get out of the DESeq2 package. If you look on CRAN, you will not find DESeq2. So where is it? Well, it's in Bioconductor. And how do we get packages out of Bioconductor? You'll see how to do that in today's episode as well. The variance stabilizing transformation approach was suggested by McMurdie and Holmes in their PLOS Computational Biology paper, "Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible". This paper has gotten a lot of play: whenever we submit papers where we rarefy data, reviewers reference back to this paper, and I think they're asking my opinion of it. I've never really said anything publicly about it, so I will here today. In this episode, I'm going to compare what happens when you transform your data using the variance stabilizing transformation and then calculate a distance matrix, versus rarefying your data and calculating the distance from the rarefied data. That comparison will tell us whether the variance stabilizing approach removes the effect of having different numbers of sequences in your pairs of samples. You'll recall that the big problem we're dealing with here is that in the dataset I'm working with, and in pretty much every dataset I've ever worked with, we see between a 10- and 100-fold difference in the number of sequences in each of the samples. This has big implications for the distances we're calculating.
In our simulation, we sample from a metapopulation: we take all of the counts for the different taxa, pool them together, and then sample out of that metapopulation to regenerate about 225 samples that each have the same number of reads the original samples did before everything was pooled. This is what I'm calling a null model. In other words, if we had infinite sampling, there would be no difference between these replicate communities. But as we know, there are differences, because we don't have infinite sampling; we have partial and sometimes variable levels of sampling. So what we're asking is: if I apply these different transformations, do I get a consistent pairwise distance between pairs of samples, regardless of the number of sequences in each of those samples? With rarefaction, we've seen consistently that we remove the influence of the differences in the number of sequences between samples. We've tried a variety of other approaches at this point; you can go back and look at the rest of the playlist to get the blow by blow. But the last approach I want to try is this variance stabilizing transformation that comes to us from DESeq2 and was recommended by McMurdie and Holmes. So I have an R script that I've created called variancestabilization.R. What I've done is taken the top chunk of code out of the script we used in the last episode to calculate those Aitchison distances. You'll see that I'm loading a variety of libraries. I'm going to leave zCompositions in here because we might need it later in today's episode. Then I read in this shared file, which is mothur's version of an OTU count table.
We then generate a randomized version of that shared data by pooling together all of the OTU counts and randomly assigning individuals back out to samples, so that the total number of sequences in each sample matches the total that was in the original shared variable. Then I calculate the number of sequences in each of the samples, or groups. I've also created a data frame that can be fed into vegan's avgdist() to do a rarefied version of the Euclidean distance, which is the distance we will use on our variance stabilized data; it's also what we used in the Aitchison episode last time. I'm going to go ahead and run all of this and get everything loaded into R before we get going today. Everything in this initial version of the script works; everything loads nicely. I should say, if you want a copy of this R script, down below in the description there's a link to a blog post with all the information from GitHub about how to get set up. Also, across the top here I've got a video that will walk you through the steps to get caught up to where we are today. The first thing I want to do is install DESeq2 so that we can use this variance stabilizing transformation. I'll come over to my Packages tab and type in DESeq2. There's nothing there; I don't have it installed. If I go to Install, type in DESeq2, and click Install, the output tells us that there's no version of DESeq2 available for my version of R, which is 4.1.2. In fact, it's never been available on CRAN for any version of R, and that's because DESeq2 is a package out of Bioconductor. Bioconductor is a collection of highly vetted R packages that are all assembled together in this entity called Bioconductor, whereas CRAN is where most of our packages come from.
CRAN has a certain level of screening that every package goes through to make sure it's robust. Bioconductor takes that up another notch, and Bioconductor typically has packages that are useful for people in biology. So we need to figure out how to install DESeq2 from Bioconductor. I'm going to come over to my browser and do what I always do when I'm not sure of the answer: Google it. I'll search for "DESeq2 install", and the first link is the DESeq2 page on Bioconductor. This page gives us really useful information for installing the package: start R, version 4.1 or later, which we're using, and then run three lines of code that install BiocManager, the Bioconductor package manager, and then use the install function from BiocManager to install DESeq2. So I'm going to paste that into my console and give it a run. This might take a few moments: it installs BiocManager if I don't already have it, and then also installs DESeq2. A couple of questions came up as it was installing. The first was whether I wanted to update everything; I said yes. It also asked if I wanted to compile things from source; I said sure. I'll go ahead and paste those lines up at the top of the script in case you're coming along later and want to see how I installed BiocManager as well as DESeq2. I am going to comment them out, though, because whenever I rerun this script I don't want to be reinstalling DESeq2. The other thing I need to add here is the library() call to load DESeq2. It's another one of those packages with funky capitalization, but RStudio's autocomplete makes it a lot easier to get the name right. Note the order in which I load these packages: I want tidyverse to be loaded last so that functions from tidyverse aren't getting masked.
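The installation lines from the Bioconductor page, plus the library() call, look like this:

```r
# Install BiocManager from CRAN if it isn't already there, then use
# it to pull DESeq2 down from Bioconductor
if (!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("DESeq2")

# After the first successful install, comment the lines above out so
# rerunning the script doesn't reinstall anything, and just load it
library(DESeq2)
```

Loading DESeq2 before tidyverse keeps tidyverse's functions from being masked by anything DESeq2 attaches.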
To make sure this all works, I'm going to quit RStudio and restart R so that everything loads in the right order. Everything ran through without any errors. The next thing I want to do is take a look at this varianceStabilizingTransformation() function from DESeq2. If I search for varianceStabilizingTransformation in my Help tab, I get the description of it. In the usage I can see that it takes an object, plus blind and fitType as arguments. Looking at the arguments to see what those mean: the object is a DESeqDataSet or a matrix of counts. We're going to stick with the matrix of counts because I don't want to mess around with trying to figure out how to make a DESeqDataSet. blind is a logical, so it takes TRUE or FALSE; the default is TRUE, and the rest of the DESeq2 documentation suggests leaving it at the default. fitType is how you tell the function how to fit the dispersion to control for that variance. Looking further down in the description, you see there are three different fit types: parametric, which is the default, local, and mean. We'll maybe come back and try the others, but I'm happy to stick with the default if that works well. As we saw from the help page, varianceStabilizingTransformation() requires a matrix as input, so I'm going to do rand_matrix <- as.matrix(rand_df). One question I have, though, is whether I want my samples in the rows or in the columns; currently they're in the rows. So I'm going to go back to the help page for varianceStabilizingTransformation(), where, down at the bottom, there's a set of example steps to walk you through how you might use the function. I'm curious to see what it expects the DESeqDataSet to look like, so I'm going to go ahead and generate this dds variable.
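The help page example builds its toy dataset with DESeq2's makeExampleDESeqDataSet() helper, something like this sketch:

```r
library(DESeq2)

# Simulate a small DESeqDataSet; by default it has 1000 rows (genes)
# with the samples as the columns
dds <- makeExampleDESeqDataSet()

# Printing dds shows the dimensions, the rownames (genes), and the
# colnames (samples), which tells us the orientation the function
# expects: features in rows, samples in columns
dds
```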
If I look at dds — again, this is their toy dataset for showing how you would use varianceStabilizingTransformation() — we see that it's got 1000 rows, and those are gene names; in my world, those are the analog of OTUs. The column names are the different samples. So I want my samples, or my groups as I'm calling them, as the columns and my OTUs as the rows, which means I do need to transpose the matrix. Now if I look at rand_matrix and pipe it to str(), I see that I've got 1588 rows and 227 columns; 227 columns corresponds to my 227 samples, so we're in good shape. Another challenge we're going to have in using varianceStabilizingTransformation() is that we have a lot of zeros in our matrix. We have those zeros for a couple of reasons: it could be that the entity is below the limit of detection, or it could be that that taxon just wasn't there in the sample, which is actually very common in the microbiome world — everything is not everywhere. In this dataset, our samples are coming from a large metacommunity, so in a way those zeros are below the limit of detection: the taxa are in that metacommunity, but because we have incomplete sampling, we're not seeing them. In the real world, though, there would also be zeros because the organism was genuinely absent. As a way around this problem of the zeros, what McMurdie and Holmes did in the "Waste Not, Want Not" paper was to add one to all of the counts. This pseudocount, as it's called, means we no longer have the problem with the zeros. As we saw in the last episode with the Aitchison distance, we could instead impute the zeros using the zCompositions package, and we'll come back and try that in a moment. But I want to try it first the way McMurdie and Holmes did.
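Those reshaping steps can be sketched like this, assuming rand_df holds the randomized counts with a samples column of sample names and the OTUs in the remaining columns (your column names may differ):

```r
library(tidyverse)

# Move sample names into the row names, convert to a matrix, and
# transpose so OTUs are the rows and samples are the columns, the
# orientation DESeq2 expects
rand_matrix <- rand_df %>%
  column_to_rownames("samples") %>%
  as.matrix() %>%
  t()

# Add a pseudocount of 1 to every cell, as McMurdie and Holmes did,
# so that none of the counts are zero
rand_matrix <- rand_matrix + 1
```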
Now all of the values in the matrix are at least one, so I can call varianceStabilizingTransformation(), feed it rand_matrix, and call the result vst_matrix. It's complaining that the default fitType, parametric, didn't capture the dispersion trend, so it used a local regression fit automatically — that's fitType = "local"; we could come back and use fitType = "mean" later if we want. So let's go ahead and be explicit about what we're doing and say fitType = "local". That went through with no problems, so now we have our vst_matrix. To calculate the Euclidean distance I'm going to use vegdist(), and that actually needs the transpose of this matrix, so I'm going to take the output and pipe it to the t() function so that it's automatically transposed as vst_matrix. Then I'll create vst_dist by running vegdist() on vst_matrix with method = "euclidean". Now I want to convert these distances into a table so that I can compare them back to what I had already, which, back up here, you'll recall was the rarefied distances. So I want to be able to compare the rarefied distances to vst_dist. As we've seen in previous episodes, the output of vegdist() or avgdist() is a dist object. We can turn that into a matrix with as.matrix(), and then turn that into a table, with the row names becoming a samples column. Now I've got a column with samples and my square matrix off to the right, so I'll pivot everything but samples longer. That gives me a three-column data frame, but I only want one triangle, so I'll filter(name < samples). Now I've got my tidy version of the VST distances, which I'll call vst_dtbl.
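Put together, the transform-and-tidy steps I just walked through look roughly like this sketch (the object names follow my script; treat it as an outline rather than a drop-in):

```r
library(DESeq2)
library(vegan)
library(tidyverse)

# Variance stabilizing transformation on the pseudocounted count
# matrix; fitType = "local" because the parametric fit failed on
# these data. The result still has OTUs in rows, so transpose for
# vegdist(), which wants samples in rows.
vst_matrix <- varianceStabilizingTransformation(rand_matrix,
                                                fitType = "local") %>%
  t()

# Euclidean distances between samples on the transformed data
vst_dist <- vegdist(vst_matrix, method = "euclidean")

# Tidy the dist object: square matrix -> long data frame, keeping
# only one triangle so each pair of samples appears once
vst_dtbl <- vst_dist %>%
  as.matrix() %>%
  as_tibble(rownames = "samples") %>%
  pivot_longer(-samples) %>%
  filter(name < samples)
```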
I'm going to repeat this for my rarefied distances — I've done this so many times, you'd think at this point I'd have a function, but what can I say — calling the result rare_dtbl. Let's clean this up and load all of these objects. Great. Now what I can do is join these together with inner_join(). I'll put the rarefied version first, rare_dtbl, then vst_dtbl, and we'll join by samples and name. You might be asking: why am I spelling out samples and name when those columns are shared between the two data frames? Shouldn't R figure that out automatically? In a way, that's true, but let me show you what happens if I don't supply the by argument. You'll recall that rare_dtbl and vst_dtbl have the same three column names: samples, name, and value. If I leave out by, it's going to try to join on all three columns, and that gets funky: there aren't rows where both the keys and the distance values match, so with this inner_join there's nothing shared between the two data frames and we get zero rows back. That's why we need to be explicit in the by argument about what we're joining on. Next I want to join this with my group counts so that I can look at the difference in the number of sequences between the two samples being compared in each distance calculation. So I'll do another inner_join() of what's coming through the pipeline with group_count, and here the by will be a little different, because I'm going to say by = c("samples" = "group"). And I misspelled group, of course — now it works. I want to repeat this, but joining on the name column, so instead of samples I'll put name. Now I've got my two sequence-count columns, and I can use mutate() to calculate the difference.
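The joins and the mutate step described above can be sketched as below; group_count is assumed to be a data frame with group and n_seqs columns, and the other names follow my script:

```r
library(tidyverse)

comparison <-
  # explicit keys; without by, inner_join would also try to match the
  # shared value column and return zero rows
  inner_join(rare_dtbl, vst_dtbl, by = c("samples", "name")) %>%
  # attach n_seqs for the first sample of each pair...
  inner_join(group_count, by = c("samples" = "group")) %>%
  # ...and for the second; the duplicated n_seqs columns get the
  # .x/.y suffixes automatically
  inner_join(group_count, by = c("name" = "group")) %>%
  mutate(diff = abs(n_seqs.x - n_seqs.y)) %>%
  select(samples, name, rare = value.x, vst = value.y, diff)
```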
I'll take the absolute value of the difference between n_seqs.x and n_seqs.y. Good. Now I can do a select() to get samples and name, and then rare = value.x, vst = value.y, and diff. What I'd like to generate now is a plot with two facets, one for the rarefied data and one for the variance stabilized data. On the x-axis I want the diff values, the difference in the number of sequences, and on the y-axis the distances. So I need to pivot_longer() to gather together the rare and vst columns. Since I've already got a column called name in there — we saw this in the last episode, I think — I'll say names_to = "method" and values_to = "dist". Good. Now we can feed this into ggplot() with x = diff and y = dist, and we'll do geom_point(). Maybe I'll throw in geom_smooth() as well. Then we'll do facet_wrap() on the method column, and I'm going to set scales = "free_y", which gives each facet its own y-axis scale. This is putting the facets side by side; I could have put them top and bottom, and I'll do that in a moment. But this is pretty discouraging, right? The rarefied data we see is pretty much invariant to the difference in the number of sequences between the pairs of samples, whereas the VST is definitely sensitive to that difference, climbing as it increases. Something we might go back and try is a different method for parameterizing the variance stabilizing transformation, and instead of adding one to everything, we could use the zCompositions package to impute those zero values. So let's go back and give those a shot. Before I forget, let me set nrow = 2 in facet_wrap() so the facets stack top to bottom. The next thing I'm going to do is come back up to the variance stabilizing transformation.
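The plotting steps above can be sketched as follows, under the same assumed column names:

```r
library(tidyverse)

comparison %>%
  # gather the two distance columns so each method gets its own facet
  pivot_longer(cols = c(rare, vst),
               names_to = "method", values_to = "dist") %>%
  ggplot(aes(x = diff, y = dist)) +
  geom_point() +
  geom_smooth() +                  # smoothed trend line over the points
  facet_wrap(~method, nrow = 2,    # stack the facets top to bottom
             scales = "free_y")    # each facet gets its own y-axis scale
```

A flat trend line in a facet means that method's distances are insensitive to the difference in sequencing depth; a climbing line means they are not.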
Instead of fitType = "local", I'll do "mean", and then run everything else in the pipeline to see if that changes the appearance of the figure. Sure enough, it doesn't change anything: the VST distances still climb as you increase the difference in the number of sequences between the two samples. That's a bit troubling. One other thing I want to try is the pseudocount. Here I'm adding one; I wonder what happens if we reduce that, say to 0.1. Sure enough, I get an error message that basically says varianceStabilizingTransformation() expects all of the values to be integers — it expects counts — so if I give it fractional values like 0.1, that's not going to work. By the same token, the McMurdie and Holmes paper suggested using imputation methods to fill in those zeros, like we did in the last episode when calculating the CLR-based Aitchison distances. But that's not going to work here either, because that method imputes values for the zeros that tend to be quite small fractional values, and again the function requires integer counts. So I thought I was going to try adding in those zCompositions values, but it's just not going to work, and I think this is as good as it's going to get. We've now tried a variety of ways to get around having to rarefy our data: not rarefying at all, relative abundances, normalizing all the samples to the same number of counts, Aitchison distances using robust CLR, Aitchison distances using CLR with imputed zeros, and here the variance stabilizing transformation from DESeq2 feeding into a Euclidean distance. With all of these methods, we find a significant sensitivity of the distances to the difference in the number of reads between the samples.
The only approach that gets us away from that problem is rarefying our data. One of the claims in that paper, "Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible", is that we're getting rid of perfectly good data when we rarefy. The reality is that we're not really getting rid of data: we're using a whole bunch of data to calculate a mean. So I really don't see it as throwing away data. Perhaps I don't understand the jargon of "statistically inadmissible", but I'm happy using rarefaction. If it means I can control for uneven sampling effort across our samples, then I think rarefaction is the way to go. Ultimately, what would be great is if we could figure out how not to have 10-, 20-, 30-, or 100-fold differences in the number of sequences between our samples, but that is very hard, and for now I'm content to use rarefaction when dealing with my beta diversity indices. I hope you've found this extended discussion of rarefaction to not be totally boring. I hope you've found it useful, and I hope you've found it compelling how I took assertions that were made in the literature and used data and R to test them. As always, I would love it if you took your data, ran it through the paces I've run mine through, and saw whether you get a different result. Also, if I'm doing something totally stupid or missing the point somehow, please let me know. I really hope this helps you make your analyses more robust. In the next episode, I'm going to start looking at alpha diversity, and yes, we will talk about rarefaction there as well. We'll talk a little more about vegan, as well as the nuts and bolts of what rarefaction is doing, especially when it comes to alpha diversity metrics.
So that you don't miss that or any of the other exciting videos I have in mind, please be sure that you've subscribed to this channel and that you've clicked the thumbs up button for this video to tell me and everyone else that you loved it. Keep practicing with these concepts, and we'll see you next time for another episode of Code Club.