Hey folks, if you've been watching recent episodes of Code Club, you know I've been doing a pretty deep dive, a pretty thorough examination of different ways of calculating ecological distances. It's not so much how we calculate those ecological distances, because we're using the Bray-Curtis distance for all of our analyses, but mainly questions like: do we rarefy our data? Do we use relative abundance data as input to those distances? Or do we normalize our counts so that every sample we look at has the same number of observations? In the project that I'm drawing the data from, we had something like 225 different samples, and they varied by about 20-fold in the number of sequences that we found in each of the samples. And so the question then is, what do we do to control for that uneven sampling effort? As I've shown, we can see variation in distances based on the variation in the number of sequences for the two samples that went into that distance. In other words, if two samples had a very similar number of sequences, the distance between them would actually be larger than you would expect, whereas a large difference in the number of sequences between two samples would actually give you a smaller distance between those two samples. So that got me thinking, and we've been working with a null model where I took that mouse data that I drew the initial data from and reshuffled all the observations, so that every sample had the same number of sequences or observations as it did originally, and every OTU had the same number of individuals as it did originally. But we moved everything around so that the individual samples represent a statistical sampling of an overall community, a metacommunity, if you will. 
And so if I randomly divided the samples in half, then compared them using a statistical test like adonis, I should not see a difference between those different communities. But what I'm wondering is, instead of randomly assigning samples to one of two different treatment groups, again just arbitrarily, what if I assigned them based on the size of the library, so that all of the smaller libraries, the libraries with a number of sequences below the median, went to treatment group A, and everything above the median went to treatment group B? And then I ran a statistical test. What would I find? How often would I falsely say that the two treatment groups were significantly different? That is what you might call a false positive, or a false positive rate, or the type I error. We would expect that 5% of the time, if we're testing at a 5% significance level, we would falsely say that they were different, whether we skew things by the size of the library or, ideally, if we don't skew things by the size of the library. So that's what I want to dig into today with you. We're going to build off some of the concepts we've talked about in recent episodes, doing a lot of pivot_longer and pivot_wider. We'll again see future_map_dfr. And so we'll see a lot of our good friends and use them to answer what I think is an important question in thinking about how we handle our data: do we rarefy or not? We'll dig into that in today's episode. So I've got my distances.R script open here. If you want a copy of this, it is a bit of a jumbled mess at this point, because we keep adding to it and modifying it. But you can get it if you go down below in the description to the link to the blog post, which will give you all the coordinates at GitHub that you need to get things set up. Again, there's a link up here to a video showing you how you can use all that information to get the code, to get the data, everything you need to follow along today. 
And following along is the most important thing you can do, because you'll actually be engaging with the material. I'd also encourage you, once we go through this video, to try this with your own data and let me know what you find. If you take your own data and run it through these paces, do you see results like what I'm getting for this mouse data? So to get going, I'm going to open up a new R script, and I'll call this false_positive.R, and I'll throw that in my code directory. And I'm going to borrow some stuff from distances.R. I'm going to take these libraries, and I'm going to take all this stuff where I calculate the random null model. So I'll go ahead and copy that over to false_positive.R. And so again, if I run all this, I then get rand as a tidy data frame. This smallest_group, I think, is important because it will indicate the size of the smallest sample. So I'll bring that over here into my false_positive.R script. Instead of going back through distances.R, I'm going to do it with you here live, generating the four different types of distance matrices, and then we'll feed those into adonis. So to start out with our relative abundance: again, if we have rand, we can take rand and group by the group column. Then we can pipe that to a mutate, where we calculate rel_abund as n divided by the sum of n. And so now we have a relative abundance column; we do still have that grouping variable, so we'll do ungroup on that. And we'll go ahead and do a select to get group, rand_name, and rel_abund. I will then pipe this to pivot_wider, because we want a data frame with our OTUs, or rand names, in the columns and the samples of the group column as the rows. And so we'll do pivot_wider with names_from, and that will be rand_name, and then values_from will be rel_abund. And I think I need quotes around rand_name. And then we'll do values_fill equals zero. 
Because again, there are going to be combinations of group and rand_name that don't exist in this tidy data frame, and when we go wider, those would otherwise be filled with NA values. So again, we now get this table, and it needs to be a data frame, because we need the sample names as row names, and you can't have row names with a tibble. So we'll do as.data.frame. Great. So now I'm going to call this rand_df. And then the row names on rand_df are going to be the group column of rand_df, which is the first column. And then we can get rid of that column from rand_df by doing rand_df square brace minus one; the minus one removes the column, right? And so now if we do str on rand_df, we see that we've got columns for all of our OTUs, and we should not have a column here for our group. And sure enough, we don't. So instead of calling this rand_df, I'm going to call it rabund_df, to make it a little more descriptive of what the data represent. Right. And so let's go ahead and run these things to get rabund_df. Now, I got a little excited there, going straight to calculating the relative abundances. We also need a data frame that's not rarefied. And so if we take rand, we can see this data frame: we've got group and rand_name. I can skip over the group_by, mutate, ungroup, and select to go straight to making it wider. And so that will be our non-relative-abundance data, the raw counts. So we'll do pivot_wider, and we'll do names_from equals rand_name, and then values_from equals n. And again, values_fill equals zero. And we'll pipe that into all this good stuff where we get the data frame, and we do the thing with the column names, the row names rather. And I'm going to call this no_rare_df. And then we can take no_rare_df and put that in for rabund_df. That's great. So no_rare_df we'll use to calculate a distance matrix without rarefaction, as well as a distance matrix with rarefaction. So we've got those, basically, three data frames, if you will, squared away. 
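Putting the spoken steps above together, here's a rough sketch of what the two data frames might look like in code. The object and column names (rand, group, rand_name, n) are my assumptions from the narration, not necessarily the exact names in the original script.

```r
# Sketch only: assumes a tidy data frame `rand` with columns
# group (sample name), rand_name (OTU), and n (count).
library(tidyverse)

# Relative abundance version: wide sample-by-OTU matrix of proportions
rabund_df <- rand %>%
  group_by(group) %>%
  mutate(rel_abund = n / sum(n)) %>%   # proportion within each sample
  ungroup() %>%
  select(group, rand_name, rel_abund) %>%
  pivot_wider(names_from = "rand_name",
              values_from = "rel_abund",
              values_fill = 0) %>%     # absent OTUs become 0, not NA
  as.data.frame()                      # tibbles can't carry row names

rownames(rabund_df) <- rabund_df$group
rabund_df <- rabund_df[-1]             # drop the group column

# Non-rarefied version: same shape, but raw counts
no_rare_df <- rand %>%
  pivot_wider(names_from = "rand_name",
              values_from = "n",
              values_fill = 0) %>%
  as.data.frame()

rownames(no_rare_df) <- no_rare_df$group
no_rare_df <- no_rare_df[-1]
```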
So now we need to normalize with SRS. SRS comes to us from the package SRS, which I loaded way back up here on line three, right? And so SRS normalizes the counts so that every sample has the same number of reads. Unfortunately, it needs the data in a slightly different format: SRS needs a transpose of what we have in no_rare_df. So I can take no_rare_df, and I can pipe that to the t function, which will transpose it. The output of the transpose is a matrix, and the input SRS needs is a data frame, so we'll do as.data.frame on that, giving us a data frame. We could pipe this to str to prove it to ourselves; again, we see the dollar signs telling us that it is in fact a data frame. And so we can now pipe this to SRS. We'll use the data coming through the pipeline in that first spot for the data argument, and then Cmin we'll set to be smallest_group. So now we have a normalized data frame with our sample names in the columns; we actually want those in the rows. And so we'll then pipe that back through the transpose, and we now get our OTUs across the columns and our samples as the rows. We could pipe this to dim to prove that to ourselves: yeah, we have 227 rows and 1588 columns. I'll go ahead and call this normal_df. It's actually a matrix right now, so I'll go ahead and make it a data frame with as.data.frame. It's not strictly necessary, but let's keep the types all consistent. And so again, now we have normal_df; no_rare_df, which will go in for calculating a non-rarefied distance as well as a rarefied distance; and we also have the rabund_df. And so now we need to go ahead and calculate our distances. So we'll do no_rare_dist, and that will be vegdist on no_rare_df with method equals bray. We'll do rabund_dist as vegdist on rabund_df, again with method equals bray, and normal_dist as vegdist on normal_df with method equals bray. 
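Here's a sketch of the SRS normalization and the three non-rarefied distance matrices just described. The SRS() call and its Cmin argument come from the SRS package; the object names are my guesses from the narration.

```r
# Sketch only: assumes no_rare_df, rabund_df, and smallest_group exist
library(vegan)
library(SRS)

# SRS wants samples in columns, so transpose, normalize, transpose back
normal_df <- no_rare_df %>%
  t() %>%                          # t() returns a matrix...
  as.data.frame() %>%              # ...but SRS wants a data frame
  SRS(Cmin = smallest_group) %>%   # scale every sample to the same depth
  t() %>%                          # samples back in the rows
  as.data.frame()

# Three Bray-Curtis distance matrices that don't involve rarefaction
no_rare_dist <- vegdist(no_rare_df, method = "bray")
rabund_dist  <- vegdist(rabund_df,  method = "bray")
normal_dist  <- vegdist(normal_df,  method = "bray")
```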
So now we have our three distance matrices that don't require rarefaction. We'll now do rare_dist as avgdist with no_rare_df, dmethod equals bray, and then sample equals smallest_group. So now we have our distance matrices. Next we need something to tell adonis, which is what we're going to use to test for differences in distances, what samples belong to each treatment group. So I need to create a distribution of the number of reads per sample. And so I'll come back up here. And I realized that when I defined smallest_group, I actually forgot to copy over rand_group_count; I have it over in the distances.R script. And these are the problems with spaghetti code and moving things all over the place. So we'll do rand_group_count equaling rand, and we'll group by group, and then we'll summarize n equals sum on n. So let's make sure this looks right. That's good. And then smallest_group. Yeah, 1828. That looks good. And so I'd like to come down here with rand_group_count and do treatment. And I'll do an ifelse. So: ifelse, rand_group_count dollar sign n is less than the median on rand_group_count dollar sign n. If it's less than that, I'm going to say it's treatment group A; if it's greater than that, it's going to go to B. Now if I look at treatment, I get these A's and B's, and they correspond to the number of sequences in each of the samples. Great. So now we're ready to feed this into adonis. adonis comes to us from the vegan package, and because we have vegan loaded, we're all good to go. And so now I can do adonis, and I'll give it my distance matrix, tilde, the treatment group. So I'll do no_rare_dist tilde treatment. And what I find is that the p-value is actually quite small, one in a thousand; it does 999 permutations by default. So the p-value that came out was quite small, saying there is a significant difference between the two groups as defined by whether or not the sample was larger than the median number of sequences per sample. 
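In code, the steps just described might look something like this sketch. The avgdist() and adonis() calls are from vegan; the assignment of treatment groups by library size is reconstructed from the narration.

```r
# Sketch only: assumes rand, no_rare_df, no_rare_dist, and
# smallest_group from the earlier steps
library(vegan)
library(tidyverse)

# Rarefied distance: avgdist() rarefies repeatedly and averages distances
rare_dist <- avgdist(no_rare_df, sample = smallest_group, dmethod = "bray")

# Total number of reads (sequences) per sample
rand_group_count <- rand %>%
  group_by(group) %>%
  summarize(n = sum(n))

# Deliberately confound treatment with library size:
# below-median libraries -> "A", the rest -> "B"
treatment <- ifelse(rand_group_count$n < median(rand_group_count$n),
                    "A", "B")

# Test whether these two arbitrary groups differ
adonis(no_rare_dist ~ treatment)
```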
So hold on to that; we'll come back to it. But I would like to extract that p-value, because I want to know whether or not this test is significant. I want a TRUE or FALSE, right? So what I could do is go ahead and assign this to no_rare_test. And then what we can do is str on no_rare_test. And so we get a whole bunch of output for the structure of this object. The object is a list with seven things, right? And one of them is an aov.tab. And so what I want to show you briefly is how we can get this number, right? So we need no_rare_test dollar sign aov.tab, and then we'll do another dollar sign with back ticks around Pr(>F). So let's run that and see what we get. Yep. And so it's a vector with three elements, as we could see back here, right? So it's a vector with three elements, and we want the first element, or first slot. So we can then use square brackets to get the one. And so we get 0.001. Another way that we could have written this would be like so, without the dollar signs: we could wrap each of these elements in double square brackets with quotes. Let's see. And those characters in column names, or in variable names, are always a bit of a pain. Yeah, so this needs to be in quotes, right? Yeah, so we get 0.001, right? So these two lines do the same thing. I think I'm going to stick with the dollar signs though, because what I'm going to do is concatenate this all together. And so now no_rare_test is the value 0.001, right? And so I can say no_rare_sig equals no_rare_test less than 0.05. And so now this should be TRUE, right? Great. So now we can repeat this for our other tests, right? And so where we have no_rare, we want rare; we also want rare_dist, so put rare there; and then we want rabund. So again, this code isn't super DRY. I'm not overly concerned about it, because there are only four things and they're all right here together. But it might be good practice to figure out how to make this a function. 
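The p-value extraction just described looks roughly like this. The aov.tab element and Pr(>F) column are part of the object that vegan's (classic) adonis() returns; the object names are from the narration.

```r
# Sketch only: assumes no_rare_dist and treatment from the earlier steps
library(vegan)

no_rare_test <- adonis(no_rare_dist ~ treatment)

# Dollar-sign style; back ticks are needed because of the ( > ) characters
p <- no_rare_test$aov.tab$`Pr(>F)`[1]

# Equivalent double-square-bracket style with quoted names
p <- no_rare_test[["aov.tab"]][["Pr(>F)"]][1]

# TRUE/FALSE: was this test significant at the 5% level?
no_rare_sig <- p < 0.05
```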
That way I'm not repeating the same thing four times. And then here we want normal_test and normal_dist. I missed an n in my rabund; I'd typed "rabud". And then I could create a vector. So I could do no_rare equals no_rare_sig, rare equals rare_sig, rabund equals rabund_sig, and then normal equals normal_sig. And so then I get out this named vector with two TRUE values, for the no_rare and the relative abundance, and two FALSE values, for the rare and the normal. And so you're probably thinking, oh my gosh, those tests gave significant results. But hold on, because this is for only one randomization, one construction of rand, right? So we could have gotten lucky or unlucky in generating rand. And so what we need to do is repeat the construction of rand and all the testing, say, I don't know, 100 times. We can run it 100 times and see how many of those reshufflings of shared, to generate a random null model of the shared data, give us a significant value. And so that's what we need to do: we need to encapsulate all this in a function. So I'll create a function called run_iteration. This will take a value x, and we'll open up the function with a curly brace; I'll highlight everything, bump it over a smidge, and then put in a closing curly brace. And I want to pass this an argument x, which is going to be my seed. So I'll do set.seed on x. So I'll hand run_iteration a value between one and 100, and each of those values will become the seed for the random number generator that we use for things like sample, or for avgdist; I think SRS can also take a seed, where we can do set_seed equals TRUE and then seed equals x. Again, if we load run_iteration and I do run_iteration on one, I again get two TRUEs and two FALSEs. But again, that's for only one iteration of the data. And so I'd like to run run_iteration a bunch of times. So we'll go ahead and do map_dfr. 
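The shape of run_iteration is sketched below. The body is deliberately elided ("...") because it's all the code from earlier in the episode, from rebuilding rand through the four significance tests; only the seeding and the returned named vector are shown.

```r
# Sketch only: one full randomization-and-testing cycle per call.
# x seeds the random number generator so each iteration is reproducible.
run_iteration <- function(x) {
  set.seed(x)

  # ... rebuild rand, the four distance matrices, treatment, and the
  # four *_sig TRUE/FALSE values exactly as in the code above ...

  # A named logical vector: was each approach's test significant?
  c(no_rare = no_rare_sig,
    rare    = rare_sig,
    rabund  = rabund_sig,
    normal  = normal_sig)
}
```

Returning a named vector is what lets map_dfr (or future_map_dfr) stack the iterations into a data frame with one column per approach.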
And let's start with one to three. And we will then give that run_iteration. And I'm also going to do .id equals seed; this will add a column called seed with values of one, two, and three for the three seeds that we used to run this iteration three times. So that worked pretty well. We got three seeds through; it took a couple minutes. And all of the values for no_rare were significant, as well as for rabund. The rarefied data, all three of them were FALSE. And for normal, one of the three was TRUE. So we now want to scale this up to do 100 iterations. So I'm going to put in 100 here, but I'm not going to run 100 iterations serially on my laptop, because that would take too long; I'm a little bit impatient. So I'm going to parallelize this across the 16 processors on my laptop, and I'll do that using the furrr package. Again, furrr is like purrr, except it incorporates the future package that allows for parallelization. And so I'll come back up here to the top, and I'll do library on furrr, and then I'll do plan multisession. That's how you can achieve parallelization working in RStudio. Then I can change map_dfr from purrr to furrr by adding future to the front: future_map_dfr. I'll then assign this to sig_results, and we can let it rip. Alright, so that took a while to run. If we look at sig_results, we now get this table of TRUEs and FALSEs, 100 rows for 100 seeds. That's fantastic. We'll go ahead and take sig_results, and I now want to summarize by finding the number of TRUEs, or the proportion of seeds that had a TRUE value, for each of the four different approaches. So we'll do sig_results, pivot_longer, and we will do everything but seed. And we'll do names_to, and let's do method, and then values_to equals significant. Great. And so now we can group by our method, and we can then summarize. And because it's a logical, we can actually take the mean. So I'm going to summarize false_pos to equal the mean of our significant column. 
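The parallel run and the summary might be sketched like this; furrr's future_map_dfr and plan(multisession) are real, while the object names again come from the narration.

```r
# Sketch only: assumes run_iteration() is defined as above
library(tidyverse)
library(furrr)

plan(multisession)  # parallelize across the local cores

# Run 100 seeded iterations in parallel; .id adds a "seed" column
sig_results <- future_map_dfr(1:100, run_iteration, .id = "seed")

# Fraction of iterations that came back significant for each method
sig_results %>%
  pivot_longer(cols = -seed,
               names_to = "method",
               values_to = "significant") %>%
  group_by(method) %>%
  summarize(false_pos = mean(significant))  # mean of logicals = proportion
```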
So the summary data are very interesting. Again, in this simulation we are assigning samples to a treatment group based on the number of sequences, right? So if you were a small sample, you got put into group A; if you were larger, you got put into group B. So this is a case where there's complete confounding between the treatment group and the number of sequences. I admit that this is a worst-case scenario, yet our methods should be good enough to handle worst-case scenarios. We found that all of the time, 100 out of 100 times, when we don't rarefy, it comes back as significant. If we use relative abundance, 100% of the time it also came back as being significant. If we use that SRS normalization procedure, 26 out of 100 times it came back as being significant. And if we rarefied, then two out of 100 times it came back as being significant. This number should be about 0.05. The reality is, if you ran this procedure multiple times, you would get fractions that came out more or less like these, but the values might bop around a little bit. So the fact that this rarefied value is three off of what theory says it should be doesn't make me super concerned. If I were doing this for a publication, I might up the number of iterations from 100 to 1000. But I feel pretty good that the rarefied data came back right around the theoretical value that we'd expect of 5%. Again, if you use a 5%, or 0.05, p-value threshold for determining significance, then what that means, and don't get me trapped in what a p-value means, but basically: if you don't expect to see a difference and you do a test, what percentage of the time should that test come back as being significant? It should come back 5% of the time, right? So 0.05. And so that we got 0.02 isn't so far off, and I'm pretty happy with that. Again, I think this illustrates some of the problems with alternative approaches to rarefying. Sure, normalization worked fairly well, but it still had a pretty high false positive rate, about 25%. 
That would give me pause. So I'm happy to continue using rarefaction; I think it protects against these types of situations. My simulation may seem a little over the top. But a couple episodes ago, I brought up a real-world version of this scenario, where I talked about a study we did where we looked at the oral microbial community and compared it to the lung microbial community of the same patients. And what we found was that it was really hard to get a lot of 16S sequence reads out of the lungs, because there was a lot of contaminating host DNA that would co-amplify with the 16S. And so we ended up going down to something like 500 reads per sample, because that's all we could get out of the lungs, right? Whereas in the mouth, we get tons of reads. So it's not that far-fetched to think that you might have a treatment that disproportionately affects the number of reads you get out of your samples. And while that situation might not hold for every study, I'm pretty comfortable staying with rarefaction for my analysis of beta diversity metrics. Again, I would love to see, if you take my code and apply it to your data, what kind of result you get. What does your summary look like of the fraction of these 100 iterations that came back as significant? I've done this on some other data that we have from a cancer project, and it came back right about the same as this. So again, I would love to see what you get. I think it'd be great practice for you to run your own data through this pipeline. Please be sure that you subscribe to the channel and click that bell icon. If you want to go back to the beginning of this playlist, where we've been talking about rarefaction, be sure to check out this video that I have here. Take care, and we'll see you next time for another episode of Code Club.