Hey folks, if you've been following along in the last couple of weeks of Code Club here, you know that I've been doing a pretty deep dive into various aspects of calculating ecological distances using the Bray-Curtis distance metric (dissimilarity index, the pedants will correct me). Anyway, we've been looking at different approaches to calculating that distance. We've looked at non-rarefied data, rarefied data, using relative abundances, and normalization. And we've been looking at some of the critiques that people have made about rarefaction and the things we might want to be worried about. What I've personally found is that I've perhaps gotten a little bit hardened: I still like rarefaction, and in fact I maybe like it a little bit more, because it's not so sensitive to the difference in the number of reads between the two samples you're comparing when you calculate the distance. So far we've looked at one threshold, one depth of sequencing coverage, to calculate our distances. And one of the things I've found is that when I rarefy to that threshold, the distance I find on average between my samples is actually higher than the average distance I find when I normalize or when I use relative abundance data. As I've shown before, there are other problems with relative abundance data and normalization that keep me from wanting to use them. But this question of the difference in the average distance has me thinking. So what I'd like to look at with you today is calculating the average Bray-Curtis distance, as well as the standard deviation of those distances, for a bunch of pairwise comparisons of libraries that have different numbers of reads, where we rarefy to a range of different depths per library. So far we've been rarefying to, I think, about 1,328 sequences per sample. Well, what if we went up to 2,000, 4,000, 5,000, 10,000, and so forth? As we increase that number, samples are going to start to fall out, because some samples just don't have that many reads. The reason we picked the threshold we did was that it was the size of the smallest library that we wanted to consider going forward. Heading over to RStudio, let me reintroduce you to the code we're working with. I'm in distances.R. If you want a copy of this script, down below in the description there's a link to a blog post where you can get everything I have: the data, the code, everything. I realize at this point this distances.R script is a bit of a dumpster fire. There's a lot going on in here, so work with me. This is real life, how I might actually do things when I'm playing with code — we call it spaghetti code because things kind of move in and out. When I start preparing this for publication, I'll pull things apart and clean things up to make my code a lot easier for people to read. But anyway, here we are. We've got our libraries: tidyverse gives us all sorts of good stuff for manipulating data and making plots; vegan allows us to calculate our distance matrices, with or without rarefaction; and SRS is the procedure we use to normalize our data. And then we read in the mouse data. The data I'm working with are from a study we published years ago looking at the microbial community and how it changed over the first six months or so of the life of mice.
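For reference, the top of the script looks something like this — a minimal sketch, where the file name is a placeholder for whatever the blog post actually links to:

```r
# Top of distances.R as described: libraries, then the mouse data
library(tidyverse) # data wrangling (pivot_longer/pivot_wider) and ggplot2
library(vegan)     # vegdist()/avgdist() for Bray-Curtis distances
library(SRS)       # the SRS normalization procedure

# hypothetical file name; the real data are linked from the blog post
shared <- read_tsv("mouse.shared")
```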
And so this is getting a certain subset of the samples, as well as those samples that have a desired number of sequences. Then what I did was randomize the data to make a null model. I basically took the OTU table that we observed and, you could think of it as, moved all the individuals around. But I did that in a way so that every sample still has the same number of sequences, and every OTU, or every taxon, has still been seen the same number of times across all those samples. Again, you can think of this as 220 or however many samples that were all randomly drawn from the same statistical distribution, but with different sampling depths. What that means is that when we calculate two sets of distances between, say, two pairs of our samples, those distances should be pretty close to each other. And that's what I want to look at: do those distances shrink or grow as the size of our smallest library goes up? Let's go ahead and generate that data frame. Then I've got my rand_group_count data frame, and here we have a chunk of code to generate a histogram of the number of counts. We can see we have some libraries that have as many as 30,000 sequences in them. If I want to find the range of values in there, I can do something like range(rand_group_count$n). And I see that my range goes from 1,828 — I think I said 1,328 earlier — up to 30,000. So that's almost a 20-fold difference in the number of reads between our smallest library and our largest library. So I'm going to come to the bottom of my script, and we'll take rand, which again is a tidy data frame. I've done this in the last several episodes, so I'll just repeat it again; we've been getting a lot of exercise with pivot_longer and pivot_wider in these, I've found. We'll do pivot_wider and say names_from = name and values_from = n, and then I need to do values_fill = 0. Why am I doing that? Because the current configuration of my data frame might be missing combinations of group and name, and when pivot_wider takes this tidy data frame and makes it wider, it's going to plug in an NA for those missing values to make it rectangular. I don't want NA values; I want a zero, because the taxon didn't exist in that sample, it wasn't there, so it's zero. And so now, sure enough, I've got my group column and all my other columns. I then need to pipe this out as a data frame, so we'll do as.data.frame, and I'll call this rand_df. I think I actually have a rand_df defined up above, but I'm sure it's annoying for me to refer to variables that you don't remember, so I'll make it here; that's cool. If we look at rand_df, it shows all the columns — a really nasty output. We have this group column that I need to get out and assign to the row names. The reason we're doing this as a data frame is because tibbles don't accept row names. So we'll do rownames(rand_df) and say that is rand_df[,1] — I could also refer to the group column by name, but this serves the purpose. Then I'll do rand_df <- rand_df[-1] to remove that first column. And now if I pipe rand_df to str, and if I do rownames(rand_df), I see all my sample names; they're good. So I've got rand_df.
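If it helps to see it in one place, here's a minimal sketch of that reshaping, assuming rand is a tidy data frame with columns group (sample), name (taxon), and n (count):

```r
# wide format: one row per sample, one column per taxon, zeros for absences
rand_df <- rand %>%
  pivot_wider(names_from = name, values_from = n, values_fill = 0) %>%
  as.data.frame() # tibbles don't accept row names, so drop to a data.frame

rownames(rand_df) <- rand_df$group # move the sample ids into the row names
rand_df <- rand_df[-1]             # drop the group column itself
```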
In the past, I had been passing rand_df into as.matrix. Some of the folks that helped develop vegan piped up and told me I don't actually have to do that — vegdist and avgdist will take it as a data frame. So we'll give that a shot. What we've done is call avgdist on rand_df, with dmethod = "bray" — bray is the default, but let's be explicit about it — and then sample = 1828, I think it was. We'll assign this back, and I'll call it rand_dist_matrix. Again, I have that defined up above, but we're kind of making some messy code here. If I pipe rand_dist_matrix to str, I see that it is a distance matrix and everything looks good there. So I'd like to take rand_dist_matrix and make it a tibble so that we can look at the mean and the standard deviation. To do that, we'll go ahead and make it a matrix with as.matrix, because it's currently a dist object. That outputs a matrix, and then we'll go as_tibble with rownames = "samples". And so now we've got samples as the first column of our data frame. Then we can pivot_longer, with cols being everything but the samples column, and that looks good. And so now we will summarize to get the mean and the standard deviation: mean = mean(value) and sd = sd(value). And so we get a mean and a standard deviation. So that I don't have to create a whole bunch of different objects, I'm going to include this in the pipeline, and instead of calling it rand_dist_matrix, I'm going to call it rand_dist_summary. Then, if I look at rand_dist_summary, I see the same values — I guess the standard deviation is off by a smidge; again, that's the product of a random number generator. Let's go ahead and experiment: let's put the depth up to 5,000 and see what we get. Actually, I'm going to remove that rand_dist_summary assignment, because I don't totally need it. We get a warning message that a whole bunch of samples were removed because they're below the sampling depth, and then we get back a mean and a standard deviation. And sure enough, that mean is lower than what we had before. So I wonder whether, as we increase the sampling depth, we're going to continue to see that happen. I'm going to go ahead and add an n column here so we can count the number of samples we're seeing. And what I'd like to do is turn this into a function. I'll call it run_rarefy_bray, and it'll be a function where we throw in x as the size of the library we want to use, with the avgdist pipeline as the body. And I think this should actually work; it's going to take rand_df from being defined outside of the function. So if I run all that and then call run_rarefy_bray(2000), I'm getting this n of 21,609. That's the number of comparisons, not the number of samples, so I don't think I want that n. Instead, what I'll do is take rand_group_count and filter it for n >= x. And I realize that I have 5,000 hard-coded in here where there should be an x — why didn't you all say something? Anyway, I'm going to make mean_sd the output of the avgdist pipeline, so this will be a table that has my mean and sd for a given x. For now, I'll practice with x being 2,000. Let's run mean_sd again, and that gives me mean_sd with those two values. Then, for my filter, let's go ahead and pipe that to nrow. That'll be 225 samples, and I'll call it n. Then I'll do bind_cols(mean_sd, n = n). So I'll go ahead and run that and bind it together.
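Pulling those pieces together, the function at this point in the episode looks roughly like this — a sketch, with rand_df and rand_group_count defined earlier in the script:

```r
run_rarefy_bray <- function(x) {

  # average Bray-Curtis distances over repeated rarefactions to depth x
  # (avgdist defaults to 100 iterations), then summarize every cell of the
  # matrix — note this includes the zero diagonal and both triangles,
  # which is why we saw 21,609 comparisons above
  mean_sd <- avgdist(rand_df, dmethod = "bray", sample = x) %>%
    as.matrix() %>%
    as_tibble(rownames = "samples") %>%
    pivot_longer(cols = -samples) %>%
    summarize(mean = mean(value), sd = sd(value))

  # number of samples with at least x sequences
  n <- rand_group_count %>%
    filter(n >= x) %>%
    nrow()

  bind_cols(mean_sd, n = n)
}

run_rarefy_bray(2000)
```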
And so now I get this table as output. What I'm going to do is use one of the map functions to iterate over a bunch of thresholds, so that each threshold spits out this row, and then I'll have a data frame with the mean and standard deviation for each; that should be good. So I'll use map_dfr, and we'll go over, let's say, 1,000, 2,000, 3,000, and we'll iterate with the name of the function, run_rarefy_bray. So I need to go ahead and load my function and run my map function. Again, what map_dfr is doing here is taking that function, run_rarefy_bray, and applying it to the values 1,000, 2,000, and 3,000 as the x parameter. It should then output those three results as separate rows in a new data frame. All right, so we get the warning messages that we expect. Our mean drops, but we don't have an indication of how many sequences we're taking from each sample. I can try to rectify that in my map_dfr call by doing .id = "n_seqs". Let's run this — I'm sure it'll work fine. All right, so the output gave 1, 2, and 3 for n_seqs, which are just the positions in the vector 1,000, 2,000, 3,000; that's not really what I want. I'm going to go ahead and remove that .id argument and instead add the number of sequences to my bind_cols. So I'll do n_seqs = x — yeah, that'll work. So I need to go ahead and reload my function and run this again; hopefully it works. Again, I'm doing this with a small number of thresholds, because if I were doing this for 20 or 30 thresholds, it would take much longer, and I'd be doing a lot of iterations of this, and that would get really annoying, and my computer fan would just get louder and louder and louder. Anyway, it's good to test at a small level, make sure things work, and then scale it up bigger and bigger as you go. All right, so that did exactly what I'd hoped it would do; we're looking good. So again, I want a lot of different thresholds, so let's replace this vector. Looking at my histogram, I probably only want to go up to about, I don't know, 15,000 sequences per sample. So let's do seq(1000, 15000, by = 1000). I could go ahead and run this, but it's going to be 15 steps, and that's going to take longer than I really want to sit around waiting for. So what I can do instead is use the furrr package. map_dfr comes from the purrr package — I've got a bit of a cold, I don't know if you can hear that, purrr. furrr allows you to use the future package with these map functions, and what the future package allows you to do is parallelize this. So if I've got 15 different levels and I've got 16 processors on my computer, then it should go pretty quick, right? Because it's going to run one threshold on each of the processors and then pull it all back together. To do that, I need to do library(furrr) — you need to make sure you have it installed. Normally I would put library(furrr) way back at the top of the script, but it's okay here for right now. As we make the script a little bit more mature and clean, that's when I'd move things around and strip out all the redundant code. Then I need to say plan(multisession). What that does is basically tell R where to fire these jobs off, how many processors you have, and kind of get things ready to do that. I'm then going to call the result rarefy_bray_results. Then we need to use future_map_dfr. There's an argument we can add to indicate the progress, so I can do .progress = TRUE.
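A sketch of where the iteration ends up, assuming run_rarefy_bray has been updated so its bind_cols() call also returns n_seqs = x:

```r
library(furrr) # Depends on future, so plan() comes along too

plan(multisession) # set up parallel R workers on this machine

# one threshold per worker; each call returns one row of the result
# (avgdist draws random numbers, so furrr will warn about RNG seeds,
# as discussed below)
rarefy_bray_results <- future_map_dfr(seq(1000, 15000, by = 1000),
                                      run_rarefy_bray,
                                      .progress = TRUE)
```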
And so again, what this is doing is putting a single threshold on each processor. My laptop, I think, has 16 processors, and I'm asking for 15 jobs, so I'll have one extra processor. If I look at the output of top on my computer, I can see multiple instances of R running here, and actually they're falling off one at a time as it works through each of those 15 different jobs. So this is running a lot faster than if I had run them all in series. For some reason, this progress bar doesn't quite work. I see there are 29 errors — warning messages, rather. If I do warnings(), I see that a lot of the warning messages are about samples falling out because they didn't have, say, 5,000 sequences in them. There's also a warning message in here about an unreliable value, because there's a random number generator being used by avgdist. When I've looked this up in the documentation for future_map, from the furrr package, or the future package it depends on, it tells me not to worry about it. So I'm not going to worry about it. Let's go ahead and look at rarefy_bray_results. And we see, sure enough, we have the number of sequences, the mean, the standard deviation, and n, the number of samples in the study that had that many sequences. So let's go ahead and plot this and see what it looks like. We'll take rarefy_bray_results, pipe it to ggplot, and in aes put n_seqs on the x-axis and the mean on the y. Let's go ahead and send that to geom_line. What we see is that, again, as the number of sequences being considered increases, the distance between samples drops. And we can set zero as the bottom of our y-axis by doing coord_cartesian(ylim = c(0, NA)). And so we see it kind of plateaus — the fall-off in those distances isn't so dramatic. I would expect that with infinite sequencing this would go down to zero, because, again, there really is no underlying difference between my 225 or however many different samples. Let's also look at the standard deviation. We saw in the last episode that we can put these three columns into the same column by doing pivot_longer, so I'll do pivot_longer with cols = c(mean, sd, n). And that works, right? And so we can then pipe that into ggplot, with n_seqs on the x-axis and value on the y-axis, and then add facet_wrap(~name, nrow = 3, scales = "free_y"). Again, we can reorder the facets; I might go ahead and do that by using the factor function — we can mutate name to be a factor and set the levels to put n first, then the mean, then the standard deviation. So what these data tell me is that I would, of course, prefer to have more sequences per sample. I mean, who wouldn't want that, right? On some level, though, if I don't have as many sequences, what do I do with that? The way I always think about it is that if I have two samples that should have a distance of zero, but I'm getting 0.15 because I have shallow sequencing — or just not as deep sequencing as I might like — that tells you how big your effect size needs to be. How different do your two treatments need to be for you to detect them as being statistically significant? So that's the way I think of it: if I don't have a lot of sequences, then I need a larger effect size to see the differences.
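For reference, here are the two plots from a moment ago as code — a sketch assuming rarefy_bray_results has columns n_seqs, mean, sd, and n:

```r
# mean distance as a function of rarefaction depth, y-axis anchored at zero
rarefy_bray_results %>%
  ggplot(aes(x = n_seqs, y = mean)) +
  geom_line() +
  coord_cartesian(ylim = c(0, NA))

# all three summaries stacked as facets, ordered n, then mean, then sd
rarefy_bray_results %>%
  pivot_longer(cols = c(mean, sd, n)) %>%
  mutate(name = factor(name, levels = c("n", "mean", "sd"))) %>%
  ggplot(aes(x = n_seqs, y = value)) +
  geom_line() +
  facet_wrap(~name, nrow = 3, scales = "free_y")
```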
Again, there's lots to consider when you're picking that threshold for the number of sequences per sample. I generally don't think about this; I generally think about how to get more samples in, because that's also a factor that affects your statistical power. The more samples you have, the easier it is to see a difference that's actually there. And so again, there are all these trade-offs between the number of samples and the number of sequences per sample: if you increase the number of sequences per sample, you're going to decrease the number of samples. So what do you do? Well, I have ideas, but we're running out of time for today. So stay tuned, because in the next episode I'm going to build upon this type of result to think about what effect rarefying, or not rarefying, has on our propensity to get false positive results — to run a statistical test using something like adonis and accidentally, incorrectly call a difference statistically significant when it's not — and how that risk is mediated by whether we rarefy, use relative abundances, or normalize our data. Be sure that you've subscribed and you click that bell icon so you know when that episode drops. I can't wait to share it with you; it's going to be pretty cool. Also, if you're trying to catch up and some of this is new to you, I encourage you to check out this video I have over here, which will help you get all caught up on what we're doing here in Code Club.