Hey folks, in recent episodes of Code Club, I've been doing a lot of work looking at how different approaches to correcting for uneven sampling effort affect ecological dissimilarity values, things like Bray-Curtis, right? One of the big problems we have is that when we sample communities, say microbial communities like I do, we don't get the same depth of sampling across all of our communities. And what we've seen is that if we don't correct for that, we get really big artifacts that are driven by the number of observations we have in each sample. We've looked at rarefaction, which is a way to grab the same number of observations from each sample, calculate your distance, throw them back in, repeat that 100 or 1,000 times, and then take the average. We also looked at normalization, so that every sample has the same number of observations. And we looked at relative abundance, where you take the number of observations you have for each population (each species, whatever you want to call it), divide that by the total number of observations for the sample, and base your comparisons on that. What we've seen is that rarefaction wins. Not rarefying, normalizing, and using relative abundance are all basically losers, right? If we plot the distance between pairs of samples, samples that are really a statistical sampling of an overall metacommunity, what we find is that the distances between samples change depending on how many observations there were in those two samples. Well, there are two other approaches that have been described recently in the literature, at least for the microbiome field, that I want to spend the next two episodes exploring. I know people are probably getting sick of hearing me talk about this, but even if you don't care about this question, I really hope you keep watching, because I think there's a lot to see about how I take a hypothesis and test it using our code. In today's episode, we're going to build on an observation that was made about microbiome data, and ecological data in general: the compositional nature of the data. When we do a sequencing run, or go out into the field and count different species in a forest, what we see isn't exhaustive, right? It's a sampling. And at least for sequencing, the number of reads I get off the sequencer really doesn't have any basis in reality, because I've done some normalization procedure on the DNA going into the sequencer. So the number of reads I get is not related to the total biomass or the total number of taxa in the sample. What that means is that if one population's abundance goes up, then, because it's all relative, another is going to go down, right? The abundances of those populations are not independent because, again, we're normalizing the amount of DNA going into the sample. And because they're not independent, the data are compositional. The primary approach people use to deal with compositional data is what's called the centered log-ratio, or CLR. You can calculate the centered log-ratio for each of your taxa across all your samples, and then use those transformed data to calculate what's called the Aitchison distance. What that really means is that if you take your CLR-transformed data and calculate Euclidean distances between your samples, you get the Aitchison distance.
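To make that concrete, here's a minimal sketch of that relationship in R. The counts are made up, and the geometric mean is computed per taxon, following the convention described above; Euclidean distances on the transformed values are what's being called the Aitchison distance.

```r
library(vegan)

# Toy count matrix: 2 samples (rows) x 4 taxa (columns); all values hypothetical
counts <- matrix(c(10, 20, 30, 40,
                    5, 25, 35, 35),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("sample_A", "sample_B"),
                                 c("otu1", "otu2", "otu3", "otu4")))

# CLR transform: the log of each count divided by a geometric mean,
# here taken within each taxon
clr <- apply(counts, 2, function(x) log(x / exp(mean(log(x)))))

# Euclidean distances between the CLR-transformed samples give the
# Aitchison distance
vegdist(clr, method = "euclidean")
```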
This approach has had some pretty big proponents in the microbiome field. One of the things I've noticed is that they make the claim that CLR approaches are scale invariant and subcompositionally coherent. What they say this means is that there's no need to rarefy the data to deal with uneven sampling effort. I have my suspicions that that's not quite true, but hey, we have a great tool here in R that we can use to test it. Along the way, we'll run into a couple of problems, we'll see how people deal with those problems, and we'll see how those solutions may or may not affect the downstream results of whether these Aitchison distances are invariant to differences in sampling effort. So I have created a new R script, aitchison.R. If you want this script and all the other scripts and data in this project, down below in the description there's a link to a blog post that will get you the starting coordinates and the GitHub repository for my project. I've also got some instructions up here to help you get going with that. I'll go ahead and generate the shared data frame. shared is a tidy data frame with a column for the group designation, or the sample that each observation came from; name contains the name of the OTU, or taxon; and value is the number of times we saw each OTU in each sample. Then what I did was create a null model called rand. rand is also a tidy data frame with the same structure, except we now have a column called rand_name. What we've done with rand is generate a null model where every sample is basically a statistical sampling of a larger metacommunity described by shared. Basically, we take all the data in shared, pool all the samples together, then sample without replacement from the composite data to regenerate samples that have the same number of observations they had originally. So rand has about 225 different groups, and each of those groups has the same number of observations as the original did. I'm going to take rand and start by calculating the number of observations in each of the samples. I can do rand piped into group_by(group), and then summarize(n_seqs = sum(n)). Now I have the total number of sequences that were observed for each of the samples, and I'll call this group_count. If I take group_count and pass that into range (and I guess I need to do group_count$n_seqs), I see that my range is from 1,828 sequences up to 30,505 sequences. So we have almost 20-fold variation in the total number of sequences between our different samples. Again, that's part of the motivation for this line of inquiry, but know that this range of differences in the number of sequences really isn't that uncommon; we've even seen larger variation. The first thing I want to do is create a data frame that can go into vegdist for the non-rarefied distance calculations, basically our control. I can take rand, which again is a data frame with those three columns, and pivot_wider with names_from = rand_name and values_from = n. We now get our wide table. The input to vegdist, or avgdist (the rarefied version of vegdist), needs to be a data frame, so I can pipe this into as.data.frame(). We now have the data frame, and I'll call it rand_df.
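In code, that sequence of steps looks roughly like this, assuming rand has the group, rand_name, and n columns described above:

```r
library(tidyverse)

# Total number of sequences observed in each sample
group_count <- rand %>%
  group_by(group) %>%
  summarize(n_seqs = sum(n))

range(group_count$n_seqs)
#> [1]  1828 30505

# Wide table: one row per sample, one column per OTU
# (note: missing sample/OTU combinations become NA here; we'll fix that below)
rand_df <- rand %>%
  pivot_wider(names_from = rand_name, values_from = n) %>%
  as.data.frame()
```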
rand_df also needs to have that first group column removed and moved into the row names for use in vegdist. So we'll do rownames(rand_df) <- rand_df$group, and then remove that column with rand_df <- rand_df[, -1]. I'll go ahead and calculate the non-transformed distance matrices, so I'll create norare_dist with vegdist on rand_df and method = "euclidean". I'm getting an error message that there are missing values, NAs, in my data frame. That's because when I took the long data frame and pivoted it wider, there were combinations missing from the long version that were needed to populate the values in the wide version. To fix that, I need to add values_fill = 0, and that clears everything up. I'll also make rare_dist using avgdist on rand_df, with dmethod = "euclidean". I'm using Euclidean here for both the non-rarefied and rarefied runs on the non-transformed data, because when I transform the data with the CLR to get the Aitchison distance, that's going to use Euclidean distances. I don't want to compare Bray-Curtis to Euclidean; in this case, I want to compare Euclidean to Euclidean. I also need to supply a value for sample, which is the size of the smallest sample. I'm going to use min(group_count$n_seqs) to get that 1,828, I think it was; let's double-check. Yeah, 1,828. Great. Now we've got our non-transformed distance matrices stored away, and we'll come back to those by the end of the episode. Now we're ready to talk about doing the CLR transformation. As I mentioned, the CLR transformation requires a geometric mean. You can think of the geometric mean as the product of a bunch of counts raised to one over the number of those counts, right? So if I do 2 times 3 times 4, the geometric mean has me raise that to the one-third power, giving 2.884499. A way that's algorithmically easier to calculate this is to take the log of those values, calculate the mean of that, and then apply the exponential function. In other words, we could do exp(mean(log(c(2, 3, 4)))) to get the same thing, right? This is actually computationally more robust. The problem with the first approach is that when you multiply a bunch of numbers together, you very quickly get a number that's much larger than what your computer can store. If you log the values first, then calculate the mean and exponentiate, the numbers are much better bounded. So I'm going to use this as the basis for a function I'll call gm. I'll say function, give it x as a vector, and plug these values in; instead of c(2, 3, 4), I'll make that x. Now I can do gm(c(2, 3, 4)), and I should get back 2.88. Sure enough, we do. So now we have the tool for calculating the geometric mean within each of our OTUs. Wonderful, that works great. Something to think about, though: what if we had c(2, 3, 4, 0)? Our data is going to have a lot of zeros in it, and so you might wonder what happens to the geometric mean when we have a zero value.
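Putting those pieces together, here's a sketch of the control distance matrices and the gm() function, including the values_fill = 0 fix and what happens when gm() sees a zero:

```r
library(vegan)

# Rebuild the wide table with values_fill = 0 so that missing sample/OTU
# combinations become zeros instead of NAs
rand_df <- rand %>%
  pivot_wider(names_from = rand_name, values_from = n, values_fill = 0) %>%
  as.data.frame()

# vegdist()/avgdist() want the sample names as row names, not a column
rownames(rand_df) <- rand_df$group
rand_df <- rand_df[, -1]

# Non-rarefied Euclidean distances (our control)
norare_dist <- vegdist(rand_df, method = "euclidean")

# Rarefied Euclidean distances, subsampling every sample down to the
# smallest sample size (1,828 sequences here)
rare_dist <- avgdist(rand_df, dmethod = "euclidean",
                     sample = min(group_count$n_seqs))

# Geometric mean: exp(mean(log(x))) stays bounded where
# prod(x)^(1/length(x)) would quickly overflow
gm <- function(x) {
  exp(mean(log(x)))
}

gm(c(2, 3, 4))
#> [1] 2.884499

gm(c(2, 3, 4, 0))  # a single zero drags the whole thing to zero
#> [1] 0
```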
So again, we get zero, because, as you can think about the geometric mean, we're multiplying these four values by each other and raising them to the one-fourth power. If you multiply anything by zero, it becomes zero, and raising that to the one-fourth power is still zero, right? The geometric mean is the denominator when we're calculating the CLR value, so if we take the log of 2 divided by 0, it comes back as infinity, which is no good, right? So we need to make sure we don't have any zeros when we're calculating the geometric mean. Well, one thing we could do is make sure our data doesn't have any zeros coming in, and that's the approach we'll talk about later, imputing the zero values. An alternative approach is what's called the robust CLR, where you basically remove all the zeros before you calculate the geometric mean. To do that, we can index the vector x and only keep the values that are greater than zero, x[x > 0], right? If I load that version and recalculate the geometric mean, I get back 2.88 again. This is an approach that has been used in the microbial ecology literature, again, called the robust CLR. So let's start by doing the robust CLR. I'll take rand (to remind us, rand is that three-column data frame) and group_by(rand_name), right? We'll then pipe that into a mutate and say rclr, for robust CLR, equals the log of n divided by the geometric mean of n. What we see is that we have our rclr column, with negative and positive values. Remember, we're looking at the ratio of the counts divided by the geometric mean of those counts. Counts below the geometric mean give a ratio between zero and one, and if you take the log of a number between zero and one, you get a negative value, right? If it's above, you get a positive value. I'm also seeing that I'm still grouped by rand_name, so I'll pipe this into ungroup() to remove that grouping variable. I'll also remove my n column with select(-n). Then we can pivot_wider to get our data frame: again, names_from will be rand_name, values_from will be rclr, and values_fill will be zero, so if there are any NA values in there, we make those zeros. Then we can pipe this into as.data.frame() as we did above, and I'll call this rclr_df. Then I'll grab those lines from up above where we dealt with the row names and the group column, and instead of rand_df, I'll use rclr_df. Then I can run vegdist on the robust CLR data, just changing norare to rclr. So again, this is the robust CLR, where we basically ignore the zero counts when calculating the geometric mean. Now, people working on gene expression will frequently impute these zero values, and that method has been ported over to the microbiome literature as well. I think there's a big difference, though, that people aren't mindful of between gene expression and microbiome data. With gene expression, they assume that a zero is basically below the limit of detection, but perhaps still there. With microbiome data, because our data are really patchy, it's far more likely that those zeros are absences, true zeros.
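Pulling the robust CLR steps together, here's a sketch. It assumes, as in this walkthrough, that rand only contains the observed, non-zero counts, so the missing sample/OTU combinations become zeros through values_fill:

```r
# Robust CLR: drop the zeros before computing the geometric mean
gm <- function(x) {
  exp(mean(log(x[x > 0])))
}

rclr_df <- rand %>%
  group_by(rand_name) %>%
  mutate(rclr = log(n / gm(n))) %>%  # log-ratio to each OTU's geometric mean
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = rand_name, values_from = rclr, values_fill = 0) %>%
  as.data.frame()

rownames(rclr_df) <- rclr_df$group
rclr_df <- rclr_df[, -1]

rclr_dist <- vegdist(rclr_df, method = "euclidean")
```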
Now, for the data I'm looking at from mice, we serially sampled these mice over a long period of time. If there's a data set that's likely to have zeros that are really below the limit of detection, it's this one. Whereas if we looked at a more cross-sectional study of humans, like the cancer data set we've worked with in the past, I think a lot of those zeros are far more likely to be structural zeros: zeros that really mean the population just wasn't in that sample. To impute the zeros, we're going to come back up to the top and use a library called zCompositions, which, if you haven't installed it already, you'll want to make sure that you do. As I load this library, I notice that a function that's actually really important to me is being masked: the select function is masked when we attach the MASS package. zCompositions uses the MASS package, and MASS has a select function that's used for fitting models, which isn't something I'm really concerned about. I would far rather have the tidyverse version of select than the MASS version. So what I'm going to do is change the order in which I load my libraries: I'm going to take library(tidyverse) and make it third in the list. But to get this to actually work, I need to quit RStudio and start it back up again, so give me a second while I do that. All right, I reopened it, ran everything, and everything looks good. If I scroll back up to the output, I see that when I attached the tidyverse it tells me that dplyr's select masks MASS's select, so we're in good shape. We're going to use that zCompositions package to help us impute values for our zeros, so we don't have to worry about zeros in our geometric mean. To do this imputation, we'll use cmultRepl, and the input data frame is going to be rand_df. It wants the observations, the different taxa, in the columns and our samples in the rows, which we already have, so that's great. There are four different methods, and the default is GBM. We also want output = "p-counts", which gives us counts rather than proportions. When I run this, I get a warning message and an error message: with the GBM method there's not enough information to compute the t hyperparameter, and there are probably columns with less than two positive values. In fact, we have a bunch of columns where only one sample has that OTU, so GBM, the default method, isn't going to work. What we can do instead is use one of the other methods; there are three others, and I'd encourage you to dig into that literature if you want to learn more. I'm going to use the first option, which is CZM. Running that again, I get a warning message but no error: it's warning me that there are columns that contain more than 80% zeros or unobserved values. Regardless of the method I use, that's going to come up. You might be saying, why don't you remove the singletons? Well, go check out my preprint about why we don't want to remove singletons. Now, if I look at zclr_df with str(), I notice something interesting in the output: there are some columns, like Otu00101, that are type int, whole numbers, and there are others that are type num, floating point numbers.
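Here's roughly what that imputation looks like. The episode doesn't spell out how the imputed counts get transformed ("like we had up here"), so the last two lines are my assumption: the same per-OTU log-ratio from above, applied to the now zero-free counts:

```r
library(zCompositions)  # loaded before the tidyverse so dplyr's select() wins

# Impute the zeros in the wide count table; the default GBM method errors
# here because some OTUs have fewer than two positive values, so use CZM.
# output = "p-counts" returns pseudo-counts rather than proportions.
imputed <- cmultRepl(rand_df, method = "CZM", output = "p-counts")

# Assumption: apply the same per-OTU log-ratio used above to the imputed
# counts, which now contain no zeros
zclr_df <- as.data.frame(apply(imputed, 2, function(x) log(x / gm(x))))

zclr_dist <- vegdist(zclr_df, method = "euclidean")
```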
And so you'll see values like 0.148; those are zeros that got imputed to other values, and there's another one here, right? So now we can use this zclr_df as input to vegdist like we had up above, and I'll change the name to zclr. Now we have our four distance matrices: the non-rarefied Euclidean distance, the rarefied Euclidean distance, the robust CLR, and the imputed zCLR. Neither of the CLR matrices is rarefied; both are based on the raw counts. Now I'd like to make a plot where the y axis has the pairwise distance, and the x axis has the difference in the number of sequences between the pair of samples that each point represents. Let me gather up my four distance matrix objects, just so I remember what I called them all, and I'll process them in parallel with each other. So there's norare_dist, and there's also rare_dist, right? I've got these four distance matrices. If I look at norare_dist, it's a dist object; I can convert it with as.matrix(), and then as_tibble() with rownames = "sample". That converts my distance matrix into a tibble, which of course is square, so I need to pivot longer. I'll pipe this into pivot_longer with cols = -sample, and I see that makes my three-column data frame, which is great. One thing I want to be sure to do is remove the upper triangle, so I'll do filter(name < sample). That will also get rid of the self-comparisons, where I had zero distance values. I'll call this norare_dtbl, the non-rarefied distance matrix tibble, and get that loaded. Again, I can grab all this code and copy it for each of the other distance matrices. Not the most DRY approach, but hey, it works for exploratory purposes. I'll change that to rare_dtbl, then rclr_dtbl, then zclr_dtbl, and get all of these loaded. Now I need to join those four data frames together, and I'm going to do that with inner joins. The first inner_join takes norare_dtbl and rare_dtbl, and we'll join by two columns, the sample and name columns, so by = c("sample", "name"). Now we get value.x and value.y; we'll clean those up in a moment, but the first is the non-rarefied value and the second is the rarefied one. Then we'll do another inner_join with the data coming through the pipeline and rclr_dtbl, again joining by sample and name, and we'll repeat that with zclr_dtbl. I'll remove that trailing pipe for the time being. Great, so now we have our four value columns, in the order we joined them together. I also want to do two more joins to bring in the counts. You'll recall from way back up above that we had group_count, which has the group and n_seqs columns. So we'll do an inner_join of the data coming through the pipeline with group_count, joining with by = c("sample" = "group"). Good, we've got that column. Then we want to repeat this for the name column, so instead of sample equals group, we'll do by = c("name" = "group"). Now we have all these columns, and we can do a mutate to get the differences: mutate(diffs = n_seqs.x - n_seqs.y). Good, and we want to wrap that in abs() to make it the absolute value. Right. Good.
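Here's a sketch of the conversion and the joins. One deviation from the episode: rather than sorting out the value.x/value.y suffixes with a rename at the end, this version renames each value column as it's joined in:

```r
# Convert the dist object into a long, three-column tibble
norare_dtbl <- as.matrix(norare_dist) %>%
  as_tibble(rownames = "sample") %>%
  pivot_longer(cols = -sample) %>%
  filter(name < sample)  # keep one triangle; also drops self-comparisons

# ...the same three steps are copied for rare_dist, rclr_dist, and
# zclr_dist to make rare_dtbl, rclr_dtbl, and zclr_dtbl

# Join the four tibbles, bring in each sample's sequence count, and
# compute the difference in sampling depth for each pair of samples
combined <- norare_dtbl %>%
  rename(norare = value) %>%
  inner_join(rename(rare_dtbl, rare = value), by = c("sample", "name")) %>%
  inner_join(rename(rclr_dtbl, rclr = value), by = c("sample", "name")) %>%
  inner_join(rename(zclr_dtbl, zclr = value), by = c("sample", "name")) %>%
  inner_join(group_count, by = c("sample" = "group")) %>%
  inner_join(group_count, by = c("name" = "group")) %>%
  mutate(diffs = abs(n_seqs.x - n_seqs.y))
```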
Okay, now we want to clean everything up with a select, and we'll rename the columns as we go: select(sample, name, ...) with norare = value.x and rare = value.y (again, value.x is from the first data frame and value.y from the second, right?), and then rclr = value.x.x and zclr = value.y.y. And we also want diffs. So now this is our composite data frame. Next I want to pivot longer, so that I can have each of the four methods in a stack and make a faceted plot with four panels, right? We'll pipe this into another pivot_longer, and we'll specify the columns to pivot with the cols argument: norare, rare, rclr, and zclr. Now I'm getting an error that I've got name in two locations: the default value for names_to is "name", and I've already got a name column. I'll clean that up with names_to = "method" and values_to = "dist". I'll pipe this into ggplot with aes(); on my x axis I want diffs, and on my y axis I want dist. I'll make this a geom_point, and we'll facet_wrap(~method). All right, by default it makes two rows and two columns, but I think we can already see that the norare panel is a bit out of control, as is the zclr. So let's make this nrow = 4, so we have a stack of them, and we'll do scales = "free_y", which frees up the y axis to vary with the data in each panel. What we see is that norare and zclr seem pretty much unbounded; it doesn't really look like there's any difference between those two approaches, which is a bit troubling, right? The rclr also seems to have a dependency on the difference in the number of sequences between samples. The rare panel seems pretty steady at a value of about 60, flat like we saw with Bray-Curtis. I'll go ahead and add a geom_smooth to see if that variation is what we think it is; geom_smooth lets us see the trends in the data. These are pretty obvious for most of the panels. For rare, I just wanted to double-check that it really is a pretty flat line, and I'm pretty happy with that. Again, I do not see Aitchison distances based on the centered log-ratio data improving anything. I do not see the scale invariance that other papers have talked about. There is definitely a scale dependence: as you increase the difference in the number of sequences between the samples you compare, the distance goes up. That's troubling because, as we saw in the last episode, if there's confounding between the sampling depth and the treatment group, you will artificially see differences in your data. It should also decrease your statistical power, right? If you're artificially increasing the variation in your data because you have differences in the number of sequences, that's going to make it harder to see real differences. Again, I don't see anything here that sways me from rarefying when calculating my Bray-Curtis distances. Sure, the data are compositional, but correcting for that compositionality doesn't get you away from the problem of differences in sampling effort. Perhaps you could go back and rarefy your data, then do the CLR and calculate the Aitchison distance, but in the end, I don't know what that really gets you, and no one's really proposed it.
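For reference, here's the whole plotting pipeline from this section in one place, building on the combined sketch above (where the value columns were already renamed):

```r
combined %>%
  select(sample, name, norare, rare, rclr, zclr, diffs) %>%
  pivot_longer(cols = c(norare, rare, rclr, zclr),
               names_to = "method", values_to = "dist") %>%
  ggplot(aes(x = diffs, y = dist)) +
  geom_point() +
  geom_smooth() +   # shows the trend in each panel
  facet_wrap(~method, nrow = 4, scales = "free_y")
```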
Again, the motivation for me talking about this today was the claim that we would not need to rarefy our data if we used CLR-based compositional approaches to get something like an Aitchison distance. I just don't see it. Maybe I'm doing something wrong, which is entirely possible, and I invite you to let me know. I also invite you to take what I'm doing here and apply it to your own data; see if you find the same types of trends that I'm finding. I would be delighted to hear about that. I've gotten a little bit of feedback from people, and I'm really grateful for it. Anyway, play around with these methods. I hope you find it useful to see these different functions out in the field; I mean, we did five or six different inner joins here, plus select and all these great functions, over and over again in different contexts. I really hope it's helping you to see how I go through a problem, use these functions in all sorts of different contexts, and troubleshoot things along the way. Even if you're not interested in this general controversy, I think there's a lot for you to learn here about programming in R. We'll see you next time for another episode of Code Club.