Hey folks, if you've been watching recent episodes of Code Club here, you know that I've been looking at different ways of reading in and working with distance matrices. Now the distance matrices I've been working with were previously generated in a software package that my lab developed called mothur. It's used for analyzing 16S rRNA gene sequence data. Now if you don't analyze that type of data, that's fine. You don't have to use mothur. Even if you generate that type of data, you don't have to use mothur. That's fine. I'm cool with it. I'm mature. I can handle it. So you might be wondering, well, how do I get a distance matrix, Pat? Well, in today's episode, I'm going to show you how we can use tools from the vegan R package to generate our own distance matrix. What we need as input is some type of file or data that indicates the number of times we see each taxon in all of our different samples. Again, I will be taking a file that we generated in mothur, but know that you can get that kind of count file from other tools or from your own observations and then bring it into R. So I'll show you how we'll do that. I'll show you how we can generate a distance matrix using ecological metrics, again, from the vegan package. I'll even show you how we can rarefy the data so that we are controlling for uneven sampling effort. And then I'll show you how we can take a shared file, or that count file, or a distance matrix directly into an ordination tool like NMDS to generate an ordination. I'm over in RStudio now and I've got my blank R script ready to go. I'm calling it distances.R. Something that you should know is that if you'd like to follow along, you can go down below into the description and there's a link to a blog post. Something important that I added to the data directory is a new file called mice.shared. You'll want to go ahead and get a copy of that because I'll be working with it in today's episode.
Something that's important to recognize in that file is that not all of the samples have the same number of sequences. So in essence, the sampling effort varies from sample to sample. And we'll talk about that a little bit more as we go along in today's episode. I'm going to start out by doing library(tidyverse) so we can get all the goodness of the tidyverse loaded and at our disposal. I will do a read_tsv on that shared file. So we'll do "data/mice.shared". And this is a really wide data frame. It's got 360 rows and 2067 columns. So there are a couple of things that we need to do with this data before we're ready to calculate ecological distances on it. First of all, this data frame has an uneven number of reads or observations per sample, and so I want to remove those samples that don't have a minimum threshold number of sequences. I'll show you how we'll do that. The next thing is that I also have data in here that aren't within the early and late periods that I've been talking about in recent episodes. So for example, here's day 11. I really want samples from days 0 through 9 and 141 through 150. So this day 125 also needs to go. And then once we've removed those samples that don't have enough reads or that aren't from the time period we want, there are probably going to be OTUs or taxa that don't have any observations, so I'll want to remove those as well. All right, let's get started. So I will do a select on group and anything that starts with "Otu". And I will make a column with mutate for the day. So I'll say day equals str_replace on the group column. And we've talked about this in previous episodes just a few weeks ago. The pattern that we will look for is any characters up to the D for the day, and then we'll replace that with nothing. And it's complaining because I seem to have put a colon at the end of that. So I'll go ahead and remove that colon, rerun it, good.
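Reconstructed from the narration, those first steps might look something like this. The file path and the "Otu" column prefix come from the episode; the exact regular expression is an assumption based on sample names like F3D0, where everything up to the "D" is the mouse ID:

```r
library(tidyverse)

# read the tab-separated shared file (360 rows x 2067 columns)
shared_raw <- read_tsv("data/mice.shared")

shared_raw %>%
  # keep the sample name and the OTU count columns
  select(group, starts_with("Otu")) %>%
  # pull the day out of sample names like "F3D0": strip everything up to the D
  mutate(day = str_replace(group, ".*D", ""))
```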
And I can't see the day column because it's the last of all these columns. So if I want to see day, I can do a select to reorganize everything, right? I could select group, day, and then everything(). everything() will give me all the other columns. And so now I can see group, day, and so forth. I'm going to create a vector of the days I want. So I'll say days_wanted is 0:9, so days 0 through 9, and 141 through 150. And then I will do a filter on day %in% days_wanted. And now I can see I no longer have that day 11 or that day 125. The next thing I want to do is to take my very wide data frame and pivot it longer so I can count the total number of sequences in each sample. To do that, I'm going to take this select and remove it, put in that filter, and then do another select where I remove the day column because I no longer need that information. I can then do pivot_longer on everything but the group column. By default, it will create two columns for me, one named name and one named value. In previous videos, I think I put in actual names, but who needs that? So now I've got this table with group, name, and value. And to count the total number of sequences in each group, I'm going to do a group_by on group, and then I'll do summarize with total equals sum of value. This gives me a lot of output; I've got 237 rows here. I'm going to go ahead and arrange those in ascending order by total. And I think I'll also say print with n equals 20 to get the first 20 rows of the output. What I can see is that I've got one sample with one sequence that clearly needs to go, one with 15 that needs to go, and looking through the counts, I'm looking for where we get to some critical mass of sequences. A lot can be said about how you pick this threshold, but I think I'm going to go with 1800 or 1828; I'll pick 1800 as my threshold.
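Putting those steps together, the pipeline so far looks roughly like the sketch below. It assumes the raw tibble from read_tsv is stored as shared_raw (a name I've made up for illustration):

```r
# days to keep: the early (0-9) and late (141-150) time points
days_wanted <- c(0:9, 141:150)

shared_raw %>%
  select(group, starts_with("Otu")) %>%
  mutate(day = str_replace(group, ".*D", "")) %>%
  filter(day %in% days_wanted) %>%
  select(-day) %>%
  # long format: default column names are "name" (the OTU) and "value" (the count)
  pivot_longer(-group) %>%
  # total number of sequences per sample
  group_by(group) %>%
  summarize(total = sum(value)) %>%
  arrange(total) %>%
  print(n = 20)
```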
And so if a sample didn't have 1800 sequences, I'm going to toss it. That will result in me losing 10 samples. That's not a whole lot of samples. If I wanted to, I could go back and re-sequence those samples to get more reads, but we're in pretty good shape for our purposes today. So I'm going to filter out anything that has fewer than 1800 sequences per sample. How do I do that? Well, I'm going to change this summarize into a mutate. And if I look at the output, what I'll notice is that now I have a column for the total, and my data is still grouped by the group column. So that's the total for F3D0. I can then pipe this into a filter where I say filter out anything where total is less than 1800. Oops, I want things with more than 1800, because this first one, F5D145, has 234, and that's less than 1800. So, more than 1800. There we go. We've now got F3D0 with its 6229 sequences, looking good. Now that I've removed samples that didn't correspond to the time range I want or that had too few sequences, it's possible that I now have taxa, these OTUs, that don't have any observations. And while they're not causing any problems, they're making the data set a bit larger than it really needs to be. So I'd like to go ahead and remove any OTUs where their total is zero. How do we do that? Well, it's going to be very similar to what we just did on the group column. Instead of group, I will now group by the name column, the name of our OTU. Then we'll do a summarize with total equals sum of value. Again, this is telling me how many sequences I have for each OTU. And I'm curious whether any of these total values equal zero. Sure enough, there are about 475 OTUs that don't have any counts. So again, instead of summarize, I can do a mutate, and then a filter where total is not equal to zero. Now I see that I have 1588 OTUs. So it's a pared-down data set. Then I'll do ungroup.
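The mutate-then-filter trick described above, applied first by sample and then by OTU, could be sketched like this. Again, shared_raw and days_wanted are placeholder names carried over from the earlier steps:

```r
shared_long <- shared_raw %>%
  select(group, starts_with("Otu")) %>%
  mutate(day = str_replace(group, ".*D", "")) %>%
  filter(day %in% days_wanted) %>%
  select(-day) %>%
  pivot_longer(-group) %>%
  # per-sample totals; mutate (not summarize) keeps one row per sample/OTU pair
  group_by(group) %>%
  mutate(total = sum(value)) %>%
  filter(total > 1800) %>%
  # per-OTU totals; drop OTUs with no observations left
  group_by(name) %>%
  mutate(total = sum(value)) %>%
  filter(total != 0) %>%
  ungroup()
```

The key design point is that mutate inside a group_by attaches the group total to every row, so filtering removes whole samples (or whole OTUs) at once without collapsing the table the way summarize would.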
And I can then also do a select to remove the total column because I don't need that anymore. So I'll do select minus total. And we're back to our three-column data frame. I can make this wider by doing pivot_wider. I'll use the group column as the ID column, so that's the column that's going to stay. We'll then take our name column to make the column names, and the value column to make the values in the cells. And so now we're left with a 227-row data frame with 1589 columns. That's the same number of samples that we saw previously when we were working with the distance matrix that we got out of mothur. My data is currently a tibble, as you see, and a tibble will not take row names. So I need to convert this to a data frame, put the group IDs in the row names, and remove the group column. So I will send this to as.data.frame, and I think I'll go ahead and save this whole pipeline as shared. If we look at shared, we see that we now have a data frame with a whole bunch of columns. I can then say rownames(shared) equals shared$group, and then I can do shared equals shared[-1] to remove that group column. I can now turn shared into a matrix, which is what the input needs to be for the functions from vegan. So I do shared equals as.matrix on shared. Hopefully that's a bit of a review of how we can use tidyverse tools, dplyr and whatnot, to read in a file, in my case a shared file generated by mothur, and get it into a matrix where the rows are my samples and my columns are the different taxa, or OTUs, operational taxonomic units, in my case. The values are the counts, the number of times we saw each OTU in each of the samples. Again, the output here is a matrix. A matrix is a two-dimensional vector where every value in the matrix is of the same type. In my case, they're all numeric.
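Those last conversion steps might look something like this, assuming the filtered long-format tibble from the previous steps is stored as shared_long (a name I've chosen for illustration):

```r
shared <- shared_long %>%
  select(-total) %>%
  # back to wide format: one row per sample, one column per OTU
  pivot_wider(id_cols = group, names_from = name, values_from = value) %>%
  as.data.frame()

# tibbles can't carry row names, so move the group IDs into the row names
rownames(shared) <- shared$group
shared <- shared[-1]

# vegan expects a numeric matrix: rows = samples, columns = OTUs
shared <- as.matrix(shared)
```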
So now I'm going to show you how we can use this shared data to calculate distances using tools from vegan. Before we can do that, of course, we need to do library(vegan) to get that loaded. If you don't have vegan installed, you'll for sure need to do that before you run library(vegan). The first thing we'll see is that we can use vegdist on shared to calculate a distance matrix. To set the type of distance you want to calculate, we can do method equals "bray". Running that, we get out a lower-triangle distance matrix, as we've seen in so many previous episodes. I'm going to go ahead and save this as dist. I can then feed dist into metaMDS to do an NMDS. We could of course do a PCoA, but let's run with NMDS. So I'll feed it dist and assign the result to the variable nmds. Now, because NMDS uses a random number generator, I'm going to go ahead and set the seed so it's consistent run after run. We'll do set.seed, and I will put in my birthday. You can put in whatever you want, but if you use my birthday, you should get the same result that I do. So we'll do 19760620. We can then do scores on nmds to get the positions. But this is being stored as a matrix, and to plot it with ggplot, I need to make it a tibble or a data frame. So I will pipe this into as_tibble, and I'll say rownames equals "group", so the values in the row names go into a group column. Then I can pipe this into ggplot with aes(x = MDS1, y = MDS2), plus geom_point. Hopefully you'll trust me that the points above zero on the y-axis are from the later time points, and the points below zero on the y-axis are from the earlier time points, days 0 through 9. And again, up here is days 141 to 150. So what we did was generate a distance matrix and then feed that into metaMDS.
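As a sketch, the distance-then-ordination workflow described above could be written like this. One caveat: depending on your version of vegan, the axis columns that scores() returns may be named MDS1/MDS2 or NMDS1/NMDS2, so adjust the aes() call to match what you see:

```r
library(vegan)

# Bray-Curtis distances between all pairs of samples
dist <- vegdist(shared, method = "bray")

# NMDS uses random starts, so set a seed for reproducibility
set.seed(19760620)
nmds <- metaMDS(dist)

# scores() returns a matrix; convert to a tibble for ggplot,
# moving the sample IDs from the row names into a "group" column
scores(nmds) %>%
  as_tibble(rownames = "group") %>%
  ggplot(aes(x = NMDS1, y = NMDS2)) +
  geom_point()
```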
An alternative approach would be to take that shared data and put it directly into metaMDS, and have metaMDS calculate the distances for us. We could do nmds equals metaMDS and give it shared. The output of putting the shared data directly into metaMDS has quite a different look to the figure, as you can tell. So this was generating the distances and then bringing them into metaMDS; this is doing it directly. And if you look at the output from metaMDS, you'll see that there are a couple of different transformations and standardizations that are performed. I'd like to turn off the square-root transformation so we can get the same appearance. We can do that by setting autotransform equals FALSE. This now looks a lot like what we had previously when we calculated the distances using vegdist and brought them into metaMDS. However, something you'll notice is that this ordination looks quite a bit different from what we had in the previous episode, where we brought in a distance matrix from mothur and then ran the NMDS part here within R. Why is that? Well, in mothur, we controlled for the uneven sampling effort by drawing the number of sequences in the smallest sample, 1800-something, from all of the samples, calculating a Bray-Curtis distance matrix, putting the sequences back, and then repeating that 100 or 1000 times. So we rarefied to control for the uneven sampling effort. Bray-Curtis is highly sensitive to uneven sampling effort. If you look at the formula, you'll see that the numerator is really driven by the difference in counts. And so if you have a difference in the total overall counts for two samples, you're going to get some weird stuff happening. What we need to do is control for uneven sampling effort by rarefying the data. We obviously did that previously in mothur, but we can also do that here in R.
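The direct approach, with the automatic transformation disabled, would look something like this (same assumed shared matrix and seed as before):

```r
# feed the count matrix straight into metaMDS; it computes the
# distances itself, but by default first applies a square-root
# transformation and standardization that change the ordination
set.seed(19760620)
nmds_direct <- metaMDS(shared, autotransform = FALSE)
```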
And so a former postdoc in my lab, Jeff Hannigan, who now works at Merck, contributed a feature to vegan when he was in my lab so that we can go ahead and add rarefaction to our distance calculation. We can do that with avgdist. We need to alter the arguments a little bit: we'll do dmethod equals "bray", and then I'll say sample equals 1800. I'm going to go ahead and remove this line 31 where I put the shared data into metaMDS directly. You could also take these same arguments and put them into metaMDS. However, what I find is that this step can take a while. By default, it does a hundred iterations, and that can take a little bit of time. So I'd prefer to separate the distance calculation from the visualization. I'll go ahead and remove that line 31 and then run these lines. This illustrates in part why I like to separate the distance calculation from the NMDS: I set a seed, and avgdist does use a random number generator, as does metaMDS, but the metaMDS does not converge. So I need to set another seed for the NMDS part. Maybe I'll use the same seed and see if it converges now. That didn't converge either. Let's go ahead and set the seed to one that does reach convergence. Now I can visualize the data. And so now you can see what we had in the last episode, where we have this cloud of points from days 141 to 150, and this cloud of points to the right of zero on the x-axis corresponding to days 0 through 9. Great. If you wanted to increase the number of iterations, you could do iterations equals whatever you want, say a thousand. The default is a hundred, so running a thousand iterations means it'll take 10 times longer. I'm going to hold off on that and let you use your computer to do it if you're interested in seeing whether there's a difference. There's really not much of a difference, and typically a hundred iterations will leave you pretty good to go.
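The rarefied workflow could be sketched like this. The seeds are illustrative: the first matches the episode, and the second stands in for whichever seed lets the NMDS converge on your data:

```r
# rarefy each sample to 1800 sequences, compute Bray-Curtis on the
# subsampled counts, and average over the iterations (100 by default;
# increase with iterations = 1000 at roughly 10x the run time)
set.seed(19760620)
dist <- avgdist(shared, sample = 1800, dmethod = "bray")

# metaMDS has its own random starts, so it needs its own seed;
# 1 here is a hypothetical choice, not necessarily one that converges
set.seed(1)
nmds <- metaMDS(dist)
```

Separating the two steps like this means the slow avgdist call runs once, and you can rerun the comparatively quick metaMDS with different seeds until it converges.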
If you look at the documentation for vegdist, you'll find all the different methods that you can use for calculating beta diversity. Those methods can then be used in avgdist to get your rarefied distance matrix. Now, I know that for some reason rarefying beta diversity metrics is a bit controversial. The reality is that our data do not have a standard number of sequences per sample. They vary wildly, sometimes by two orders of magnitude. And the ecological literature really does emphasize that many of these metrics are just really sensitive to uneven sampling effort. The tool that we have to control for uneven sampling effort is rarefaction. So use rarefaction. Again, rarefaction is the idea that we're going to take the same number of sequences out of every sample and calculate a metric; then we'll put the data back, sample those 1800 sequences again, calculate the metric, and repeat that a hundred or a thousand or however many times you want, and then report back an average distance matrix, in this case, that controls for the uneven sampling effort. And so again, by taking 1800 sequences from every sample, we're controlling for that uneven sampling effort. So use these functions with your own data and see if you can get the workflow to work. See if you can read in the data, get a distance matrix from rarefying the data, feed it into an NMDS, or see if you can take it over to a PCoA. I really look forward to exploring other elements of the vegan package with you here in future episodes of Code Club. Keep practicing and we'll see you next time.